slice C · concurrency under load

does the engine scale with concurrent clients?

Same scale as slice A's smallest point (N = 100K), same effort knob, now with a sweep of 1 / 4 / 16 / 64 concurrent clients. Where does throughput saturate? How does p99 latency degrade as the queue depth grows? Each engine is single-process here; this is the single-machine ceiling, not a cluster benchmark.

data provenance

Where the numbers come from. Same source, same generator, same ground truth for every engine in the comparison.

corpus

Simple English Wikipedia, passages ≥ 500 chars truncated to ~400 chars. Public dump preprocessed once and frozen for the bench.

embedder

mxbai-embed-large-v1, 1024 dimensions. sentence-transformers via Apple Metal (MPS). Same model used to embed corpus and queries (no cross-model leakage).

queries

1 000 hold-out passages (same set as slice A), each issued concurrently by 1, 4, 16, or 64 simulated clients.
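A minimal sketch of how a sweep like this could be driven, assuming a closed loop in which the clients share one list of queries and each query is issued once; the source does not say how the real harness splits the work. `run_closed_loop` and `fake_search` are hypothetical names, and `fake_search` only sleeps in place of an engine call.

```python
import statistics
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_closed_loop(search, queries, concurrency):
    """Issue every query once from `concurrency` worker threads; return stats."""
    latencies_us = []
    lock = threading.Lock()

    def one_call(query):
        # Per-request latency as seen by the simulated client.
        start = time.perf_counter()
        search(query)
        elapsed_us = (time.perf_counter() - start) * 1e6
        with lock:
            latencies_us.append(elapsed_us)

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(one_call, queries))
    wall_s = time.perf_counter() - wall_start

    # 99 cut points; index 49 is the median, index 98 the 99th percentile.
    cuts = statistics.quantiles(latencies_us, n=100, method="inclusive")
    return {
        "concurrency": concurrency,
        "p50_us": cuts[49],
        "p99_us": cuts[98],
        "qps": len(queries) / wall_s,
    }


def fake_search(query):
    """Hypothetical stand-in for an engine call (assumption: ~1 ms of work)."""
    time.sleep(0.001)
    return []


if __name__ == "__main__":
    for c in (1, 4, 16, 64):
        print(run_closed_loop(fake_search, list(range(200)), c))
```

Each worker thread issues its next query only after the previous one returns, so at most `concurrency` requests are in flight at any moment.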

ground truth

Top-100 nearest neighbours computed with exact brute-force cosine over float32 vectors. Computed once per scale, reused by every engine, frozen as a parquet next to the corpus.
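The exact search described above can be sketched in a few lines. This is an illustration in pure Python on toy 2-dimensional vectors, not the bench's own code; at 100K × 1024 dimensions a vectorised library would be the practical choice. `exact_top_k` is a hypothetical name, and the sketch assumes no zero-length vectors.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def exact_top_k(query, corpus, k):
    """Return the ids of the k corpus vectors with the highest cosine similarity."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(doc))), doc_id)
        for doc_id, doc in enumerate(corpus)
    )
    # nlargest scans every vector: exhaustive, so the result is exact.
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(exact_top_k([1.0, 0.1], corpus, k=2))  # -> [0, 2]
```

Because every engine is scored against the same frozen result, differences in recall reflect the engine and not the reference.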

throughput as clients pile on

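QPS as concurrency grows from 1 to 64. A plateau marks the single-machine ceiling for that engine · adding more clients past that point only deepens the queue. Where each engine's line flattens tells you how much concurrency the single shard can usefully absorb. In the table below, the chroma and skeg lines flatten, while the three qdrant variants peak at 4 clients and fall to less than half of that peak at 64.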
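One way to read the saturation point off the numbers rather than the chart is to look for the concurrency at which QPS peaks and how much of that peak survives at 64 clients. The QPS values are copied from the table in this section; `summarize` is a hypothetical helper.

```python
# QPS per engine and concurrency, copied from the "all numbers" table.
QPS = {
    "chroma-hnsw": {1: 245, 4: 319, 16: 301, 64: 296},
    "qdrant-hnsw": {1: 358, 4: 623, 16: 536, 64: 261},
    "qdrant-pq": {1: 370, 4: 656, 16: 511, 64: 268},
    "qdrant-sq": {1: 448, 4: 696, 16: 528, 64: 277},
    "skeg-int8": {1: 601, 4: 644, 16: 658, 64: 642},
    "skeg-pq128": {1: 584, 4: 624, 16: 588, 64: 640},
}


def summarize(series):
    """Return (concurrency at peak QPS, peak QPS, QPS at max concurrency / peak)."""
    peak_c = max(series, key=series.get)
    peak = series[peak_c]
    last_c = max(series)
    return peak_c, peak, series[last_c] / peak


for engine, series in QPS.items():
    peak_c, peak, retained = summarize(series)
    print(f"{engine:12s} peak {peak} qps at c={peak_c}, {retained:.0%} retained at c=64")
```

A retained fraction near 100% corresponds to a flat plateau; a much lower one means throughput falls once the engine is oversubscribed.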

p99 latency under contention

Log scale. Past the saturation point above, p99 grows roughly linearly with concurrency (more clients = longer queue). The absolute number depends on protocol overhead; what matters is the shape · a sharp inflection would mean contention on a shared resource, not orderly queueing.
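The "more clients = longer queue" reading can be checked with Little's law: in a closed loop with no think time, mean latency is roughly concurrency divided by throughput. The sketch below compares that implied latency with the tabulated p50 for two engines. The rows are copied from the table in this section; using p50 as a stand-in for the mean is an assumption, and `littles_law_latency_us` is a hypothetical name.

```python
# (engine, concurrency, p50_us, qps) copied from the "all numbers" table.
ROWS = [
    ("skeg-int8", 1, 1651, 601),
    ("skeg-int8", 4, 6169, 644),
    ("skeg-int8", 16, 23885, 658),
    ("skeg-int8", 64, 93924, 642),
    ("qdrant-hnsw", 1, 2740, 358),
    ("qdrant-hnsw", 4, 5595, 623),
    ("qdrant-hnsw", 16, 18910, 536),
    ("qdrant-hnsw", 64, 25075, 261),
]


def littles_law_latency_us(concurrency, qps):
    """Mean latency implied by Little's law for a closed loop with no think time."""
    return concurrency / qps * 1_000_000


for engine, c, p50_us, qps in ROWS:
    implied = littles_law_latency_us(c, qps)
    ratio = p50_us / implied
    print(f"{engine:12s} c={c:<3d} implied={implied:9.0f} us  p50={p50_us:7d} us  ratio={ratio:.2f}")
```

For skeg-int8 at 64 clients the implied latency is 64 / 642 s, about 99.7 ms, close to the tabulated p50 of 93.9 ms. For qdrant-hnsw at 64 clients it is 64 / 261 s, about 245 ms, against a p50 of 25.1 ms. A ratio that far from 1 is a prompt to look at the latency tail or at how latency and QPS were measured; on its own it does not say which.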

recall stays put

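Recall@10 across concurrency levels. A flat line is the desired outcome: throughput pressure should not bend the recall curve. All engines deliver on this · recall moves by about 0.013 at most across levels (qdrant-pq, 0.8273 to 0.8405) and by about 0.005 or less for the others. The search settings are the same at every level; the only thing changing is how fast results come out.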
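For reference, a sketch of how recall@k is usually computed against a ground-truth list, plus a check of how flat the tabulated recall is per engine. The recall values are copied from the table in this section; `recall_at_k` and `spread` are hypothetical helpers, not the harness's functions.

```python
def recall_at_k(retrieved, truth, k=10):
    """Fraction of the true top-k ids that appear in the engine's top-k."""
    true_top = set(truth[:k])
    return len(true_top.intersection(retrieved[:k])) / len(true_top)


# recall@10 at concurrency 1, 4, 16, 64, copied from the "all numbers" table.
RECALL = {
    "chroma-hnsw": [0.9883, 0.9875, 0.9888, 0.9880],
    "qdrant-hnsw": [0.9963, 0.9960, 0.9912, 0.9940],
    "qdrant-pq": [0.8380, 0.8405, 0.8273, 0.8402],
    "qdrant-sq": [0.9657, 0.9640, 0.9660, 0.9647],
    "skeg-int8": [0.9998, 0.9998, 0.9995, 1.0000],
    "skeg-pq128": [0.9995, 0.9995, 0.9998, 0.9998],
}


def spread(values):
    """Gap between the best and worst recall across concurrency levels."""
    return max(values) - min(values)


for engine, values in RECALL.items():
    print(f"{engine:12s} spread={spread(values):.4f}")
```

A small spread is what "recall stays put" means in numbers: the set of neighbours returned barely changes while latency and throughput move.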

all numbers

engine	scale	knob	value	concurrency	recall@10	p50 µs	p99 µs	qps	rss MiB
chroma-hnsw	100k	ef	128	1	0.9883	3989	5227	245	848.0
chroma-hnsw	100k	ef	128	4	0.9875	12161	14185	319	856.8
chroma-hnsw	100k	ef	128	16	0.9888	48206	54322	301	763.3
chroma-hnsw	100k	ef	128	64	0.9880	103159	195039	296	706.3
qdrant-hnsw	100k	ef	128	1	0.9963	2740	3186	358	908.5
qdrant-hnsw	100k	ef	128	4	0.9960	5595	9653	623	897.9
qdrant-hnsw	100k	ef	128	16	0.9912	18910	41414	536	776.9
qdrant-hnsw	100k	ef	128	64	0.9940	25075	143002	261	906.8
qdrant-pq	100k	ef	128	1	0.8380	2521	3340	370	646.2
qdrant-pq	100k	ef	128	4	0.8405	5215	10574	656	651.0
qdrant-pq	100k	ef	128	16	0.8273	19220	40293	511	676.4
qdrant-pq	100k	ef	128	64	0.8402	20882	68766	268	662.8
qdrant-sq	100k	ef	128	1	0.9657	2178	2727	448	1020.9
qdrant-sq	100k	ef	128	4	0.9640	4994	10014	696	1014.0
qdrant-sq	100k	ef	128	16	0.9660	17284	42184	528	1006.6
qdrant-sq	100k	ef	128	64	0.9647	15742	56861	277	1029.3
skeg-int8	100k	l_search	300	1	0.9998	1651	2307	601	130.8
skeg-int8	100k	l_search	300	4	0.9998	6169	7497	644	131.1
skeg-int8	100k	l_search	300	16	0.9995	23885	26145	658	132.0
skeg-int8	100k	l_search	300	64	1.0000	93924	103949	642	133.8
skeg-pq128	100k	l_search	300	1	0.9995	1681	2378	584	59.0
skeg-pq128	100k	l_search	300	4	0.9995	6309	7542	624	55.2
skeg-pq128	100k	l_search	300	16	0.9998	26462	33311	588	57.7
skeg-pq128	100k	l_search	300	64	0.9998	96594	100199	640	58.5

methodology in one minute

Scale: 100K vectors fixed, single corpus from slice A.
Concurrency: 1 / 4 / 16 / 64 simultaneous clients.
Effort knob: each engine at its default for this scale (skeg l_search=300, qdrant ef=128, chroma ef=128).
Repetitions: 2 per (engine, concurrency); median tabulated.
Note: the first run hit a file-descriptor limit on chroma at c=64. It was re-run with ulimit -n 65536 and the harness's resume mode.
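If the process that runs out of descriptors is a Python one, the same limit can be raised from inside it before the sweep starts. This is a sketch using the standard `resource` module (Unix only). Whether the bench harness is written in Python, and whether the limit was hit on the client or the server side, are not stated in the source. `raise_fd_limit` is a hypothetical name.

```python
import resource


def raise_fd_limit(wanted=65536):
    """Raise the soft RLIMIT_NOFILE toward `wanted`, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= wanted:
        return soft  # already high enough, nothing to do
    target = wanted if hard == unlimited else min(wanted, hard)
    if target > soft:
        # Only the soft limit changes; raising the hard limit needs privileges.
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print("soft fd limit:", raise_fd_limit())
```

The limit applies per process and is inherited by children, so it has to be set in, or before launching, whichever process opens the sockets.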