slice C · concurrency under load
does the engine scale with concurrent clients?
Same scale as slice A's smallest point (N = 100K), same effort knob, now with a sweep of 1 / 4 / 16 / 64 concurrent clients. Where does throughput saturate? How does p99 latency degrade as the queue depth grows? Each engine is single-process here; this is the single-machine ceiling, not a cluster benchmark.
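The report does not show its load generator. The following is a minimal sketch of one plausible closed-loop design. It assumes each simulated client keeps exactly one query in flight. A pool of worker threads drains the shared query list, every request is timed individually, and QPS is the query count divided by wall-clock time. run_closed_loop and the search callable are hypothetical names. Threads stand in for clients here; a real harness might use processes or async I/O instead.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def run_closed_loop(search: Callable[[object], object],
                    queries: Sequence[object],
                    clients: int) -> dict:
    """Send every query once using `clients` concurrent workers.

    Each worker keeps one request in flight, so at most `clients` queries are
    outstanding at any moment (a closed loop). Returns p50/p99 latency in
    microseconds and overall QPS.
    """

    def timed(query: object) -> float:
        t0 = time.perf_counter()
        search(query)  # hypothetical engine call (HTTP, gRPC or in-process)
        return (time.perf_counter() - t0) * 1e6

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        latencies_us = list(pool.map(timed, queries))
    wall_s = time.perf_counter() - wall_start

    cuts = statistics.quantiles(latencies_us, n=100)  # 99 percentile cut points
    return {
        "clients": clients,
        "p50_us": statistics.median(latencies_us),
        "p99_us": cuts[98],
        "qps": len(queries) / wall_s,
    }
```

Call it once for each of 1, 4, 16 and 64 clients to get one row per engine and concurrency level. The median over repetitions would then be taken across separate calls.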
data provenance
Where the numbers come from. Same source, same generator, same ground truth for every engine in the comparison.
Simple English Wikipedia, passages ≥ 500 chars truncated to ~400 chars. Public dump preprocessed once and frozen for the bench.
mxbai-embed-large-v1, 1024 dimensions. sentence-transformers via Apple Metal (MPS). Same model used to embed corpus and queries (no cross-model leakage).
1 000 hold-out passages (same set as slice A), used as queries and issued by 1, 4, 16, or 64 concurrent simulated clients.
Top-100 nearest neighbours computed with exact brute-force cosine over float32 vectors. Computed once per scale, reused by every engine, frozen as a parquet next to the corpus.
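Exact ground truth of this kind takes only a few lines of NumPy. The following is a minimal sketch, assuming corpus and query embeddings are already loaded as 2-D arrays. exact_topk_cosine is a hypothetical helper, not the bench's own code. Vectors are L2-normalised so that cosine similarity becomes a dot product. Queries are processed in batches to bound memory, and argpartition selects the top k before only those k are sorted.

```python
import numpy as np


def exact_topk_cosine(corpus: np.ndarray, queries: np.ndarray,
                      k: int = 100, batch: int = 256) -> np.ndarray:
    """Exact top-k neighbour ids by cosine similarity, brute force in float32.

    Returns an int64 array of shape (n_queries, k), best match first.
    """
    # Cosine similarity is the dot product of L2-normalised vectors.
    c = corpus.astype(np.float32)  # astype copies, so in-place ops are safe
    c /= np.linalg.norm(c, axis=1, keepdims=True)

    out = np.empty((len(queries), k), dtype=np.int64)
    for start in range(0, len(queries), batch):
        q = queries[start:start + batch].astype(np.float32)
        q /= np.linalg.norm(q, axis=1, keepdims=True)
        sims = q @ c.T  # shape (batch, n_corpus)
        # argpartition isolates the k largest in linear time; sort only those.
        top = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        top_sims = np.take_along_axis(sims, top, axis=1)
        order = np.argsort(-top_sims, axis=1)
        out[start:start + batch] = np.take_along_axis(top, order, axis=1)
    return out
```

At 100K × 1024 float32, the normalised corpus takes about 400 MB, and a batch of 256 queries adds a similarity matrix of roughly 100 MB. This sketch leaves out saving the result to parquet.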
throughput as clients pile on
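QPS as concurrency grows from 1 to 64. Where each engine's line flattens is its single-machine ceiling · past that point, extra clients only deepen the queue. Chroma levels off near 300 QPS from 4 clients on. Both skeg variants hold between roughly 585 and 660 QPS at every concurrency level. The three qdrant variants do not plateau: they peak at 4 clients (623 to 696 QPS) and fall back to about 260 to 280 QPS at 64. The flattening point therefore shows how much concurrency a single shard can usefully absorb.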
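The saturation point can also be read off the table rather than the chart: walk up the sweep and stop at the first step where QPS grows by less than some threshold. The following is a minimal sketch using the table's qps column. The 10% threshold and the saturation_point name are illustrative choices, not part of the bench.

```python
CLIENTS = [1, 4, 16, 64]

# qps column from the table in this section (median of 2 runs per cell)
QPS = {
    "chroma-hnsw": [245, 319, 301, 296],
    "qdrant-hnsw": [358, 623, 536, 261],
    "qdrant-pq":   [370, 656, 511, 268],
    "qdrant-sq":   [448, 696, 528, 277],
    "skeg-int8":   [601, 644, 658, 642],
    "skeg-pq128":  [584, 624, 588, 640],
}


def saturation_point(qps, clients=CLIENTS, min_gain=0.10):
    """Smallest concurrency after which the next step adds < min_gain QPS."""
    for i in range(len(qps) - 1):
        if qps[i + 1] < qps[i] * (1 + min_gain):
            return clients[i]
    return clients[-1]  # still scaling at the top of the sweep


for engine, series in QPS.items():
    print(engine, saturation_point(series))
```

With that threshold, every engine stops gaining by 4 clients. Both skeg variants gain less than 10% going from 1 to 4 clients, so they reach their ceiling with a single client.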
p99 latency under contention
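Log scale. Past the saturation point above, p99 grows roughly in proportion to concurrency (more clients = longer queue). For chroma, qdrant-hnsw and both skeg variants, each 4× step from 4 to 64 clients multiplies p99 by about 3 to 4.4. qdrant-pq and qdrant-sq grow more slowly between 16 and 64 clients (about 1.7× and 1.35×). The absolute number depends on protocol overhead; what matters is the shape · a sharp inflection would mean contention on a shared resource, not orderly queueing.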
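The roughly proportional shape has a standard explanation in Little's law, L = X × W. With a closed loop of N clients, the number of requests in the system is L = N. Once throughput X plateaus, the mean latency W = N / X must therefore grow in proportion to N. The following is a minimal sketch that checks a few table rows against this relation. It assumes the harness is closed-loop (each client waits for its answer before sending the next query), which the report does not state explicitly. littles_law_latency_us is a hypothetical helper.

```python
def littles_law_latency_us(clients: int, qps: float) -> float:
    """Mean time in system, W = L / X (Little's law), in microseconds.

    Assumes a closed loop: every client always has exactly one request
    outstanding, so the number of requests in the system L equals the
    client count.
    """
    return clients / qps * 1e6


# (engine, clients, p50_us, p99_us, qps), copied from the table in this section
ROWS = [
    ("chroma-hnsw", 16, 48206, 54322, 301),
    ("skeg-int8", 64, 93924, 103949, 642),
    ("qdrant-hnsw", 64, 25075, 143002, 261),
]

for engine, c, p50, p99, qps in ROWS:
    w = littles_law_latency_us(c, qps)
    print(f"{engine:<12} c={c:<3} implied mean {w / 1e3:6.1f} ms   "
          f"p50 {p50 / 1e3:6.1f} ms   p99 {p99 / 1e3:6.1f} ms")
```

For chroma at 16 clients and skeg-int8 at 64, the implied mean (about 53 ms and 100 ms) falls between the measured p50 and p99, as expected. For qdrant-hnsw at 64 clients, it comes out near 245 ms, above even the measured p99 of 143 ms. Under the closed-loop assumption, that mismatch suggests the QPS figure for that row includes time that the per-request timer does not capture.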
recall stays put
Recall@10 across concurrency levels. A flat line is the desired outcome: throughput pressure should not bend the recall curve. All engines deliver on this · the widest spread is qdrant-pq (0.8273 to 0.8405), and every other engine moves by less than 0.006. Load changes how fast results come out, not which neighbours come back.
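The following is a minimal sketch of recall@10 as it is usually defined. The report does not spell out its formula, so this assumes each engine's top 10 is compared against the first 10 ids of the frozen top-100 ground truth. recall_at_k is a hypothetical helper.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[Sequence[int]],
                ground_truth: Sequence[Sequence[int]],
                k: int = 10) -> float:
    """Fraction of true top-k ids found in the engine's top-k, over all queries."""
    if len(retrieved) != len(ground_truth):
        raise ValueError("need exactly one result list per query")
    hits = sum(len(set(r[:k]) & set(g[:k]))
               for r, g in zip(retrieved, ground_truth))
    return hits / (k * len(ground_truth))
```

With 1 000 queries, each recall@10 value averages 10 000 neighbour slots. skeg-int8's move from 0.9998 to 1.0000 is therefore two neighbours, while qdrant-pq's 0.0132 spread is about 130.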
all numbers
| engine | scale | knob | knob value | clients | recall@10 | p50 (µs) | p99 (µs) | QPS | RSS (MiB) |
|---|---|---|---|---|---|---|---|---|---|
| chroma-hnsw | 100k | ef | 128 | 1 | 0.9883 | 3989 | 5227 | 245 | 848.0 |
| chroma-hnsw | 100k | ef | 128 | 4 | 0.9875 | 12161 | 14185 | 319 | 856.8 |
| chroma-hnsw | 100k | ef | 128 | 16 | 0.9888 | 48206 | 54322 | 301 | 763.3 |
| chroma-hnsw | 100k | ef | 128 | 64 | 0.9880 | 103159 | 195039 | 296 | 706.3 |
| qdrant-hnsw | 100k | ef | 128 | 1 | 0.9963 | 2740 | 3186 | 358 | 908.5 |
| qdrant-hnsw | 100k | ef | 128 | 4 | 0.9960 | 5595 | 9653 | 623 | 897.9 |
| qdrant-hnsw | 100k | ef | 128 | 16 | 0.9912 | 18910 | 41414 | 536 | 776.9 |
| qdrant-hnsw | 100k | ef | 128 | 64 | 0.9940 | 25075 | 143002 | 261 | 906.8 |
| qdrant-pq | 100k | ef | 128 | 1 | 0.8380 | 2521 | 3340 | 370 | 646.2 |
| qdrant-pq | 100k | ef | 128 | 4 | 0.8405 | 5215 | 10574 | 656 | 651.0 |
| qdrant-pq | 100k | ef | 128 | 16 | 0.8273 | 19220 | 40293 | 511 | 676.4 |
| qdrant-pq | 100k | ef | 128 | 64 | 0.8402 | 20882 | 68766 | 268 | 662.8 |
| qdrant-sq | 100k | ef | 128 | 1 | 0.9657 | 2178 | 2727 | 448 | 1020.9 |
| qdrant-sq | 100k | ef | 128 | 4 | 0.9640 | 4994 | 10014 | 696 | 1014.0 |
| qdrant-sq | 100k | ef | 128 | 16 | 0.9660 | 17284 | 42184 | 528 | 1006.6 |
| qdrant-sq | 100k | ef | 128 | 64 | 0.9647 | 15742 | 56861 | 277 | 1029.3 |
| skeg-int8 | 100k | l_search | 300 | 1 | 0.9998 | 1651 | 2307 | 601 | 130.8 |
| skeg-int8 | 100k | l_search | 300 | 4 | 0.9998 | 6169 | 7497 | 644 | 131.1 |
| skeg-int8 | 100k | l_search | 300 | 16 | 0.9995 | 23885 | 26145 | 658 | 132.0 |
| skeg-int8 | 100k | l_search | 300 | 64 | 1.0000 | 93924 | 103949 | 642 | 133.8 |
| skeg-pq128 | 100k | l_search | 300 | 1 | 0.9995 | 1681 | 2378 | 584 | 59.0 |
| skeg-pq128 | 100k | l_search | 300 | 4 | 0.9995 | 6309 | 7542 | 624 | 55.2 |
| skeg-pq128 | 100k | l_search | 300 | 16 | 0.9998 | 26462 | 33311 | 588 | 57.7 |
| skeg-pq128 | 100k | l_search | 300 | 64 | 0.9998 | 96594 | 100199 | 640 | 58.5 |
methodology in one minute
- Scale: 100K vectors fixed, single corpus from slice A.
- Concurrency: 1 / 4 / 16 / 64 simultaneous clients.
- Effort knob: each engine at its default for this scale (skeg l_search=300, qdrant ef=128, chroma ef=128).
- Repetitions: 2 per (engine, concurrency); median tabulated (with two runs, the median is their mean).
- Note: the first run hit a file-descriptor limit on chroma at c=64. It was re-run with ulimit -n 65536 plus resume mode in the harness.
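If the harness is written in Python (the report does not say), the same soft-limit bump can be made from inside the process with the standard resource module. The following is a minimal sketch. raise_nofile_limit is a hypothetical helper, and 65536 mirrors the value in the note above. It raises only the soft limit, capped at the hard limit, so it needs no extra privileges. macOS may still reject values above its per-process maximum, in which case the old limit is kept.

```python
import resource


def raise_nofile_limit(target: int = 65536) -> int:
    """Raise the soft RLIMIT_NOFILE toward `target`, similar to `ulimit -n`.

    Only the soft limit changes, and it is capped at the hard limit.
    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough; never lower an existing limit
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        pass  # e.g. above the OS per-process maximum; keep the old limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```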