slice A · competitive matrix

six engines, three scales, four metrics

How does each engine scale as the dataset grows? Slice A fixes each engine's effort knob (skeg l_search=300, qdrant ef=128, chroma ef=128), runs every engine three times at 100K, 500K, and 1M vectors, and reports the median. Every engine indexes the same corpus: Simple English Wikipedia passages embedded with mxbai-embed-large-v1 (1024 dimensions).

data provenance

Where the numbers come from. Same source, same generator, same ground truth for every engine in the comparison.

corpus

Simple English Wikipedia: passages of at least 500 characters, each truncated to its first ~400 characters. The public dump was preprocessed once and frozen for the bench.

embedder

mxbai-embed-large-v1, 1024 dimensions, run through sentence-transformers on Apple Metal (MPS). The same model embeds both corpus and queries, so there is no cross-model leakage.

queries

1,000 hold-out passages, disjoint from the corpus.

ground truth

Top-100 nearest neighbours, computed by exact brute-force cosine similarity over float32 vectors. Computed once per scale, reused by every engine, and frozen as a Parquet file next to the corpus.
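A minimal pure-Python sketch of the ground-truth step, for illustration only: `normalize` and `exact_top_k` are hypothetical names, the Parquet write is omitted, and at 1M vectors × 1,000 queries a real run would use a vectorized float32 matrix multiply rather than Python loops. Normalizing every vector to unit length first turns cosine similarity into a plain dot product.

```python
import heapq
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm, so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def exact_top_k(corpus: list[list[float]], queries: list[list[float]], k: int) -> list[list[int]]:
    """Exact top-k corpus ids per query by cosine similarity, best first.

    Brute force: every query is scored against every corpus vector, so the
    result is the reference that approximate engines are measured against.
    """
    unit_corpus = [normalize(c) for c in corpus]
    truth = []
    for q in queries:
        uq = normalize(q)
        scores = [sum(a * b for a, b in zip(uq, c)) for c in unit_corpus]
        # Highest score first; on ties the smaller id wins, so results are deterministic.
        truth.append(heapq.nlargest(k, range(len(scores)), key=lambda i: (scores[i], -i)))
    return truth


corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(exact_top_k(corpus, [[1.0, 0.1]], k=2))  # [[0, 2]]
```

Because the ground truth is computed once and reused, every engine is scored against exactly the same id lists.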

recall @ 10

Of the engine's top-10 nearest neighbours, how many appear in the brute-force exact top-10? Higher is better. Above 0.99 the engines are functionally identical; below 0.95 you're starting to miss material.

recall @ 100

A tougher test: of the engine's top-100, how many are in the exact top-100? This punishes engines whose quantization is too aggressive: qdrant-pq stays below 0.88 at every scale, and even skeg-pq128 slips from 0.999 at recall@10 to 0.963 at recall@100 at 1M.
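Both recall metrics reduce to the same computation. A minimal sketch, assuming recall@k is averaged per query and each engine returns k ids per query (when it does, per-query averaging and pooling all hits give the same number); `recall_at_k` is a hypothetical helper, not the bench's code.

```python
def recall_at_k(retrieved: list[list[int]], truth: list[list[int]], k: int) -> float:
    """Mean fraction of the exact top-k ids that also appear in the engine's top-k."""
    if len(retrieved) != len(truth) or not truth:
        raise ValueError("need one result list per query, and at least one query")
    # Only membership counts; the order inside the top-k is ignored.
    per_query = [len(set(got[:k]) & set(exact[:k])) / k for got, exact in zip(retrieved, truth)]
    return sum(per_query) / len(per_query)


# Query 1 finds 1 of its 2 true neighbours, query 2 finds both (in a different order).
print(recall_at_k([[1, 2], [4, 5]], [[1, 9], [5, 4]], k=2))  # 0.75
```

Since only membership counts, an engine can score a perfect recall@10 and still lose ground deeper in the list, as skeg-int8 does at 100k (1.0000 vs 0.9973).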

disk vs RAM · how much of the index is actually loaded

Every engine indexes the same dataset (mxbai 1024d, ~4 GB on disk at 1M). The question is how much of it ends up resident in RAM during queries. HNSW backends (qdrant, chroma) keep the full index in RAM by design, so their RSS moves roughly in step with disk size. skeg is SSD-primary: PQ-128 codes are 32× smaller than f32 vectors, only the hot pages of the Vamana graph are kept warm, and the OS evicts the rest. At 1M, skeg-pq128 holds about 10% of its index in RAM (419 of 4162 MiB) while qdrant-hnsw holds 96% (4147 of 4340 MiB). Same dataset, comparable recall (0.999 vs 0.992 at recall@10, 0.963 vs 0.970 at recall@100), about a tenth of the memory.
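The headline ratios can be rechecked with a little arithmetic. A sketch, assuming PQ-128 means 128 subquantizers with one 8-bit code each (consistent with the 32× figure, but not spelled out in the source); the RSS and disk values are copied from the 1M rows of the all-numbers table, and the helper names are hypothetical.

```python
def f32_bytes_per_vector(dim: int) -> int:
    """Uncompressed float32 vector: 4 bytes per dimension."""
    return dim * 4


def pq_bytes_per_vector(subquantizers: int, bits_per_code: int = 8) -> int:
    """PQ code size, assuming one fixed-width code per subquantizer."""
    return subquantizers * bits_per_code // 8


def resident_fraction(rss_mib: float, disk_mib: float) -> float:
    """Share of the on-disk index that is resident in RAM."""
    return rss_mib / disk_mib


dim, n = 1024, 1_000_000
print(f32_bytes_per_vector(dim) / pq_bytes_per_vector(128))  # 32.0
print(n * f32_bytes_per_vector(dim) / 1e9)                   # 4.096 (GB of raw f32 vectors)
# RSS and disk columns (MiB) of the 1M rows in the results table.
print(round(resident_fraction(419.0, 4161.8), 2))   # skeg-pq128: 0.1
print(round(resident_fraction(4146.8, 4340.0), 2))  # qdrant-hnsw: 0.96
```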

RSS at steady state

Resident memory of the engine process during the query loop. On a personal-AI machine the LLM owns most of the RAM · the vector store should not contend for it. Lower is better, and the gap matters because each MiB the store doesn't use is one the LLM can keep warm.
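"Steady state" needs an operational definition. One plausible sketch, not necessarily what the bench does: sample the engine's RSS periodically during the query loop, drop the first samples taken while caches fill, and report the median of the rest. `rss_mib_from_proc` and `steady_state_rss` are hypothetical helpers, and the /proc reader works on Linux only (macOS has no /proc).

```python
import os
from statistics import median


def rss_mib_from_proc(pid: int) -> float:
    """Current resident set size of `pid` in MiB (Linux only: reads /proc/<pid>/statm)."""
    with open(f"/proc/{pid}/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field = resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20


def steady_state_rss(samples_mib: list[float], warmup: int) -> float:
    """Median RSS after discarding the first `warmup` samples (taken while caches fill)."""
    steady = samples_mib[warmup:]
    if not steady:
        raise ValueError("no samples left after warmup")
    return median(steady)


# Illustrative samples, not measured data: RSS ramps up as pages fault in, then plateaus.
print(steady_state_rss([100.0, 250.0, 400.0, 410.0, 405.0, 415.0], warmup=2))  # 407.5
```

The median keeps a single page-cache spike from dominating the reported figure, which a peak-RSS measurement would not.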

query latency p99

99th-percentile single-query latency at concurrency=1. Includes protocol overhead, not just search; see slice B for the recall/latency frontier per engine.
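A sketch of how a concurrency-1 p99 can be collected: time each query from the client side, so protocol overhead is included, then take the 99th percentile. The nearest-rank definition used here is one common choice; the bench may interpolate instead, which shifts p99 slightly. `search` is a hypothetical stand-in for whatever client call an engine exposes.

```python
import math
import time
from typing import Callable, Sequence


def measure_latencies_us(search: Callable[[object], object], queries: Sequence[object]) -> list[float]:
    """Time each query end to end, one at a time (concurrency = 1), in microseconds."""
    latencies = []
    for q in queries:
        start = time.perf_counter_ns()
        search(q)  # hypothetical client call: includes serialization and transport
        latencies.append((time.perf_counter_ns() - start) / 1_000)
    return latencies


def percentile_nearest_rank(samples: Sequence[float], p: float) -> float:
    """Smallest sample such that at least p% of all samples are <= it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


lat = measure_latencies_us(lambda q: None, range(1_000))
print(len(lat), percentile_nearest_rank(lat, 99) >= percentile_nearest_rank(lat, 50))  # 1000 True
```

With 1,000 queries, p99 is the 990th-fastest latency, so only the 10 slowest queries lie above it; a handful of stalls is enough to move it.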

throughput at concurrency = 1

Sustained QPS in a tight single-client loop. Higher is better. skeg's binary protocol gives it a transport edge here; for fairness, read these numbers as what one client gets at concurrency 1, not as a machine-wide ceiling · see slice C for how each engine scales under concurrent load.
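A minimal sketch of a tight single-client loop, assuming QPS is completed queries divided by wall-clock time with no pause between requests; the optional warm-up count is an assumption, not a documented bench setting, and `search` again stands in for an engine's client call.

```python
import time
from typing import Callable, Sequence


def single_client_qps(search: Callable[[object], object], queries: Sequence[object], warmup: int = 0) -> float:
    """Issue queries back to back from one client; return completed queries per second."""
    for q in queries[:warmup]:
        search(q)  # untimed: lets caches and connections warm up
    timed = queries[warmup:]
    if not timed:
        raise ValueError("no queries left to time")
    start = time.perf_counter()
    for q in timed:
        search(q)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("timer resolution too coarse for this few queries")
    return len(timed) / elapsed
```

With one client and nothing in between requests, QPS is roughly the reciprocal of mean latency, so a heavy tail drags it well below 1,000,000 / p50: qdrant-hnsw at 1M has p50 3737 µs (about 268 QPS if every query were median) yet sustains 159 QPS.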

build time

Time to ingest the corpus and build the index: a one-time cost per dataset. Less interesting in production (you pay it at prep time) but useful for reproducibility.

all numbers


engine	scale	n vectors	recall@10	recall@100	p50 (µs)	p99 (µs)	QPS	RSS (MiB)	build (s)	disk (MiB)
chroma-hnsw	100k-minilm	100,000	0.9765	0.9247	2736	3622	354	501.8	25.8	178.5
qdrant-hnsw	100k-minilm	100,000	0.9855	0.9666	2006	2627	483	536.9	26.6	625.0
qdrant-pq	100k-minilm	100,000	0.7945	0.8279	2155	3218	449	389.2	37.3	511.0
qdrant-sq	100k-minilm	100,000	0.9580	0.9481	1421	1937	677	561.2	25.7	596.2
skeg-int8	100k-minilm	100,000	0.9975	0.9950	1236	1600	806	73.4	32.1	172.0
skeg-pq128	100k-minilm	100,000	0.9970	0.9960	1842	2418	542	50.4	31.6	172.0
chroma-hnsw	100k	100,000	0.9900	0.9594	3897	4928	253	734.6	41.8	425.1
qdrant-hnsw	100k	100,000	0.9950	0.9820	2631	3275	370	877.8	63.4	942.5
qdrant-pq	100k	100,000	0.8395	0.8722	2523	3103	385	647.9	95.0	757.6
qdrant-sq	100k	100,000	0.9630	0.9677	2147	2684	455	1004.4	61.8	1074.6
skeg-int8	100k	100,000	1.0000	0.9973	1688	2344	590	130.8	51.9	416.2
skeg-pq128	100k	100,000	0.9995	0.9833	1681	2362	586	62.9	52.2	416.2
chroma-hnsw	500k	500,000	0.9857	0.9379	4408	7438	226	2388.4	353.5	2109.4
qdrant-hnsw	500k	500,000	0.9880	0.9475	2890	3686	343	2337.1	324.8	2351.8
qdrant-pq	500k	500,000	0.7812	0.8413	2501	3172	395	2405.1	347.8	2453.8
qdrant-sq	500k	500,000	0.9547	0.9497	2221	3175	432	3013.7	309.6	3064.8
skeg-int8	500k	500,000	0.9994	0.9963	1724	2963	572	635.2	364.1	2080.9
skeg-pq128	500k	500,000	0.9994	0.9683	1906	3060	519	227.6	365.0	2080.9
chroma-hnsw	1m	1,000,000	0.9831	0.9278	5576	19581	160	3345.4	1976.8	4214.8
qdrant-hnsw	1m	1,000,000	0.9922	0.9697	3737	43817	159	4146.8	649.2	4340.0
qdrant-pq	1m	1,000,000	0.7755	0.8272	2648	4267	367	2920.9	700.6	4569.7
qdrant-sq	1m	1,000,000	0.9521	0.9516	2411	5276	390	3125.4	627.6	5290.5
skeg-int8	1m	1,000,000	0.9989	0.9944	4535	15612	193	1254.6	891.7	4161.8
skeg-pq128	1m	1,000,000	0.9990	0.9631	2001	3624	471	419.0	903.8	4161.8

Dataset: Wikipedia Simple English, passages ≥500 chars, first ~400 chars per chunk. mxbai-embed-large-v1 via sentence-transformers MPS.
Engines & knobs: skeg-pq128 / skeg-int8 at l_search=300; qdrant-hnsw / qdrant-sq at ef=128; qdrant-pq at ef=128 nprobes=64; chroma-hnsw at default ef=128.
LanceDB: excluded from the public charts due to a known config issue that produced badly degraded recall in this run (~0.68). The tier-mode and disk-layout knobs need a careful rerun before LanceDB numbers can be compared fairly.
Sweep: 3 runs × (engine × scale), median reported. Concurrency 1.
Replication: the same matrix was also run with the smaller minilm-l6-v2 embedder (384d); results are consistent · skeg holds the RAM advantage at lower dimensionality too.
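The median-of-three reduction can be sketched as below. This version takes the median of each metric independently, so one reported row may mix values from different runs; the source does not say whether the bench does that or reports the single median run. `median_per_cell` and the run-record layout are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_per_cell(runs: list[dict]) -> dict[tuple[str, str], dict[str, float]]:
    """Group run records by (engine, scale) and take the median of every metric."""
    cells: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for run in runs:
        cells[(run["engine"], run["scale"])].append(run)
    summary = {}
    for key, group in cells.items():
        metrics = [m for m in group[0] if m not in ("engine", "scale")]
        summary[key] = {m: median([r[m] for r in group]) for m in metrics}
    return summary


# Three runs of one cell, made-up numbers (not bench results).
runs = [
    {"engine": "example-engine", "scale": "1m", "qps": 100.0, "p99_us": 5000.0},
    {"engine": "example-engine", "scale": "1m", "qps": 120.0, "p99_us": 5200.0},
    {"engine": "example-engine", "scale": "1m", "qps": 110.0, "p99_us": 4800.0},
]
# The qps median comes from run 3, the p99 median from run 1.
print(median_per_cell(runs))  # {('example-engine', '1m'): {'qps': 110.0, 'p99_us': 5000.0}}
```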