skeg
benchmarks

slice D · co-residence

vector store + LLM on the same machine

Slices A/B/C measure each engine in isolation. Slice D adds the thing that matters on a personal-AI laptop: an LLM is also running. Llama 3.2 3B (Q4_K_M, ~2.2 GB) answers 50 RAG turns while each vector store serves retrievals from a corpus that grows from 10K to 1M. The question isn't "which engine is fastest" · it's "which one lets the LLM finish its sentence".

data provenance

Where the numbers come from. Same source, same generator, same ground truth for every engine in the comparison.

corpus

English Wikipedia (full) chunked at 400 chars on sentence boundaries. Same pool used across all corpus sizes (10K, 25K, 50K… 1M) by sequential slicing.
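A minimal sketch of what "chunked at 400 chars on sentence boundaries" and "sequential slicing" could look like. This is not the original chunking script (the notes below say it was lost). The function names, the regex-based sentence splitter, and the greedy packing rule are all assumptions made for illustration.

```python
import re

def chunk_on_sentences(text: str, target_chars: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks of at most target_chars.

    ASSUMPTION: sentences end at '.', '!' or '?' followed by whitespace.
    A single sentence longer than target_chars is kept whole, not cut.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > target_chars:
            # Adding this sentence would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def corpus_slice(pool: list[str], size: int) -> list[str]:
    """Sequential slicing: each corpus size is a prefix of the same pool."""
    return pool[:size]
```

With prefix slicing, the 10K corpus is contained in the 25K corpus, which is contained in the 50K corpus, and so on, so larger cells only ever add data.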

embedder

mxbai-embed-large-v1, 1024 dimensions. sentence-transformers via Apple Metal (MPS). Same model used to embed corpus and queries (no cross-model leakage).

queries

50 hold-out passages, re-embedded live each turn with mxbai-embed-large-v1 (the slice intentionally pays the embedder cost too).

ground truth

No retrieval recall measured here; this slice answers a different question (memory pressure, throughput under co-residence).

notes

Passages: the text shown to the LLM as RAG context was re-fetched from Wikimedia and re-chunked at the same 400-char target since the original chunking script was lost. The text at id N is not guaranteed to be the exact chunk that produced embedding row N, but the corpus pool is the same domain at the same chunking granularity. For the metric measured here (memory pressure under joint LLM+vector-store load) this is fine; recall fidelity is covered by slices A and B against frozen ground truth.

Thermals: not captured for this run (sudo cache expired before powermetrics sampling started). The table shows cpu_idle_p50 and load_avg_1m_p95 so you can see how loaded the system was during each cell; future reruns will surface die temps and fan RPM when sudo holds for the whole window.

RAM cost while the LLM is running

Resident memory of the vector-store process during the RAG loop, reported as p95 of the 500 ms samples taken across the 50-turn window (~9 minutes per cell, ~1 000 samples). p95 means "95% of the time the engine sat at or below this value" · this filters out single-sample spikes while still reporting the working set under sustained use. At 1M vectors: skeg sits at ~63 MB, qdrant at ~1 893 MB (30× more). Same LLM resident in 2.2 GB, same corpus, same query stream. Chroma is in-process so it doesn't appear here (its memory is folded into the Python orchestrator).
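To make the p95 definition concrete, here is a small sketch using the nearest-rank method. The source does not say which percentile method the harness uses, so nearest-rank is an assumption, and the RSS trace below is made up purely to show how a single-sample spike drops out.

```python
import math

def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Smallest sample such that at least pct% of all samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Sample count implied by the text: ~9 minutes at one sample per 500 ms.
expected_samples = int(9 * 60 / 0.5)

# HYPOTHETICAL trace in MB: steady working set plus one single-sample spike.
rss_mb = [60.0] * 97 + [63.0, 63.0, 220.0]
p95 = percentile_nearest_rank(rss_mb, 95)
peak = max(rss_mb)
```

In this made-up trace the peak is 220 MB but p95 stays at 60 MB, which is the behaviour the paragraph describes: the spike is ignored, the sustained working set is reported.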

adaptive working set · sustained vs RAG load

The same engine, same 1M corpus, two workloads: slice A drove a sustained query loop (~590 QPS, cache stays hot); slice D drove a low-rate RAG loop (~1 query every ~4 s, long idle gaps between turns). The chart shows how much RAM each engine kept resident in each regime.

Why skeg collapses 7× (419 → 63 MB) and qdrant only halves (4 147 → 1 893 MB): skeg is SSD-primary. When queries thin out, the OS evicts cold pages of the PQ-128 codes and the Vamana graph; the next query pulls them back from SSD on demand. The working set adapts to the access pattern. qdrant-hnsw needs the whole HNSW graph plus the full f32 vectors resident to serve any query, so it can't release as much · its working set is bounded below by the index size, not the access pattern.
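A back-of-envelope calculation shows why the full f32 vectors dominate. It uses the 1M corpus size and the 1024-dimension embedder stated above. The PQ code size is an assumption: the source does not define PQ-128, and this sketch reads it as 128 one-byte codes per vector. Graph memory is left out for both engines.

```python
N_VECTORS = 1_000_000   # largest corpus in this slice
DIMS = 1024             # mxbai-embed-large-v1 output width
F32_BYTES = 4           # bytes per float32 component
PQ_CODE_BYTES = 128     # ASSUMPTION: PQ-128 = 128 one-byte codes per vector

MB = 1_000_000          # decimal megabytes

full_vectors_mb = N_VECTORS * DIMS * F32_BYTES / MB  # raw f32 vectors
pq_codes_mb = N_VECTORS * PQ_CODE_BYTES / MB         # compressed codes
ratio = full_vectors_mb / pq_codes_mb
```

Under these assumptions the raw vectors come to 4 096 MB against 128 MB of codes (32× smaller). The 4 096 MB figure is in the same range as qdrant's 4 147 MB under sustained load, which is consistent with the explanation above, though not proof of it.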

Both numbers are real. Production high-QPS: skeg 419 MB vs qdrant 4 147 MB (10× gap). Personal-AI typical RAG: skeg 63 MB vs qdrant 1 893 MB (30× gap, because qdrant can't shrink its working set the way skeg can).
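The multipliers quoted in this section can be re-derived from the four RSS figures. The snippet below is only arithmetic on the numbers already given; the variable names are illustrative.

```python
# RSS in MB at 1M vectors, as quoted in the text above.
rss_mb = {
    "skeg": {"sustained": 419, "rag": 63},
    "qdrant": {"sustained": 4147, "rag": 1893},
}

# How much each engine shrinks when load drops from sustained to RAG.
shrink = {name: v["sustained"] / v["rag"] for name, v in rss_mb.items()}

# How much more RAM qdrant holds than skeg in each regime.
gap = {
    regime: rss_mb["qdrant"][regime] / rss_mb["skeg"][regime]
    for regime in ("sustained", "rag")
}
```

This gives roughly 6.7× and 2.2× for the shrink factors and roughly 9.9× and 30× for the gaps, matching the rounded "7×", "halves", "10×" and "30×" in the prose.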

LLM throughput

What this is: tokens per second the LLM sustains during the 50-turn loop, median across the loop. The LLM is the same model and the same query stream regardless of backend; the only thing changing is which vector store is answering retrieval.

How to read it: at this size of model (3B, ~2.2 GiB resident) on a 16 GiB M1, the system has just enough headroom that backend choice does not move tokens/sec noticeably. skeg does not win this chart: the system isn't saturated yet, so every backend keeps the LLM fed. The story shifts when you swap in an 8B+ model or run on a tighter memory budget · the memory chart above is the leading indicator for that.
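As a quick check of "does not move tokens/sec noticeably", the sketch below takes the tps p50 values for the 1,000,000-vector cells from the results table and measures the spread between the three vector stores. The baseline row is left out because it performs no retrieval. Only the 1M cells are used here, so this says nothing about the other corpus sizes.

```python
# tps p50 at 1,000,000 vectors, copied from the results table.
tps_p50_1m = {"skeg": 30.7, "qdrant": 30.5, "chroma": 31.5, "baseline": 29.5}

# Compare only the cells that actually run a vector store.
stores = {name: tps for name, tps in tps_p50_1m.items() if name != "baseline"}

spread_tps = max(stores.values()) - min(stores.values())
spread_pct = 100 * spread_tps / min(stores.values())
```

The spread comes out at about 1 token/s, or roughly 3%, between the fastest and slowest backend at this corpus size.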

retrieval latency

Per-query p99 latency. Log-scale Y axis · backends differ by an order of magnitude. This number is dominated by protocol overhead, not search; see slice B for the recall/latency frontier.

all numbers


| corpus | backend | turns | tps p50 | retr p50 (ms) | retr p99 (ms) | backend rss p50 (MB) | backend rss max (MB) | compr p95 (MB) | press p95 (%) | cpu idle p50 (%) | load 1m p95 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10,000 | skeg | 147 | 31.0 | 2.0 | 3.6 | 7 | 9 | 1280 | 75 | 86 | 2.7 |
| 10,000 | qdrant | 147 | 31.7 | 4.9 | 7.9 | 119 | 230 | 1310 | 74 | 85 | 4.4 |
| 10,000 | chroma | 147 | 29.7 | 5.4 | 27.0 | - | - | 1331 | 76 | 72 | 28.9 |
| 10,000 | baseline | 49 | 28.5 | 0.0 | 0.0 | - | - | 1304 | 74 | 77 | 28.9 |
| 25,000 | skeg | 147 | 31.3 | 2.4 | 3.6 | 9 | 25 | 1187 | 70 | 86 | 6.0 |
| 25,000 | qdrant | 147 | 30.1 | 7.3 | 29.1 | 121 | 430 | 1257 | 76 | 85 | 4.1 |
| 25,000 | chroma | 147 | 31.0 | 5.9 | 12.5 | - | - | 1378 | 77 | 78 | 21.8 |
| 25,000 | baseline | 49 | 28.4 | 0.0 | 0.0 | - | - | 1429 | 77 | 68 | 28.1 |
| 50,000 | skeg | 147 | 28.1 | 3.1 | 5.8 | 12 | 36 | 1224 | 74 | 90 | 5.3 |
| 50,000 | qdrant | 147 | 27.5 | 5.5 | 9.4 | 359 | 663 | 1243 | 75 | 92 | 2.1 |
| 50,000 | chroma | 147 | 28.8 | 6.5 | 11.8 | - | - | 1226 | 74 | 78 | 17.1 |
| 50,000 | baseline | 49 | 28.6 | 0.0 | 0.0 | - | - | 1422 | 75 | 78 | 12.1 |
| 100,000 | skeg | 147 | 29.9 | 3.2 | 6.0 | 17 | 30 | 1239 | 75 | 92 | 10.7 |
| 100,000 | qdrant | 147 | 30.0 | 5.0 | 12.8 | 405 | 740 | 1263 | 74 | 89 | 4.6 |
| 100,000 | chroma | 147 | 30.8 | 7.1 | 32.7 | - | - | 1323 | 77 | 77 | 17.1 |
| 100,000 | baseline | 49 | 28.3 | 0.0 | 0.0 | - | - | 1497 | 75 | 78 | 19.5 |
| 200,000 | skeg | 147 | 27.4 | 2.5 | 4.8 | 27 | 46 | 1014 | 70 | 90 | 57.6 |
| 200,000 | qdrant | 147 | 29.8 | 5.6 | 12.0 | 298 | 1278 | 1090 | 71 | 92 | 4.7 |
| 200,000 | chroma | 147 | 30.3 | 6.8 | 22.3 | - | - | 1356 | 76 | 78 | 10.6 |
| 200,000 | baseline | 49 | 29.3 | 0.0 | 0.0 | - | - | 1427 | 74 | 86 | 3.0 |
| 350,000 | skeg | 147 | 32.5 | 3.3 | 6.8 | 38 | 45 | 895 | 65 | 89 | 54.8 |
| 350,000 | qdrant | 147 | 30.7 | 5.9 | 25.4 | 785 | 1531 | 930 | 67 | 87 | 9.1 |
| 350,000 | chroma | 147 | 31.3 | 6.6 | 23.1 | - | - | 1390 | 76 | 86 | 5.3 |
| 350,000 | baseline | 49 | 30.0 | 0.0 | 0.0 | - | - | 1468 | 76 | 84 | 3.9 |
| 500,000 | skeg | 147 | 32.2 | 3.6 | 6.6 | 51 | 220 | 621 | 58 | 85 | 19.7 |
| 500,000 | qdrant | 147 | 33.2 | 5.8 | 37.9 | 1659 | 3530 | 889 | 63 | 86 | 7.2 |
| 500,000 | chroma | 147 | 31.3 | 7.9 | 23.5 | - | - | 1377 | 75 | 78 | 14.2 |
| 500,000 | baseline | 49 | 29.2 | 0.0 | 0.0 | - | - | 1414 | 76 | 69 | 16.4 |
| 750,000 | skeg | 147 | 31.8 | 4.3 | 18.6 | 66 | 79 | 742 | 62 | 84 | 55.2 |
| 750,000 | qdrant | 147 | 31.8 | 9.7 | 53.6 | 580 | 2120 | 1147 | 73 | 86 | 12.1 |
| 750,000 | chroma | 147 | 33.2 | 10.8 | 18.1 | - | - | 1414 | 76 | 85 | 5.8 |
| 750,000 | baseline | 49 | 28.4 | 0.0 | 0.0 | - | - | 1394 | 74 | 86 | 2.5 |
| 1,000,000 | skeg | 147 | 30.7 | 5.5 | 24.4 | 54 | 67 | 1358 | 77 | 83 | 52.9 |
| 1,000,000 | qdrant | 147 | 30.5 | 21.0 | 551.7 | 253 | 2387 | 1272 | 76 | 79 | 14.7 |
| 1,000,000 | chroma | 147 | 31.5 | 12.1 | 21.2 | - | - | 1399 | 76 | 85 | 4.5 |
| 1,000,000 | baseline | 49 | 29.5 | 0.0 | 0.0 | - | - | 1395 | 77 | 85 | 2.7 |

methodology in one minute