slice D · co-residence
vector store + LLM on the same machine
Slices A/B/C measure each engine in isolation. Slice D adds the thing that matters on a personal-AI laptop: an LLM is also running. Llama 3.2 3B (Q4_K_M, ~2.2 GB) answers 50 RAG turns while each vector store serves retrievals from a corpus that grows from 10K to 1M. The question isn't "which engine is fastest" · it's "which one lets the LLM finish its sentence".
data provenance
Where the numbers come from. Same source, same generator, same ground truth for every engine in the comparison.
Corpus: English Wikipedia (full), chunked at 400 chars on sentence boundaries. The same pool is used across all corpus sizes (10K, 25K, 50K… 1M) by sequential slicing.
Embedder: mxbai-embed-large-v1, 1024 dimensions, run with sentence-transformers on Apple Metal (MPS). The same model embeds both corpus and queries (no cross-model mismatch).
Queries: 50 hold-out passages, re-embedded live each turn with mxbai-embed-large (the slice intentionally pays the embedder cost too).
Recall: not measured here; this slice answers a different question (memory pressure and throughput under co-residence).
Passages: the text shown to the LLM as RAG context was re-fetched from Wikimedia and re-chunked at the same 400-char target since the original chunking script was lost. The text at id N is not guaranteed to be the exact chunk that produced embedding row N, but the corpus pool is the same domain at the same chunking granularity. For the metric measured here (memory pressure under joint LLM+vector-store load) this is fine; recall fidelity is covered by slices A and B against frozen ground truth.
Thermals: not captured for this run (sudo cache expired before powermetrics sampling started). The table shows cpu_idle_p50 and load_avg_1m_p95 so you can see how loaded the system was during each cell; future reruns will surface die temps and fan RPM when sudo holds for the whole window.
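For readers who want to rebuild a comparable pool, here is a minimal sketch of sentence-boundary chunking with a 400-character target. It is not the original script (that one was lost, as noted above); the regex sentence splitter and the name `chunk_text` are assumptions made for illustration.

```python
import re

def chunk_text(text: str, target: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `target` chars."""
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= target or not current:
            # Either it fits, or the chunk is empty and a single
            # over-long sentence is kept whole rather than cut mid-sentence.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

A chunk can exceed the target only when one sentence alone is longer than 400 characters; a real pipeline would have to decide whether to split or drop those.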
RAM cost while the LLM is running
Resident memory of the vector-store process during the RAG loop, reported as the p95 of the 500 ms samples taken across the 50-turn window (~9 minutes per cell, ~1 000 samples). p95 means "95% of the time the engine sat at or below this value" · this filters out single-sample spikes while still reporting the working set under sustained use. At 1M vectors: skeg sits at ~63 MB, qdrant at ~1 893 MB (30× more). Same LLM (~2.2 GB resident), same corpus, same query stream. Chroma runs in-process, so it doesn't appear here (its memory is folded into the Python orchestrator).
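To make the p95 definition concrete, here is a small sketch using the nearest-rank method. The source does not say which percentile method the sampler uses, so nearest-rank is an assumption, and the RSS trace is made up to show how a handful of spikes fall out of the p95.

```python
import math

def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Smallest sample such that at least pct% of all samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical trace in MB: a steady working set plus a few short spikes.
rss_mb = [60.0] * 990 + [63.0] * 5 + [220.0] * 5

p95 = percentile_nearest_rank(rss_mb, 95)  # stays at the steady level
peak = max(rss_mb)                         # dominated by the spikes
```

With 1 000 samples, up to 50 of them can sit above the reported p95, which is why the table's max column can be several times larger than the typical value.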
adaptive working set · sustained vs RAG load
The same engine, same 1M corpus, two workloads: slice A drove a sustained query loop (~590 QPS, cache stays hot); slice D drove a low-rate RAG loop (~1 query every ~4 s, long idle gaps between turns). The chart shows how much RAM each engine kept resident in each regime.
Why skeg collapses 7× (419 → 63 MB) and qdrant only halves (4 147 → 1 893 MB): skeg is SSD-primary. When queries thin out, the OS evicts cold pages of the PQ-128 codes and the Vamana graph; the next query pulls them back from SSD on demand. The working set adapts to the access pattern. qdrant-hnsw needs the whole HNSW graph plus the full f32 vectors resident to serve any query, so it can't release as much · its working set is bounded below by the index size, not the access pattern.
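The on-demand behaviour described above can be illustrated with a plain memory-mapped file. This is a sketch of the general OS mechanism, not skeg's implementation; the file layout (128 one-byte codes per vector) and the helper names are assumptions.

```python
import mmap
import os
import tempfile

CODE_BYTES = 128  # assumed layout: one byte per PQ sub-quantizer, 128 per vector

def write_codes(path: str, n_vectors: int) -> None:
    """Write a dummy code file; row i is filled with the byte (i % 256)."""
    with open(path, "wb") as f:
        for i in range(n_vectors):
            f.write(bytes([i % 256]) * CODE_BYTES)

def read_code(mm: mmap.mmap, row: int) -> bytes:
    """Slice one row out of the mapping; untouched pages need not be resident."""
    start = row * CODE_BYTES
    return mm[start:start + CODE_BYTES]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "codes.bin")
    write_codes(path, n_vectors=10_000)
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        file_mb = len(mm) / 1e6        # size of the whole mapping
        row = read_code(mm, 4_242)     # touches only the pages holding this row
```

Because the mapping is backed by the file, the kernel is free to drop clean pages when they go cold and read them back on the next access, which is the kind of adaptive working set the chart shows.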
Both numbers are real. Production high-QPS: skeg 419 MB vs qdrant 4 147 MB (10× gap). Personal-AI typical RAG: skeg 63 MB vs qdrant 1 893 MB (30× gap, because qdrant can't shrink its working set the way skeg can).
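The ratios quoted above can be rechecked from the four figures. The snippet also computes the raw size of 1M float32 vectors at 1024 dimensions for scale; that last number is plain arithmetic, not a measurement from this run.

```python
# Figures quoted in the text, in MB, at 1M vectors.
rss_mb = {
    "skeg":   {"sustained": 419, "rag": 63},
    "qdrant": {"sustained": 4147, "rag": 1893},
}

def ratio(a: float, b: float) -> float:
    """a / b rounded to one decimal place."""
    return round(a / b, 1)

gap_sustained = ratio(rss_mb["qdrant"]["sustained"], rss_mb["skeg"]["sustained"])
gap_rag = ratio(rss_mb["qdrant"]["rag"], rss_mb["skeg"]["rag"])
shrink_skeg = ratio(rss_mb["skeg"]["sustained"], rss_mb["skeg"]["rag"])
shrink_qdrant = ratio(rss_mb["qdrant"]["sustained"], rss_mb["qdrant"]["rag"])

# 1M vectors x 1024 dims x 4 bytes per float32, in MB.
raw_f32_mb = 1_000_000 * 1024 * 4 / 1e6
```

The raw float32 footprint comes to 4 096 MB, close to qdrant's 4 147 MB under sustained load, which is consistent with the full vectors being resident in that regime.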
LLM throughput
What this is: tokens per second the LLM sustains during the 50-turn loop, median across the loop. The LLM is the same model and the same query stream regardless of backend; the only thing changing is which vector store is answering retrieval.
How to read it: with a model this size (3B, ~2.2 GB resident) on a 16 GiB M1, the system has just enough headroom that backend choice does not move tokens/sec noticeably. This chart is not a win for skeg: the system isn't saturated yet, so every backend keeps the LLM fed. The story shifts when you swap in an 8B+ model or run on a tighter memory budget · the memory chart above is the leading indicator for that.
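One way to arrive at a per-cell tokens/sec median is sketched below. The source does not say how the orchestrator computes it; this assumes it reads the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns with each generation. The per-turn records are made up.

```python
from statistics import median

def tokens_per_second(turn: dict) -> float:
    """Generation rate for one turn; eval_duration is in nanoseconds."""
    return turn["eval_count"] * 1e9 / turn["eval_duration"]

# Hypothetical per-turn records, shaped like Ollama's response fields.
turns = [
    {"eval_count": 200, "eval_duration": 6_400_000_000},
    {"eval_count": 200, "eval_duration": 6_250_000_000},
    {"eval_count": 200, "eval_duration": 8_000_000_000},
]

tps_p50 = median(tokens_per_second(t) for t in turns)
```

Taking the median rather than the mean keeps one slow turn (the 8-second one here) from dragging the cell's number down.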
retrieval latency
Per-query latency, p99. The Y axis is logarithmic because backends differ by an order of magnitude. This number is dominated by protocol overhead, not search; see slice B for the recall/latency frontier.
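A sketch of how client-side latency of this kind is typically collected: time the whole call from the caller's side, then take percentiles. Because the timer wraps the full round trip, serialization and transport are included, which is why protocol overhead can dominate. The function names and the stand-in `fake_retrieve` are hypothetical, and the benchmark's actual percentile method is not stated.

```python
import time
from statistics import quantiles

def time_calls_ms(fn, queries) -> list[float]:
    """Wall-clock latency of fn(query) in milliseconds, one entry per query."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

def p50_p99(latencies_ms: list[float]) -> tuple[float, float]:
    """Median and 99th percentile from the 99 percentile cut points."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]

def fake_retrieve(query: str) -> list[int]:
    # Stand-in for a vector-store client call; returns 10 dummy ids.
    return list(range(10))

lat = time_calls_ms(fake_retrieve, [f"q{i}" for i in range(200)])
p50, p99 = p50_p99(lat)
```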
all numbers
| corpus | backend | turns | LLM tok/s p50 | retrieval p50 (ms) | retrieval p99 (ms) | backend RSS p50 | backend RSS max | compressed mem p95 (MB) | memory pressure p95 (%) | CPU idle p50 (%) | load avg 1m p95 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10,000 | skeg | 147 | 31.0 | 2.0 | 3.6 | 7 MB | 9 MB | 1280 | 75 | 86 | 2.7 |
| 10,000 | qdrant | 147 | 31.7 | 4.9 | 7.9 | 119 MB | 230 MB | 1310 | 74 | 85 | 4.4 |
| 10,000 | chroma | 147 | 29.7 | 5.4 | 27.0 | - | - | 1331 | 76 | 72 | 28.9 |
| 10,000 | baseline | 49 | 28.5 | 0.0 | 0.0 | - | - | 1304 | 74 | 77 | 28.9 |
| 25,000 | skeg | 147 | 31.3 | 2.4 | 3.6 | 9 MB | 25 MB | 1187 | 70 | 86 | 6.0 |
| 25,000 | qdrant | 147 | 30.1 | 7.3 | 29.1 | 121 MB | 430 MB | 1257 | 76 | 85 | 4.1 |
| 25,000 | chroma | 147 | 31.0 | 5.9 | 12.5 | - | - | 1378 | 77 | 78 | 21.8 |
| 25,000 | baseline | 49 | 28.4 | 0.0 | 0.0 | - | - | 1429 | 77 | 68 | 28.1 |
| 50,000 | skeg | 147 | 28.1 | 3.1 | 5.8 | 12 MB | 36 MB | 1224 | 74 | 90 | 5.3 |
| 50,000 | qdrant | 147 | 27.5 | 5.5 | 9.4 | 359 MB | 663 MB | 1243 | 75 | 92 | 2.1 |
| 50,000 | chroma | 147 | 28.8 | 6.5 | 11.8 | - | - | 1226 | 74 | 78 | 17.1 |
| 50,000 | baseline | 49 | 28.6 | 0.0 | 0.0 | - | - | 1422 | 75 | 78 | 12.1 |
| 100,000 | skeg | 147 | 29.9 | 3.2 | 6.0 | 17 MB | 30 MB | 1239 | 75 | 92 | 10.7 |
| 100,000 | qdrant | 147 | 30.0 | 5.0 | 12.8 | 405 MB | 740 MB | 1263 | 74 | 89 | 4.6 |
| 100,000 | chroma | 147 | 30.8 | 7.1 | 32.7 | - | - | 1323 | 77 | 77 | 17.1 |
| 100,000 | baseline | 49 | 28.3 | 0.0 | 0.0 | - | - | 1497 | 75 | 78 | 19.5 |
| 200,000 | skeg | 147 | 27.4 | 2.5 | 4.8 | 27 MB | 46 MB | 1014 | 70 | 90 | 57.6 |
| 200,000 | qdrant | 147 | 29.8 | 5.6 | 12.0 | 298 MB | 1278 MB | 1090 | 71 | 92 | 4.7 |
| 200,000 | chroma | 147 | 30.3 | 6.8 | 22.3 | - | - | 1356 | 76 | 78 | 10.6 |
| 200,000 | baseline | 49 | 29.3 | 0.0 | 0.0 | - | - | 1427 | 74 | 86 | 3.0 |
| 350,000 | skeg | 147 | 32.5 | 3.3 | 6.8 | 38 MB | 45 MB | 895 | 65 | 89 | 54.8 |
| 350,000 | qdrant | 147 | 30.7 | 5.9 | 25.4 | 785 MB | 1531 MB | 930 | 67 | 87 | 9.1 |
| 350,000 | chroma | 147 | 31.3 | 6.6 | 23.1 | - | - | 1390 | 76 | 86 | 5.3 |
| 350,000 | baseline | 49 | 30.0 | 0.0 | 0.0 | - | - | 1468 | 76 | 84 | 3.9 |
| 500,000 | skeg | 147 | 32.2 | 3.6 | 6.6 | 51 MB | 220 MB | 621 | 58 | 85 | 19.7 |
| 500,000 | qdrant | 147 | 33.2 | 5.8 | 37.9 | 1659 MB | 3530 MB | 889 | 63 | 86 | 7.2 |
| 500,000 | chroma | 147 | 31.3 | 7.9 | 23.5 | - | - | 1377 | 75 | 78 | 14.2 |
| 500,000 | baseline | 49 | 29.2 | 0.0 | 0.0 | - | - | 1414 | 76 | 69 | 16.4 |
| 750,000 | skeg | 147 | 31.8 | 4.3 | 18.6 | 66 MB | 79 MB | 742 | 62 | 84 | 55.2 |
| 750,000 | qdrant | 147 | 31.8 | 9.7 | 53.6 | 580 MB | 2120 MB | 1147 | 73 | 86 | 12.1 |
| 750,000 | chroma | 147 | 33.2 | 10.8 | 18.1 | - | - | 1414 | 76 | 85 | 5.8 |
| 750,000 | baseline | 49 | 28.4 | 0.0 | 0.0 | - | - | 1394 | 74 | 86 | 2.5 |
| 1,000,000 | skeg | 147 | 30.7 | 5.5 | 24.4 | 54 MB | 67 MB | 1358 | 77 | 83 | 52.9 |
| 1,000,000 | qdrant | 147 | 30.5 | 21.0 | 551.7 | 253 MB | 2387 MB | 1272 | 76 | 79 | 14.7 |
| 1,000,000 | chroma | 147 | 31.5 | 12.1 | 21.2 | - | - | 1399 | 76 | 85 | 4.5 |
| 1,000,000 | baseline | 49 | 29.5 | 0.0 | 0.0 | - | - | 1395 | 77 | 85 | 2.7 |
methodology in one minute
- LLM: Llama 3.2 3B Q4_K_M via Ollama (~2.2 GB resident).
- Embedder: mxbai-embed-large (1024d).
- Corpus: enwiki chunked 400 chars, mxbai embeddings. Sweep 10k → 1M.
- Workload: 50-turn conversational RAG (embed → top-10 retrieval → 200-tok gen).
- Backends: skeg (PQ-128, RESP3), Qdrant (HNSW default), Chroma (in-process). One run at a time, 60 s cooldown.
- Baseline: LLM-only with random passages from the same pool. No vector store running.
- Repetitions: 3 per cell. Median + p10/p90 reported.
- Machine state: as-is. No process kill. Other apps stay open.
- Sampler: 500 ms interval, PID-injected process labels (avoids substring false positives), system metrics via vm_stat + memory_pressure.
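The sampler's point about PID-injected labels can be shown with a toy process snapshot. This is a sketch of the idea only; the `ProcRow` type, the process names, PIDs, and sizes are all hypothetical, and the real sampler's code is not shown in the source.

```python
from dataclasses import dataclass

@dataclass
class ProcRow:
    pid: int
    command: str
    rss_kb: int

def rss_by_substring(rows: list[ProcRow], needle: str) -> int:
    """Naive labeling: sums every process whose command contains `needle`."""
    return sum(r.rss_kb for r in rows if needle in r.command)

def rss_by_pid(rows: list[ProcRow], labels: dict[int, str], label: str) -> int:
    """PID-injected labeling: only PIDs registered at launch are counted."""
    return sum(r.rss_kb for r in rows if labels.get(r.pid) == label)

# Made-up snapshot: the engine, the orchestrator, and an unrelated log tail.
rows = [
    ProcRow(4101, "/usr/local/bin/skeg", 64_000),
    ProcRow(4102, "python orchestrator.py skeg", 900_000),
    ProcRow(4103, "tail -f skeg.log", 1_200),
]
labels = {4101: "skeg"}  # recorded by the launcher when it spawns the engine

naive_kb = rss_by_substring(rows, "skeg")    # also counts orchestrator and tail
exact_kb = rss_by_pid(rows, labels, "skeg")  # counts only the engine process
```

In this toy case the substring match overstates the engine's memory by more than an order of magnitude, because the orchestrator's command line happens to mention the backend name.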