slice D · co-residence

vector store + LLM on the same machine

Slices A/B/C measure each engine in isolation. Slice D adds the thing that matters on a personal-AI laptop: an LLM is also running. Llama 3.2 3B (Q4_K_M, ~2.2 GB) answers 50 RAG turns while each vector store serves retrievals from a corpus that grows from 10K to 1M vectors. The question isn't "which engine is fastest" · it's "which one lets the LLM finish its sentence".

data provenance

Where the numbers come from. Same source, same generator, same ground truth for every engine in the comparison.

corpus

English Wikipedia (full) chunked at 400 chars on sentence boundaries. Same pool used across all corpus sizes (10K, 25K, 50K… 1M) by sequential slicing.
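
A minimal sketch of what "400 chars on sentence boundaries" plus "sequential slicing" could look like. The original chunking script is not available, so everything here is an assumption for illustration: the function names `chunk_sentences` and `corpus_slice`, the naive regex sentence splitter, and the choice to keep an over-long sentence whole rather than cut it.

```python
import re


def chunk_sentences(text: str, target: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `target` chars."""
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= target or not current:
            # Either it fits, or it is a lone over-long sentence kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def corpus_slice(pool: list[str], size: int) -> list[str]:
    """Sequential slicing: each corpus size is a prefix of the same pool."""
    return pool[:size]
```

With prefix slicing, the 10K corpus is contained in the 25K corpus, which is contained in the 50K corpus, and so on, so growth in memory or latency across rows comes from size rather than from a different sample of text.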

embedder

mxbai-embed-large-v1, 1024 dimensions. sentence-transformers via Apple Metal (MPS). Same model used to embed corpus and queries (no cross-model leakage).

queries

50 hold-out passages re-embedded live each turn with mxbai-embed-large (the slice intentionally pays the embedder cost too).

ground truth

No retrieval recall measured here; this slice answers a different question (memory pressure, throughput under co-residence).

notes

Passages: the text shown to the LLM as RAG context was re-fetched from Wikimedia and re-chunked at the same 400-char target since the original chunking script was lost. The text at id N is not guaranteed to be the exact chunk that produced embedding row N, but the corpus pool is the same domain at the same chunking granularity. For the metric measured here (memory pressure under joint LLM+vector-store load) this is fine; recall fidelity is covered by slices A and B against frozen ground truth.

Thermals: not captured for this run (sudo cache expired before powermetrics sampling started). The table shows cpu_idle_p50 and load_avg_1m_p95 so you can see how loaded the system was during each cell; future reruns will surface die temps and fan RPM when sudo holds for the whole window.

RAM cost while the LLM is running

Resident memory of the vector-store process during the RAG loop, reported as p95 of the 500 ms samples taken across the 50-turn window (~9 minutes per cell, ~1 000 samples). p95 means "95% of the time the engine sat at or below this value" · this filters out single-sample spikes while still reporting the working set under sustained use. At 1M vectors: skeg sits at ~63 MB, qdrant at ~1 893 MB (30× more). Same LLM resident in 2.2 GB, same corpus, same query stream. Chroma is in-process so it doesn't appear here (its memory is folded into the Python orchestrator).
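
To make the p95 definition concrete, here is a small sketch using the nearest-rank method. The report does not say which percentile method the sampler used, so nearest-rank is an assumption, and `percentile_nearest_rank` is a hypothetical helper name. The sample values are made up to show the filtering effect and are not measurements.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Smallest sample such that at least `pct` percent of samples are <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative RSS samples in MB (synthetic): mostly flat, one brief spike.
rss_mb = [60.0] * 94 + [63.0] * 5 + [220.0]
p95 = percentile_nearest_rank(rss_mb, 95)    # ignores the single spike
peak = percentile_nearest_rank(rss_mb, 100)  # the spike itself
```

On this synthetic series the p95 lands on the sustained level while the maximum reports the one-off spike, which is the distinction the paragraph above draws.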

adaptive working set · sustained vs RAG load

The same engine, same 1M corpus, two workloads: slice A drove a sustained query loop (~590 QPS, cache stays hot); slice D drove a low-rate RAG loop (~1 query every ~4 s, long idle gaps between turns). The chart shows how much RAM each engine kept resident in each regime.

Why skeg collapses 7× (419 → 63 MB) and qdrant only halves (4 147 → 1 893 MB): skeg is SSD-primary. When queries thin out, the OS evicts cold pages of the PQ-128 codes and the Vamana graph; the next query pulls them back from SSD on demand. The working set adapts to the access pattern. qdrant-hnsw needs the whole HNSW graph plus the full f32 vectors resident to serve any query, so it can't release as much · its working set is bounded below by the index size, not the access pattern.

Both numbers are real. Production high-QPS: skeg 419 MB vs qdrant 4 147 MB (10× gap). Personal-AI typical RAG: skeg 63 MB vs qdrant 1 893 MB (30× gap, because qdrant can't shrink its working set the way skeg can).
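
The ratios quoted in this section follow directly from the four RSS figures. A short arithmetic check, using only the numbers stated above (the dictionary layout and variable names are illustrative):

```python
# Resident memory at 1M vectors, in MB, as quoted in this section.
rss_mb = {
    "skeg": {"sustained": 419, "rag": 63},
    "qdrant": {"sustained": 4147, "rag": 1893},
}

# How much each engine shrinks when moving from sustained load to the RAG loop.
shrink = {
    engine: round(v["sustained"] / v["rag"], 1) for engine, v in rss_mb.items()
}

# How much more memory qdrant holds than skeg in each regime.
gap = {
    regime: round(rss_mb["qdrant"][regime] / rss_mb["skeg"][regime], 1)
    for regime in ("sustained", "rag")
}
```

Here `shrink` works out to about 6.7 for skeg and 2.2 for qdrant, and `gap` to about 9.9 under sustained load and 30.0 under the RAG loop, matching the rounded 7×, "halves", 10× and 30× in the text.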

LLM throughput

What this is: tokens per second the LLM sustains during the 50-turn loop, median across the loop. The LLM is the same model and the same query stream regardless of backend; the only thing changing is which vector store is answering retrieval.
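
A sketch of how a per-loop median could be derived from per-turn measurements, assuming each turn yields a generated-token count and a generation time in seconds. Ollama's generate response, for instance, reports `eval_count` and `eval_duration`; whether this run used those fields is not stated, and the helper names below are hypothetical.

```python
from statistics import median


def tokens_per_second(n_tokens: int, gen_seconds: float) -> float:
    """Throughput of a single turn."""
    if gen_seconds <= 0:
        raise ValueError("generation time must be positive")
    return n_tokens / gen_seconds


def loop_tps_p50(turns: list[tuple[int, float]]) -> float:
    """Median tokens/sec across (n_tokens, gen_seconds) pairs for one loop."""
    return median(tokens_per_second(n, s) for n, s in turns)


# Three synthetic turns of 200 tokens each (not measurements).
example = loop_tps_p50([(200, 6.25), (200, 8.0), (200, 10.0)])
```

The median is used rather than the mean so that one slow turn, for example while the machine is busy with something else, does not drag the figure for the whole loop.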

How to read it: at this model size (3B, ~2.2 GB resident) on a 16 GiB M1 the system has just enough headroom that backend choice does not move tokens/sec noticeably. This chart is not a win for skeg: the system isn't saturated yet, so every backend keeps the LLM fed. The story shifts when you swap in an 8B+ model or run on a tighter memory budget · the memory chart above is the leading indicator for that.

retrieval latency

Per-query latency p99. Log-scale Y axis · backends differ by an order of magnitude. This number is dominated by protocol overhead, not search; see slice B for the recall/latency frontier.

all numbers

corpus	backend	turns	tps p50	retr p50 ms	retr p99 ms	backend rss p50	backend rss max	compr p95 MB	press p95 %	cpu idle p50 %	load 1m p95
10,000	skeg	147	31.0	2.0	3.6	7 MB	9 MB	1280	75	86	2.7
10,000	qdrant	147	31.7	4.9	7.9	119 MB	230 MB	1310	74	85	4.4
10,000	chroma	147	29.7	5.4	27.0	-	-	1331	76	72	28.9
10,000	baseline	49	28.5	0.0	0.0	-	-	1304	74	77	28.9
25,000	skeg	147	31.3	2.4	3.6	9 MB	25 MB	1187	70	86	6.0
25,000	qdrant	147	30.1	7.3	29.1	121 MB	430 MB	1257	76	85	4.1
25,000	chroma	147	31.0	5.9	12.5	-	-	1378	77	78	21.8
25,000	baseline	49	28.4	0.0	0.0	-	-	1429	77	68	28.1
50,000	skeg	147	28.1	3.1	5.8	12 MB	36 MB	1224	74	90	5.3
50,000	qdrant	147	27.5	5.5	9.4	359 MB	663 MB	1243	75	92	2.1
50,000	chroma	147	28.8	6.5	11.8	-	-	1226	74	78	17.1
50,000	baseline	49	28.6	0.0	0.0	-	-	1422	75	78	12.1
100,000	skeg	147	29.9	3.2	6.0	17 MB	30 MB	1239	75	92	10.7
100,000	qdrant	147	30.0	5.0	12.8	405 MB	740 MB	1263	74	89	4.6
100,000	chroma	147	30.8	7.1	32.7	-	-	1323	77	77	17.1
100,000	baseline	49	28.3	0.0	0.0	-	-	1497	75	78	19.5
200,000	skeg	147	27.4	2.5	4.8	27 MB	46 MB	1014	70	90	57.6
200,000	qdrant	147	29.8	5.6	12.0	298 MB	1278 MB	1090	71	92	4.7
200,000	chroma	147	30.3	6.8	22.3	-	-	1356	76	78	10.6
200,000	baseline	49	29.3	0.0	0.0	-	-	1427	74	86	3.0
350,000	skeg	147	32.5	3.3	6.8	38 MB	45 MB	895	65	89	54.8
350,000	qdrant	147	30.7	5.9	25.4	785 MB	1531 MB	930	67	87	9.1
350,000	chroma	147	31.3	6.6	23.1	-	-	1390	76	86	5.3
350,000	baseline	49	30.0	0.0	0.0	-	-	1468	76	84	3.9
500,000	skeg	147	32.2	3.6	6.6	51 MB	220 MB	621	58	85	19.7
500,000	qdrant	147	33.2	5.8	37.9	1659 MB	3530 MB	889	63	86	7.2
500,000	chroma	147	31.3	7.9	23.5	-	-	1377	75	78	14.2
500,000	baseline	49	29.2	0.0	0.0	-	-	1414	76	69	16.4
750,000	skeg	147	31.8	4.3	18.6	66 MB	79 MB	742	62	84	55.2
750,000	qdrant	147	31.8	9.7	53.6	580 MB	2120 MB	1147	73	86	12.1
750,000	chroma	147	33.2	10.8	18.1	-	-	1414	76	85	5.8
750,000	baseline	49	28.4	0.0	0.0	-	-	1394	74	86	2.5
1,000,000	skeg	147	30.7	5.5	24.4	54 MB	67 MB	1358	77	83	52.9
1,000,000	qdrant	147	30.5	21.0	551.7	253 MB	2387 MB	1272	76	79	14.7
1,000,000	chroma	147	31.5	12.1	21.2	-	-	1399	76	85	4.5
1,000,000	baseline	49	29.5	0.0	0.0	-	-	1395	77	85	2.7

methodology in one minute

LLM: Llama 3.2 3B Q4_K_M via Ollama (~2.2 GB resident).
Embedder: mxbai-embed-large (1024d).
Corpus: enwiki chunked 400 chars, mxbai embeddings. Sweep 10k → 1M.
Workload: 50-turn conversational RAG (embed → top-10 retrieval → 200-tok gen).
Backends: skeg (PQ-128, RESP3), Qdrant (HNSW default), Chroma (in-process). One run at a time, 60 s cooldown.
Baseline: LLM-only with random passages from the same pool. No vector store running.
Repetitions: 3 per cell. Median + p10/p90 reported.
Machine state: as-is. No process kill. Other apps stay open.
Sampler: 500 ms interval, PID-injected process labels (avoids substring false positives), system metrics via vm_stat + memory_pressure.
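
The workload line above (embed → top-10 retrieval → 200-tok gen) can be sketched as a timed loop. This is an illustration under stated assumptions and not the harness that produced the table: `run_rag_loop` and the `embed`, `retrieve` and `generate` callables are hypothetical stand-ins for the embedder, the vector-store client and the LLM call.

```python
import time
from typing import Callable, Sequence


def run_rag_loop(
    queries: Sequence[str],
    embed: Callable[[str], Sequence[float]],
    retrieve: Callable[[Sequence[float], int], Sequence[str]],
    generate: Callable[[str, Sequence[str], int], int],
    top_k: int = 10,
    max_tokens: int = 200,
) -> list[dict]:
    """Run one turn per query and record retrieval latency and tokens/sec."""
    records = []
    for turn, query in enumerate(queries):
        vector = embed(query)  # the embedder cost is paid every turn
        t0 = time.perf_counter()
        passages = retrieve(vector, top_k)
        t1 = time.perf_counter()
        # `generate` is assumed to return the number of tokens it produced.
        n_tokens = generate(query, passages, max_tokens)
        t2 = time.perf_counter()
        gen_seconds = t2 - t1
        records.append(
            {
                "turn": turn,
                "retr_ms": (t1 - t0) * 1000.0,
                "tps": n_tokens / gen_seconds if gen_seconds > 0 else 0.0,
                "n_passages": len(passages),
            }
        )
    return records
```

In this framing the baseline row corresponds to passing a `retrieve` that returns random passages from the same pool without contacting any vector store, so the LLM sees prompts of similar length either way.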