# Twelve weeks, three bugs
Twelve weeks into building Context Fabric, I finally stopped trusting my own numbers. Good call. Three bugs later, the thing looks a lot closer to release.
I've been building Context Fabric for twelve weeks now. It's the memory and retrieval layer behind Vesper, exposed over MCP, built around a very stubborn idea: agent memory should run locally, stay inspectable, and not disappear because some API changed terms on a Tuesday.
This week I finally ran it properly against the public benchmarks.
The short version is that it did well.
The more useful version is that it did well after I found three stupid bugs that had been kneecapping it for weeks.
That is both annoying and, honestly, kind of encouraging. If the score moves that much when the fixes are this unglamorous, the system is getting close to the point where the remaining work feels like tightening, not rescue.
## Where it stands
| Benchmark | Metric | Context Fabric | OpenAI text-embedding-3-small |
|---|---|---|---|
| BEIR SciFact | nDCG@10 | 0.744 | 0.774 |
| BEIR SciFact | Recall@100 | 0.967 | ~0.93 |
| BEIR FiQA | nDCG@10 | 0.380 | 0.397 |
| BEIR FiQA | Recall@100 | 0.736 | ~0.69 |
| LongMemEval_S | Hit@5 | 0.952 | — |
On straight ranking quality, nDCG@10, I'm still a few points behind OpenAI's cheap paid tier on both BEIR datasets.
On deep recall, which I care about more for agent memory, I'm ahead on both.
That distinction matters.
If you're building search for a user-facing product, top-10 ranking is the whole game. If you're building memory for an agent, the more practical question is whether the right thing is somewhere in the candidate set the model can still work with. If the answer is yes, the rest becomes prompt and reasoning work.
That is why Recall@100 matters so much here.
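To make the distinction concrete, both metrics can be computed from a single ranked list. This is a minimal sketch of standard binary-relevance definitions (the function names are mine, not Context Fabric's):

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: rewards putting relevant docs near the top."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal

# A ranking that buries the right answer at position 10 scores poorly on
# nDCG@10 but still gets full Recall@100 -- the case agent memory cares about.
ranked = ["d%02d" % i for i in range(100)]
relevant = ["d09"]
print(ndcg_at_k(ranked, relevant, 10))    # ~0.29: relevant doc is deep in the top 10
print(recall_at_k(ranked, relevant, 100)) # 1.0: still in the candidate set
```

The agent's reranking and reasoning steps get a second chance at everything Recall@100 catches; nothing recovers a document that never made the candidate set.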
And on LongMemEval, a Hit@5 of 0.952 puts the retrieval ceiling above what Zep and Mem0 publish on the same benchmark.
So no, this is not me claiming I beat OpenAI at embeddings.
It is me saying the local stack is now close enough that the gap is measured in single digits, not in "come back when you have a datacenter." That's a very different conversation.
## What's underneath
The funny part is how small the whole thing still is.
OpenAI's version of this lives in their datacenter. API keys, rate limits, network hops, per-token pricing, and a model nobody outside the building will ever get to inspect.
Mine is a SQLite file.
bge-small-en-v1.5 for embeddings. FTS5 for the lexical stream. sqlite-vec for vector search. Everything runs in-process. The whole stack is around 80 MB on disk, permissively licensed, with no network dependency once it's installed.
I keep coming back to that because it still feels slightly ridiculous.
A few points on nDCG@10 is the distance between running locally on a laptop and renting somebody else's infrastructure forever. I would take that trade again without much soul-searching.
## The three bugs
This is the part that made me laugh, then swear, then finally trust the numbers.
Three of the missing points were my fault, not the model's.
First, I had left the old v1 embedder as the default even though v1.5 had been out for ages.
Second, I was not applying the BGE query-prefix convention, the one basically every published benchmark number for this family relies on, which meant I was comparing my unprefixed queries against everybody else's properly configured runs and wondering why the curve looked soft.
Third, sqlite-vec was optional in package.json, so a default install could silently fall back to the slow path. On top of that, a rowid coercion bug in that path was quietly dropping inserts.
That last one is especially offensive. There is something very humbling about losing score because a bug is politely throwing away part of your data while the rest of the system continues to behave as if everything is fine.
You really can ship a system for twelve weeks with a one-line bug that costs you twenty percent on the scoreboard.
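For reference, the prefix convention is the query instruction the bge-*-en-v1.5 model cards specify for retrieval: queries get a fixed instruction string before embedding, documents do not. A sketch of the asymmetry, with the encoder mocked out:

```python
# The documented query instruction for the bge-*-en-v1.5 family.
# Documents are embedded as-is; only queries get the prefix.
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def embed_query(text: str, encode):
    """Prefix the query per the BGE convention, then hand it to the encoder."""
    return encode(BGE_QUERY_PREFIX + text)

def embed_document(text: str, encode):
    """Documents are embedded without the instruction."""
    return encode(text)

# With a stand-in encoder you can at least assert the asymmetry:
seen = []
fake_encode = lambda s: seen.append(s) or [0.0]
embed_query("what is the capital of France", fake_encode)
embed_document("Paris is the capital of France", fake_encode)
assert seen[0].startswith(BGE_QUERY_PREFIX)
assert not seen[1].startswith(BGE_QUERY_PREFIX)
```

Skipping the prefix does not crash anything, which is exactly why it survives for weeks: the embeddings are still plausible, just measurably worse.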
Fixing those three things in v0.13.0 moved SciFact from 0.679 to 0.744, FiQA from 0.325 to 0.380, and query latency from 2.9 seconds to 91 milliseconds.
None of these were exotic research problems. They were all normal engineering mistakes. Configuration drift. Missing conventions. Install-path sloppiness. The kind of bugs you only find once you stop admiring the architecture and start measuring the actual system.
That is probably the most reassuring thing about this week. The improvements did not come from me inventing a new retrieval theory. They came from making the implementation stop lying.
## What's still missing
It is not done.
FiQA is still below the published bge-base-en-v1.5 baseline, and I think I know why. Right now the hybrid stack weights BM25 and vector scores evenly. On FiQA that is a bad bargain, because formal finance answers and colloquial user questions often do not share much vocabulary. BM25 is the weaker stream there, so fusing it 50/50 drags the combined score down. A weighting gate would probably buy back another point or two.
LongMemEval has a different weakness. The weak categories are single-session-preference and single-session-user, where Hit@1 is only around 0.65. Same pattern in both cases: the fact I need is one sentence inside a longer session, and the session-level embedding smooths it into mush.
The fix there is probably LLM-based fact extraction. Pull the atomic facts out, store them separately, and stop pretending a whole conversation should have one vector and one meaning. Zep and Mem0 both do some version of this. I don't, yet.
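The shape of that fix is a session-to-facts split: each extracted fact gets its own row and its own vector instead of hiding inside one session embedding. A hypothetical sketch with the LLM call stubbed out (the names and prompt are mine, not an existing API):

```python
from dataclasses import dataclass

@dataclass
class Fact:
    session_id: str
    text: str  # one atomic statement, embedded on its own at ingest time

def extract_facts(session_id: str, transcript: str, llm) -> list:
    """Ask an LLM for atomic facts; one vector per fact, not one per session."""
    lines = llm(f"List the atomic user facts stated in:\n{transcript}")
    return [Fact(session_id, line) for line in lines]

# Stub standing in for a real model call.
stub_llm = lambda prompt: [
    "user prefers dark mode",
    "user's deadline is Friday",
]
facts = extract_facts("s-42", "long transcript ...", stub_llm)
print(len(facts))     # 2
print(facts[0].text)  # user prefers dark mode
```

Retrieval then runs over fact rows, with the session ID kept around so the agent can pull the surrounding context when a fact hits.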
Neither of those fixes is tiny. Both are real work.
But they feel like release work now. Tuning work. Product-shaping work.
That is a much better class of problem than, "surprise, your install path is dropping rows."
## Why this feels closer to release
That is really what this week changed for me.
Before, Context Fabric still felt half like a promising system and half like a collection of strong ideas I had not embarrassed properly in public.
Now it feels more solid than that.
Not finished. Not polished enough to call done. But close enough that the remaining gaps are legible.
I know where the retrieval is weak. I know which metrics are trailing and why. I know what parts of the architecture are holding up under measurement. I know the local-first argument is no longer just philosophical. It is showing up in actual benchmark numbers.
That is what "closer to release" means to me.
Not that there are no bugs left. There are always bugs left.
It means the system has stopped feeling mysterious.
## Reproducing it
The benchmark harness is in the repo. No API keys, nothing leaves the machine you're on.
```bash
git clone https://github.com/Abaddollyon/context-fabric.git
cd context-fabric && npm install && npm run build
scripts/bench-public.sh download scifact fiqa longmemeval_s
scripts/bench-gpu.sh bench:beir:scifact
```
Full sweep was about twenty minutes of wall time on an RTX 3060.
That part matters to me too. If the only way to validate a system is by burning paid credits against somebody else's endpoint, then you do not really own the engineering loop.
## Where I land
One SQLite file. A 33-million-parameter embedding model. A local stack that is a few points short of OpenAI on ranking, ahead on deep recall, and already competitive enough that the trade-offs are finally interesting instead of theoretical.
That is a good place to be twelve weeks in.
More importantly, it feels like the kind of place you can actually release from.
Not because it is perfect.
Because the remaining work looks like the work you do after the core starts holding.