Original Reddit post

This is the Claude Code session that finally made me build it. Day 23. I'd spent 3 sessions tuning RRF retrieval weights, landed on 0.5 BM25 / 0.3 vector / 0.2 graph, written tests, moved on. Sunday morning I open a fresh session and ask Claude Code to add a feature. Within 12 turns it proposes a different weighting scheme. With confidence. Like the previous decisions never happened.

That was the moment. I was building a memory layer for Claude Code, using Claude Code, and the exact problem I was solving was happening to me in real time, in the codebase that was supposed to fix it.

**What it is**

Memtrace is an AST-powered memory layer that ships as an MCP server. The benchmark harness is open. It auto-registers with Claude Code, Cursor, Codex, and any other MCP-ready client. Two things it does that nothing else I've tried does cleanly:

- **Always-fresh structural memory.** Every file save triggers an incremental snapshot in the low tens of milliseconds. Before a refactor, the agent queries the call graph for the blast radius (callers, tests, consumers; sketch after the next section) before it writes a line.
- **Rewind.** The graph is bi-temporal: every node and edge carries valid_at AND invalid_at timestamps. "What did getUserById look like Monday, before the broken commit?" is a real query, not a git-blame heuristic. When the agent debugs a regression it replays how the function got to its current shape instead of guessing from *now*.

**The architectural bet**

Most memory tools (Mem0, Graphiti) call an LLM during indexing. They want it to "extract" structure from chunks. Cost: 31 minutes for 1,500 files in the comparison runs I did, plus per-token API spend, and you can't afford to re-index on every edit. So your memory is always one session stale.

I pulled the LLM out of the indexing path entirely. Tree-sitter parses 20+ languages into an AST. The AST is the structural representation. You don't pay an LLM to re-derive what your compiler already knows.

Retrieval is hybrid at query time: Tantivy BM25 (lexical) plus Jina-code 768-dim embeddings via HNSW (semantic), fused with RRF at k=60. Jina-code matters here because it's trained on code, not generic prose. The semantic leg actually understands "this is an auth handler" instead of pattern-matching on the word "auth".

The bet pays off in one place: you can afford to snapshot every edit because there's no per-token cost. Indexing is bottlenecked by disk I/O, not OpenAI. That's what makes always-fresh memory actually work. The "rewind" feature is just what falls out of having a bi-temporal graph that stays current.
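To make the fusion step concrete, here's a minimal sketch of weighted RRF in Python. This is just the textbook formula with my weights plugged in; the function name and the toy result lists are made up for illustration and aren't Memtrace's actual scorer.

```python
# Minimal sketch of weighted Reciprocal Rank Fusion (RRF).
# Names and the toy result lists are illustrative, not Memtrace's API.

def rrf_fuse(ranked_lists: dict[str, list[str]],
             weights: dict[str, float],
             k: int = 60) -> list[str]:
    """Fuse per-retriever rankings: score(d) = sum_i w_i / (k + rank_i(d))."""
    scores: dict[str, float] = {}
    for retriever, docs in ranked_lists.items():
        w = weights[retriever]
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(
    {
        "bm25":   ["auth.rs", "login.rs", "db.rs"],   # Tantivy lexical hits
        "vector": ["login.rs", "auth.rs", "jwt.rs"],  # HNSW over Jina-code embeddings
        "graph":  ["auth.rs", "jwt.rs"],              # call-graph neighbors
    },
    weights={"bm25": 0.5, "vector": 0.3, "graph": 0.2},
)
print(fused)  # ['auth.rs', 'login.rs', 'jwt.rs', 'db.rs']
```

The k=60 in the denominator is what keeps any single retriever's top hit from dominating: rank differences get damped, so documents that multiple legs agree on float up.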
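And the blast-radius query mentioned in the "What it is" section is, at its core, a reverse walk over call edges. Another hedged sketch: the symbol names and the hard-coded adjacency data are invented, and the real index obviously isn't a dict.

```python
# Hypothetical sketch of a "blast radius" query over a call graph:
# given a symbol, walk reverse call edges to find everything that
# could break if it changes. All names here are invented.
from collections import defaultdict, deque

callers = defaultdict(set)  # callee -> set of callers (reverse call edges)
for src, dst in [
    ("handle_login", "get_user_by_id"),
    ("handle_signup", "get_user_by_id"),
    ("test_get_user", "get_user_by_id"),
    ("api_router", "handle_login"),
]:
    callers[dst].add(src)

def blast_radius(symbol: str) -> set[str]:
    """Transitive set of callers/tests/consumers of `symbol`."""
    seen, queue = set(), deque([symbol])
    while queue:
        node = queue.popleft()
        for caller in callers[node] - seen:
            seen.add(caller)
            queue.append(caller)
    return seen

print(sorted(blast_radius("get_user_by_id")))
# ['api_router', 'handle_login', 'handle_signup', 'test_get_user']
```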
**How Claude Code helped (and where it didn't)**

Honest version, since the sub asks:

- RRF scoring was ~30 iterations of "I described a trade-off, Claude Code proposed three weight schemes with rationales, I ran benchmarks, pasted results, repeated." Wouldn't have converged in a weekend without it.
- Tree-sitter bindings for 20 grammars worked first-pass for most of them. I haven't hand-written a cursor traversal in months.
- Back when I was still on Memgraph as the storage layer, Claude Code defaulted to Neo4j's Cypher dialect constantly. Subtle differences. I gave up correcting it in-session and put the rules in CLAUDE.md. Worked. (Different story now that the storage is MemDB.)
- The bi-temporal edge cases were the one place Claude Code couldn't shortcut: valid_at for a function whose signature changed mid-session, working-tree episodes for files saved but not yet committed. I had to think and type those myself.
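For concreteness, here's the basic shape of the point-in-time query those edge cases complicate. The dataclass schema and the as_of helper are illustrative assumptions, not Memtrace's internals, but the valid_at <= t < invalid_at filter is the whole trick.

```python
# Minimal sketch of the bi-temporal "rewind" idea: every node version
# carries valid_at / invalid_at, and "as of time T" is a filter, not a
# git-blame heuristic. Schema and names are illustrative, not Memtrace's.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class NodeVersion:
    symbol: str
    body: str
    valid_at: datetime           # when this shape became current
    invalid_at: datetime | None  # None = still current

history = [
    NodeVersion("getUserById", "def getUserById(id): ...",
                datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 14, 0)),
    NodeVersion("getUserById", "def getUserById(id, *, cache): ...",  # broken commit
                datetime(2025, 1, 6, 14, 0), None),
]

def as_of(symbol: str, t: datetime) -> NodeVersion | None:
    """Return the version of `symbol` that was valid at time t."""
    for v in history:
        if (v.symbol == symbol and v.valid_at <= t
                and (v.invalid_at is None or t < v.invalid_at)):
            return v
    return None

monday_noon = datetime(2025, 1, 6, 12, 0)
print(as_of("getUserById", monday_noon).body)  # the pre-breakage shape
```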
When the beta binary was ready I pointed it at Memtrace's own codebase. The Sunday-morning weight confusion stopped. The blind refactor suggestions stopped. The dogfood test wasn't a marketing exercise; it was the only proof I had that the bet was real.

**About the waitlist**

I'll be honest: I've been a bit crushed since putting this out a week ago. There are 200+ people waiting, and yes, you currently need an approval key to run the binary. I'm throttling onboarding to 50 users a week. It's not artificial scarcity or a velvet rope; it's just my actual human limit for guaranteeing a 24-hour bug-fix turnaround.

The reality is that new users throw structural curveballs my MacBook never would have surfaced. In the last two weeks alone: an ONNX cold-start issue on Windows (8 seconds vs 600 ms on macOS, fixed Friday), a gitignore parser breaking on backslash paths (a twelve-line fix, two hours to find), and Python TYPE_CHECKING blocks confusing the import resolver. Every one came from a real user; none were in my test corpus.
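For anyone who hasn't hit the TYPE_CHECKING one, this is the pattern (with made-up module names) that the resolver has to special-case: the import is real in the AST but never executes.

```python
# Imports inside a TYPE_CHECKING block exist for type checkers but never
# at runtime, so an indexer has to record the edge without treating it as
# a runtime dependency. Module names here are invented.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from myapp.models import User  # visible in the AST, absent at runtime

def get_user_by_id(user_id: int) -> "User":
    ...
```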
Lua and Swift parsers are shipping today specifically because two beta users asked for them and stayed on Discord long enough to test the first builds. That's the tight feedback loop I'm protecting. If I opened the floodgates to 5,000 users right now, I'd watch the first impression die for 4,500 of them. I'd rather have 500 users for whom this is magic than 50,000 for whom it's broken.

The benchmark harness is fully open and runnable without the binary, if you want to verify the claims before joining the queue.

**What I actually want from this thread**

Comment with a real failure mode you've hit with Claude Code (or Cursor / Codex / Gemini CLI): the specific thing it forgets, the bug that ruins your flow. Three things I'm specifically hunting:

1. The Claude Code failure that hurts most in your daily loop. The specific thing it forgets. Naming the symptom is the whole gift.
2. Would open-sourcing the AST indexer (with retrieval and the temporal engine staying closed) change your decision to adopt? Honest answers, especially harsh ones.
3. If you've used Mem0, Graphiti, claude-mem, or any other agent-memory tool: what broke? I want the war stories before I claim my approach beats them.

Roast the architecture. Roast the gate. Roast the queue-jumping mechanic. Roast me.

Originally posted by u/WEEZIEDEEZIE on r/ClaudeCode