Original Reddit post

I’ve been experimenting with a way to handle extremely long coding sessions with Claude without hitting the 200k context limit or triggering the “lossy compression” (compaction) that happens when conversations get too long. I developed a VS Code extension called Damocles and implemented a feature called “Distill Mode.” Technically speaking, it’s a local RAG (Retrieval-Augmented Generation) approach, but instead of using vector embeddings, it uses stateless queries with BM25 keyword search. I thought the architecture was interesting enough to share, specifically regarding how it handles hallucinations.

**The problem with standard context**

Every time you send a message to Claude, the API resends your entire conversation history. Eventually you hit the limit, and the model starts compacting earlier messages. This often leads to the model forgetting instructions you gave it at the start of the chat.

**The solution: “Distill Mode”**

Instead of replaying the whole history, this workflow:

1. **Runs each query stateless** — no prior messages are sent.
2. **Summarizes via Haiku** — after each response, Haiku (Claude’s fast model, called via the API) writes structured annotations about the interaction to a local SQLite database.
3. **Injects context** — before your next message, it searches those notes for relevant entries and injects roughly 4k tokens of context.

This means you never hit the context window limit. Your session can be 200 messages long, and the model still receives relevant context without the noise.

**Why BM25? (The retrieval mechanism)**

Instead of vector search, this setup uses BM25 — the same ranking algorithm behind Elasticsearch and most search engines. It works via an FTS5 full-text index over the local SQLite entries. Why this works for code: it uses Porter stemming (so “refactoring” matches “refactor”) and downweights common stopwords while prioritizing rare, specific terms from your prompt.

**Expansion passes** — it doesn’t just grab the keyword match; it also pulls in:

- **Related files** — if an entry references other files, entries from those files in the same prompt are included.
- **Semantic groups** — Haiku labels related entries with a group name (e.g. “authentication-flow”); if one group member is selected, up to 3 more from the same group are pulled in.
- **Cross-prompt links** — during annotation, Haiku tags relationships between entries across different prompts (`depends_on`, `extends`, `reverts`, `related`). When reranking is enabled, linked entries are pulled in even if BM25 didn’t surface them directly.

All of this is bounded by the token budget — entries are added in rank order until the budget is full. (Sketches of the annotation and retrieval passes follow after the configuration section below.)

**Reducing hallucinations**

A major benefit I noticed is the reduction in noise. In standard mode, the context window accumulates raw tool outputs — file reads, massive grep outputs, bash logs — most of which are no longer relevant by the time you’re 50 messages in. Even after compaction kicks in, the lossy summary can carry forward noisy artifacts from those tool results. With the “Distill” approach, only curated, annotated summaries are injected. The signal-to-noise ratio is much higher, which keeps Claude from hallucinating based on stale tool outputs.

**Configuration**

If anyone else wants to try Damocles or build a similar local-RAG setup, here are the settings I’m using:
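In rough shape, they look something like this minimal TypeScript sketch. The key names (`contextTokenBudget`, `groupExpansionLimit`, and so on) are illustrative placeholders, not the extension’s real configuration keys:

```typescript
// Hypothetical shape of the Distill Mode knobs described above.
// All names here are placeholders, not Damocles's actual setting keys.
interface DistillConfig {
  enabled: boolean;                // stateless queries instead of replaying history
  annotationModel: string;         // fast model that writes notes after each turn
  contextTokenBudget: number;      // max tokens of retrieved notes injected per message
  groupExpansionLimit: number;     // extra entries pulled in per semantic group
  followCrossPromptLinks: boolean; // pull in linked entries when reranking
}

const config: DistillConfig = {
  enabled: true,
  annotationModel: "haiku",   // placeholder model id
  contextTokenBudget: 4000,   // "roughly 4k tokens" per message
  groupExpansionLimit: 3,     // "up to 3 more from the same group"
  followCrossPromptLinks: true,
};
```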
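To make the note-taking half concrete, here’s a minimal sketch of the kind of schema this implies, using better-sqlite3 in TypeScript. The table layout and the `annotateWithHaiku` helper are stand-ins of mine, not Damocles’s actual code:

```typescript
import Database from "better-sqlite3";

// FTS5 table with Porter stemming, so a query for "refactoring"
// matches a stored note that says "refactor".
const db = new Database("distill-notes.db");
db.exec(`
  CREATE VIRTUAL TABLE IF NOT EXISTS notes
  USING fts5(summary, files, sem_group, links, tokenize = 'porter')
`);

// Hypothetical annotation result; the real schema is presumably richer.
interface TurnNote {
  summary: string; // structured annotation of the exchange
  files: string[]; // file paths the turn touched
  group: string;   // semantic group label, e.g. "authentication-flow"
  links: string[]; // cross-prompt relations, e.g. "depends_on:12"
}

// Stand-in for the API call that has Haiku write the annotation.
declare function annotateWithHaiku(prompt: string, response: string): Promise<TurnNote>;

// After each model response, persist the curated note locally.
async function recordTurn(prompt: string, response: string): Promise<void> {
  const note = await annotateWithHaiku(prompt, response);
  db.prepare(
    "INSERT INTO notes (summary, files, sem_group, links) VALUES (?, ?, ?, ?)"
  ).run(note.summary, note.files.join(" "), note.group, note.links.join(" "));
}
```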
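And the retrieval half, continuing the sketch above: a BM25 query over the FTS5 index plus a simplified semantic-group expansion, filling the token budget in rank order. A real implementation would also sanitize the MATCH string and follow the cross-prompt links; treat this as an illustration, not the extension’s actual logic:

```typescript
function retrieveContext(query: string, tokenBudget = 4000): string {
  // Top BM25 matches: in FTS5, `rank` is the built-in BM25 score and
  // ORDER BY rank returns the best matches first.
  const hits = db
    .prepare(
      "SELECT rowid, summary, sem_group FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT 50"
    )
    .all(query) as { rowid: number; summary: string; sem_group: string }[];

  const picked = new Map<number, string>();
  let used = 0;
  const tokens = (s: string) => Math.ceil(s.length / 4); // crude chars-to-tokens estimate

  // Add an entry only if it fits the remaining budget (a hard bound).
  const take = (rowid: number, summary: string): boolean => {
    if (picked.has(rowid) || used + tokens(summary) > tokenBudget) return false;
    picked.set(rowid, summary);
    used += tokens(summary);
    return true;
  };

  for (const hit of hits) {
    if (!take(hit.rowid, hit.summary)) continue;
    // Expansion pass: up to 3 more entries that share this entry's
    // semantic group, even if BM25 didn't surface them directly.
    const siblings = (db
      .prepare("SELECT rowid, summary FROM notes WHERE sem_group = ?")
      .all(hit.sem_group) as { rowid: number; summary: string }[])
      .filter((g) => !picked.has(g.rowid))
      .slice(0, 3);
    siblings.forEach((g) => take(g.rowid, g.summary));
  }
  return [...picked.values()].join("\n---\n"); // injected before the next message
}
```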
**Trade-offs**

- If the search misses the right context, Claude effectively has amnesia for that turn. Normal mode guarantees the model sees everything (until compaction kicks in and it doesn’t).
- There’s a slight delay after each response while Haiku annotates the notes via the API.
- For short conversations, normal mode is fine and simpler.

**TL;DR**

Normal mode resends everything and eventually compacts, losing context. Distill mode keeps structured notes locally, searches them per-message via BM25, and never compacts. Use it for long sessions.

Has anyone else tried using BM25/keyword search over vector embeddings for maintaining long-term context? I’m curious how it compares to standard vector RAG implementations.

Originally posted by u/Aizenvolt11 on r/ClaudeCode