eifachposte


Just about every day someone complains about Anthropic's token usage, and since I no longer overrun my plan, I thought this approach might help someone.

My first instinct working on a 559,000-line legacy Java codebase was to throw subagents at the problem. Spawn an Explore agent, let it read fifty files, get back a summary. The trouble is that subagents run on the same model as the main conversation, which means the same Anthropic tokens, and often more of them once you count the subagent reading the files, writing the summary, and the summary then entering the parent context. The savings I imagined were not there.

I didn't see a real cost shift until I started collaborating across multiple models. The general idea is to delegate tasks that don't require extensive thought, such as summarizing what's in Java files, to cheaper, faster models via one or more MCP servers. As it turns out, I'm also convinced this led to fewer mistakes and software that works almost all of the time.

The specific MCP server you use to delegate to other models doesn't matter. I use an old MCP server called PAL (renamed from Zen) to route to Gemini, GPT, and DeepSeek or Qwen, so a fifty-file code review or architecture analysis genuinely does not affect my Anthropic subscription. Even paying API rates for the other models, it's cheaper, especially if you pick the appropriate model for each task: fast and cheap models for code exploration, the frontier models for plan reviews, and the slightly older but still good frontier models for code reviews. I'm not saying everybody should use this particular MCP server, since newer ones that do similar things are coming out every day.
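The "appropriate model for the task" idea above can be written down as a small routing table. This is only a sketch: the task names and tier labels are hypothetical placeholders, not PAL's configuration or API, and no specific model IDs are filled in because the post does not name any.

```python
# Hypothetical routing table: task type -> model tier.
# Tier labels are illustrative assumptions, not real model names.
ROUTES = {
    "explore": "fast-cheap",             # reading and summarizing source files
    "summarize": "fast-cheap",
    "plan_review": "frontier",           # plan reviews get the strongest model
    "code_review": "previous-frontier",  # slightly older but still good
}


def pick_tier(task: str) -> str:
    """Return the model tier for a task.

    Anything not listed stays on the main conversation model ("main"),
    so only work that was explicitly marked as delegable is routed away.
    """
    return ROUTES.get(task, "main")


if __name__ == "__main__":
    for task in ("explore", "plan_review", "code_review", "write_feature"):
        print(f"{task} -> {pick_tier(task)}")
```

Defaulting unknown tasks to the main model is a design choice in this sketch: delegation is opt-in, so nothing gets sent to a cheaper model by accident.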
Actions that are outsourced:

- codereview: mandatory before building significant changes per CLAUDE.md
- precommit: mandatory before commits
- debug: investigation and fixes
- thinkdeep: code analysis spanning >3 files
- consensus: architecture decisions
- analyze: general code analysis
- refactor: refactoring proposals
- testgen: test generation
- docgen: documentation generation
- apilookup: API/SDK questions
- tracer: execution tracing
- chat: brainstorming
- challenge: adversarial critique

Beyond that, the techniques that move the needle are unglamorous: grep before reading, pull file windows with offset and limit instead of whole files, pipe build output through grep and tail before it ever reaches the conversation, and keep a memory system that preserves what was already figured out so the next session does not re-derive it from source. They require discipline about what enters the context window in the first place, but luckily, all of these practices are in the CLAUDE.md and memory files, so I never have to remember.

Subagents still earn their keep, just not for the reason most people give. They isolate the main conversation from a thousand-line file or a noisy search result, which keeps the cache warm and the context window navigable across long sessions. That is real value. It just is not cheaper. If the goal is to get the most from your Anthropic subscription, route the heavy lifting to other models.

And here is my justification for the advantages of using multiple models for every single stage of a project. I initially started to use multiple models to increase the accuracy rate; token reduction on Anthropic was just a useful side effect.

https://czei.org/blog/multi-llm-spec-driven-development/

submitted by /u/czei
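For anyone who wants to see the context-hygiene habits from the post (grep before reading, read a window instead of the whole file, filter build output before it reaches the conversation) in concrete form, here is a rough sketch. The function names are made up for illustration; in Claude Code these steps are done with the built-in search and read tools, not with this code.

```python
import re
from itertools import islice


def grep_lines(path, pattern):
    """Return (line_number, text) pairs for lines matching pattern (1-indexed)."""
    rx = re.compile(pattern)
    with open(path, encoding="utf-8", errors="replace") as fh:
        return [
            (number, line.rstrip("\n"))
            for number, line in enumerate(fh, start=1)
            if rx.search(line)
        ]


def read_window(path, offset, limit):
    """Return at most `limit` lines starting at 1-indexed line `offset`."""
    start = max(offset - 1, 0)
    with open(path, encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in islice(fh, start, start + limit)]


def filter_output(text, pattern, tail=20):
    """Keep only lines matching pattern, then the last `tail` of them.

    Roughly what piping build output through grep and then tail does.
    """
    rx = re.compile(pattern)
    hits = [line for line in text.splitlines() if rx.search(line)]
    return hits[-tail:] if tail > 0 else []
```

The intended flow is to call `grep_lines` first to find where something lives, then `read_window` around the hit, so only a few dozen lines enter the context instead of the whole file.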

Originally posted by u/czei on r/ClaudeCode

Reducing Claude Code Token Usage
