Like a lot of you, I noticed my Anthropic bill got way worse in March. After some digging (as many of you likely know), I learned that Anthropic quietly dropped the default prompt-cache TTL from 1 hour to 5 minutes. A cache miss re-bills the entire conversation prefix at fresh-input rates, so under a 5-min TTL:
- any idle gap >5 min triggers a full re-bill on your next message
- resuming a session after >1h is a near-guaranteed full re-bill
- a 1M-context session multiplies the miss cost up to 5x vs 200k

I went through my own JSONL logs in ~/.claude/projects/ and confirmed it: 67–80% of my cache-write tokens were landing in the 5-minute tier. I was literally paying to write a cache that expired before it could ever be reused. It added up to ~$2,700 in overages in that single month.

True, Anthropic’s limits got way better in April (the day the SpaceX compute deal was announced). But I realized I couldn’t be so susceptible to Anthropic’s whims on billing and the fluctuations in their capacity, not when this tool has become core to almost all of my workflows. I vowed I was never going to spend on Anthropic overages ever again. So I stopped. But that meant I was getting a lot less done! And that’s not how I wanted to work.

So, instead of just downgrading to a cheaper model and resigning myself to working less, I spent weeks learning the various levers to minimize and optimize token consumption, and then working with Opus to design and build a system around one principle I call Use Opus Less to Use Opus More: Opus is the most valuable worker you have, and most of what a session spends it on (file-finding, codegen from a spec, mechanical edits, log-grepping) doesn’t need it. Route that stuff to cheaper/free labor and you preserve Opus budget for the reasoning and judgment that actually need it.

It breaks into three parts:

Make Claude Code itself cheaper: restore the 1h cache tier (ENABLE_PROMPT_CACHING_1H=1), a hook that halts before any CLAUDE.md edit (those silently invalidate your whole cached prefix and re-bill the rest of the session), a 200k/1M context toggle, auto-compaction tuning, and per-project pruning of MCP servers/skills (every one you load is in the cached prefix on every turn).

Delegate by default: a cost-ordered ladder of local models (free) → Codex → DeepSeek (~1/20th of Anthropic) → Anthropic only as a last resort.
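To make the ladder idea concrete, here is a minimal sketch of "try the cheapest rung first, fall through to Anthropic last". Only the rung order comes from the post. The `Task` and `Rung` types, the task kinds, and the accept rules are all hypothetical, invented for illustration, and a real router would call actual backends rather than return a name.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    kind: str             # hypothetical labels, e.g. "log_grep", "codegen_from_spec"
    needs_judgment: bool  # True when the work needs top-tier reasoning


@dataclass(frozen=True)
class Rung:
    name: str
    accepts: Callable[[Task], bool]


# Task kinds treated as purely mechanical (an assumption for this sketch).
MECHANICAL = {"file_search", "log_grep", "mechanical_edit"}

# Rungs are listed cheapest first, in the order the post gives.
# The accept rules are made up; tune them to your own workload.
LADDER = [
    Rung("local", lambda t: t.kind in MECHANICAL),
    Rung("codex", lambda t: t.kind == "codegen_from_spec"),
    Rung("deepseek", lambda t: not t.needs_judgment),
    Rung("anthropic", lambda t: True),  # last resort: accepts anything
]


def route(task: Task) -> str:
    """Return the name of the first (cheapest) rung that accepts the task."""
    for rung in LADDER:
        if rung.accepts(task):
            return rung.name
    raise RuntimeError("ladder has no catch-all rung")


if __name__ == "__main__":
    for task in (
        Task("log_grep", needs_judgment=False),
        Task("codegen_from_spec", needs_judgment=False),
        Task("architecture_review", needs_judgment=True),
    ):
        print(task.kind, "->", route(task))
```

Keeping the ladder as an ordered list means the cost ordering lives in one place, so adding or reordering a rung does not touch the routing logic.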
Delegation also follows two mechanical “bright line” rules: scout-before-read, and code-from-spec always delegates.

Infrastructure that makes it automatic: a local fleet on a couple of Macs (I landed on llama.cpp over ollama/MLX after an actual bake-off; cancel-on-disconnect with slot reclaim was the deciding factor, not raw speed), per-machine brokers with a priority lane, a routing proxy, and a single registry so nothing drifts.

Results so far:
- cache writes in the wasteful 5-min tier: 67–80% → 0%
- overages: ~$2,700/mo → $0
- concurrent projects I can run inside the same 5-hour limits: 1 → 3–4

The big thing: none of this system is all-or-nothing. The cache flag is one line and the single biggest win, and the CLAUDE.md edit guardrails are also huge (if you don’t know that Claude updated your CLAUDE.md, you can literally waste millions of extra tokens in a session without realizing it). Most of the big savings need no extra hardware at all. The local fleet is the last rung, not the entry point.

I wrote the whole thing up in detail (functional + technical layers, adopt-piece-by-piece guide, the full benchmark table): https://crosswi.re/claude-code-cost-optimization

Happy to answer questions. Curious how many of you got hit by the TTL change and what you did about it.

To be absolutely clear: this is NOT a promotion. I’m not selling anything, and I’m not even pointing to a GitHub repo. This is purely a writeup of how I personally addressed the challenge of optimizing my token consumption and making a really significant difference in my overall velocity.
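For anyone who wants to check their own numbers, the log check mentioned earlier (what share of cache-write tokens land in the 5-minute tier) can be sketched roughly like this. It assumes each JSONL line under ~/.claude/projects/ may carry a `message.usage.cache_creation` object with `ephemeral_5m_input_tokens` and `ephemeral_1h_input_tokens` counters, which are the field names the Anthropic API uses in its usage block. Whether your Claude Code version writes them in exactly this shape is an assumption, so spot-check a few lines first. The function names are mine, and it sums raw entries without de-duplicating repeated records.

```python
import json
from pathlib import Path
from typing import Iterable


def tier_totals(lines: Iterable[str]) -> dict:
    """Sum cache-write tokens per TTL tier across JSONL lines."""
    totals = {"5m": 0, "1h": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines
        # Assumed shape: entry["message"]["usage"]["cache_creation"]
        node = entry
        for key in ("message", "usage", "cache_creation"):
            node = node.get(key) if isinstance(node, dict) else None
        if not isinstance(node, dict):
            continue
        totals["5m"] += int(node.get("ephemeral_5m_input_tokens") or 0)
        totals["1h"] += int(node.get("ephemeral_1h_input_tokens") or 0)
    return totals


def share_5m(totals: dict) -> float:
    """Fraction of cache-write tokens that went to the 5-minute tier."""
    written = totals["5m"] + totals["1h"]
    return totals["5m"] / written if written else 0.0


def scan(root: Path) -> dict:
    """Aggregate tier totals over every .jsonl file below root."""
    totals = {"5m": 0, "1h": 0}
    for path in root.rglob("*.jsonl"):
        with path.open(encoding="utf-8", errors="replace") as fh:
            part = tier_totals(fh)
        totals["5m"] += part["5m"]
        totals["1h"] += part["1h"]
    return totals


if __name__ == "__main__":
    totals = scan(Path.home() / ".claude" / "projects")
    print(f"5m tier: {totals['5m']:,} tokens")
    print(f"1h tier: {totals['1h']:,} tokens")
    print(f"share in 5m tier: {share_5m(totals):.0%}")
```

If the share printed at the end is high, most of your cache writes are going to the short-lived tier, which is the situation described at the top of the post.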
Originally posted by u/ShieldYourBody on r/ClaudeCode
