Original Reddit post

With the new session limit changes and the 1M context window, a lot of people are confused about why longer sessions eat more usage. I've been tracking token flows across my Claude Code sessions, and a key piece most folks aren't aware of is the 5-minute cache TTL.

Every message you send in Claude Code re-sends the entire conversation to the API. There's no memory between messages. Message 50 sends all 49 previous exchanges before Claude starts thinking about your new one. Message 1 might be 14K tokens; message 50 is 79K+.

Without caching, a 100-turn Opus session would cost $50-100 in input tokens. That would bankrupt Anthropic on every Pro subscription. So they cache. Cached reads cost 10% of the normal input price: $0.50 per million tokens instead of $5. A $100 Opus session drops to ~$19 with a 90% hit rate.

Someone on this sub wired Claude Code into a dedicated vLLM instance and measured it: 47 million prompt tokens, 45 million cache hits, a 96.39% hit rate. Out of 47M tokens sent, the model only did real work on 1.6M. Caching works.

So why do long sessions cost more? Most people assume it's because Claude "re-reads" more context each message. But re-reading cached context is cheap. 90% off is 90% off. The real cost is cache busts from the 5-minute TTL.

The cache expires after 5 minutes of inactivity, and each hit resets the timer. If you're sending messages every couple of minutes, the cache stays warm forever. But pause for six minutes and the cache is evicted. Your next message pays full price.

Actually, worse than full price. Cache writes on Opus cost $6.25/MTok, 25% more than the normal $5/MTok, because you're paying for VRAM allocation on top of compute. One cache bust at 100K tokens of context costs ~$0.63 just for the write. At 500K tokens (easy to hit with the new 1M window), that's ~$3.13. Same coffee break, 5x the bill.

Now multiply that across a marathon session. You're working for hours and hit 5-10 natural pauses longer than five minutes. Each pause re-processes an ever-growing conversation at full price. This is why marathon sessions destroy your limits: each cache bust re-processes hundreds of thousands of tokens at 125% of normal input cost. The 1M context window makes it worse. Before, sessions compacted around 100-200K; now you run longer, accumulate more context, and each bust hits a bigger payload.

There are also things that bust your cache you might not expect. The cache matches from the beginning of your request forward, byte for byte. Put something like a timestamp in your system prompt and your system prompt will never be cached. Adding or removing an MCP tool mid-session also breaks it: tool definitions are part of the cached prefix, so change them and every previous message gets re-processed. Same with switching models. Caches are per-model; Opus and Haiku can't share a cache because each model computes the KV matrices differently.

So what do you do?

- Start fresh sessions for new tasks. Don't keep one running all day.
- If you're stepping away for more than five minutes, start new when you come back.
- Run /compact before a break: a smaller context means a cheaper cache bust if the TTL expires.
- Don't add MCP tools mid-session.
- Don't put timestamps at the top of your system prompt.

Understanding this one mechanism is probably the most useful thing you can do to stretch your limits. I wrote a longer piece with API experiments and actual traces here.
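
To put numbers on the hit-rate math above, here's a quick back-of-envelope script. The per-MTok rates are the ones quoted in this post, not official pricing, and the 20 MTok session size is just picked so the uncached cost comes out to $100:

```python
# Back-of-envelope input-token cost model, using the per-MTok rates quoted
# in this post ($5 input, $0.50 cache read). Illustrative, not official pricing.
INPUT_RATE = 5.00       # $/MTok, uncached input
CACHE_READ_RATE = 0.50  # $/MTok, cached read (10% of input)

def session_cost(total_mtok: float, hit_rate: float) -> float:
    """Input-side cost when `hit_rate` of all prompt tokens are cache reads.
    Ignores the 25% cache-write premium, so real sessions land a bit higher."""
    cached = total_mtok * hit_rate
    return cached * CACHE_READ_RATE + (total_mtok - cached) * INPUT_RATE

print(session_cost(20, 0.0))     # 100.0 -- the uncached nightmare scenario
print(session_cost(20, 0.90))    # 19.0  -- the ~$19 figure above
print(session_cost(20, 0.9639))  # ~13.25 at the measured 96.39% hit rate
```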
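
And the cache-bust side: a sketch of what TTL evictions cost across a marathon day. The cache-write rate is the post's quoted Opus figure; the pause pattern and context sizes are invented for illustration:

```python
# What TTL evictions cost over a long day. Every pause past the 5-minute TTL
# evicts the cache, so the next message re-writes the whole conversation
# prefix at the cache-write rate.
CACHE_WRITE_RATE = 6.25  # $/MTok (125% of the $5/MTok input rate quoted above)

def bust_cost(context_tokens: int) -> float:
    """Cost of re-writing the full context to cache after one eviction."""
    return context_tokens / 1_000_000 * CACHE_WRITE_RATE

print(bust_cost(100_000))  # 0.625 -- the post's ~$0.63 coffee-break example
print(bust_cost(500_000))  # 3.125 -- same break with a 1M-window-sized context

# Eight over-TTL pauses across a marathon session, context growing all day
# (the sizes are made up, but the growth pattern is the point):
contexts = [60_000, 120_000, 180_000, 250_000, 320_000, 400_000, 480_000, 560_000]
print(sum(bust_cost(c) for c in contexts))  # ~14.81 in cache re-writes alone
```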
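
If you want to watch the byte-for-byte prefix matching yourself, here's a minimal sketch against the Anthropic Messages API's prompt-caching markers. The model id is a placeholder, and note the API only caches prefixes above a minimum length (roughly 1024 tokens on the bigger models), so picture a much longer system prompt than the one-liner here:

```python
# Sketch: why a timestamp in the system prompt defeats caching. The cache key
# is an exact prefix match, so any byte that differs between calls forces a
# fresh full-price cache write.
import datetime
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the env

def ask(system_text: str, question: str) -> None:
    resp = client.messages.create(
        model="claude-opus-4-20250514",  # placeholder; use a current model id
        max_tokens=256,
        system=[{
            "type": "text",
            "text": system_text,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }],
        messages=[{"role": "user", "content": question}],
    )
    u = resp.usage
    print(f"writes={u.cache_creation_input_tokens} reads={u.cache_read_input_tokens}")

STABLE = "You are a careful coding assistant."  # identical bytes on every call
BUSTED = f"You are a careful coding assistant. {datetime.datetime.now()}"  # changes

ask(STABLE, "What is a KV cache?")  # first call: cache write for the prefix
ask(STABLE, "What is a TTL?")       # same prefix: cache read at 10% of input price
ask(BUSTED, "What is a KV cache?")  # one changed byte: full-price write again
```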

Originally posted by u/lucifer605 on r/ClaudeCode

  • Reply from a lemmy.world user (1 hour ago):

    What to do about it is to not pay for an LLM.

    Studies have shown that programmers who use LLMs believe they are more productive but are actually less productive.