Original Reddit post

TL;DR: They removed cache-miss insurance coverage for API calls sourced from Claude Code specifically.

The Base

- KV-caching. When you send a prompt with some files attached... no, the attached files are NOT the thing that gets cached; that is digital dust, KBs to a few MBs. When the model loads and embeds the tokens of your prompt and computes them through the transformer, right before generating the output, it holds an internal state. THAT is the thing. For large models it runs to tens or hundreds of GB: expensive to recompute, cheap(-ish) to store, and hard to move around due to its sheer size (so it is very local).
- API caching. When you are an agent's author (Claude Code or a custom agent), you know you will come back with the same part of a prompt over and over. So you, the client, and the LLM provider make a deal: you pay 200% of the price for the first call, but only 10% for all subsequent calls within a 1-hour window.
- Cache-miss insurance. OK, you've made the first call, paying double. Now the second call. However! The server your huge cache sits on is busy (you are not alone out there) and cannot serve your second call in the foreseeable future, or it was just restarted. Your call gets served by another server with a full recompute. BUT! You paid the double price; you can't just accept an "oopsie" you can't even control (not your servers) – you have a contract. This is where the LLM provider covers you: your prompt is fully recomputed, but you are charged the agreed-upon 10% rate. The provider eats the cost and you eat the unexpected latency – fair. This model is still sustainable – go figure how lucrative good cache management is, for both parties.

The subscribers
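The economics of the deal sketched in the section above fit in a few lines. This is a toy model using the post's illustrative rates (200% for the cache write, 10% for reads), not any real provider's price sheet:

```python
# Toy cost model of the caching deal described above. WRITE_RATE and
# READ_RATE are the post's illustrative numbers (200% / 10%), not
# actual Anthropic pricing.
WRITE_RATE = 2.0  # first call: pay double so the prompt prefix gets cached
READ_RATE = 0.1   # subsequent calls on the same prefix within the window

def client_cost(n_calls: int, base: float = 1.0) -> float:
    """What the agent author pays for n calls sharing one cached prefix.

    With cache-miss insurance this formula holds even when a "read" is
    recomputed from scratch: the provider eats the difference and
    still bills READ_RATE.
    """
    if n_calls <= 0:
        return 0.0
    return base * (WRITE_RATE + (n_calls - 1) * READ_RATE)
```

Break-even against no caching (which costs n × base) comes when 2.0 + (n-1)·0.1 < n, i.e. from the third call onward, which is why agents that replay the same long prefix hundreds of times make the deal lucrative for both sides.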

- So, earlier, calls from Claude Code were routed internally to the same usual API endpoints, complete with cache-miss insurance.
- Now, in the voice of Amodei: Wait a minute! Why do we insure AND cover users stuck in the middle of our marketing funnel – Pro, Max, shmax – for years? It is fair to cover those who pay 200% for tokens, but these do not pay for tokens at all. Untangle the insurance from the subscriptions. No buts! We have the usage data to model it, we can do it gradually. We can weather the backlash – I know a guy I can safely flip a birdy to on camera – to move public sentiment right before we announce whatever we will be announcing. Nah, to hell with announcing – whatever we will be rolling out.
- Further on, calls from Claude Code get routed to an internal API with the same tech – caching, load balancing that prefers cache-holding pods, etc. – but a cache miss is no longer on the house: it gets deducted from the subscriber's token quota as-is.
- Token spending from the quota is calculated after the call, as it is not known beforehand. Cache hit? Congrats, you've got the usual -0.2% off your 5-hour limit. Cache miss? Bad luck, -8%. Did you send a prompt that reads some files sequentially, with some thinking in between, all while our true dear API clients are doing Their Important Tasks? Bingo, you are on a cache-miss streak: -125% of your 5-hour limit and -10% of the weekly one – sometimes we find, sometimes we lose, muha-ha-ha, achievement unlocked: certified serial loser.

What does it explain
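To see why the quota arithmetic described in the section above stings, plug in the post's made-up per-call charges (-0.2% of the 5-hour limit per cache hit versus -8% per miss; these rates come from the post, not from any documented quota rules):

```python
# Quota drain under the post's illustrative per-call charges.
# These percentages are the post's guesses, not documented behavior.
HIT_COST = 0.2   # percent of the 5-hour quota per cache hit
MISS_COST = 8.0  # percent of the 5-hour quota per cache miss

def quota_drain(hits: int, misses: int) -> float:
    """Percent of the 5-hour quota consumed by a session."""
    return hits * HIT_COST + misses * MISS_COST

# An all-hit session survives roughly 500 calls, while a streak of
# just 16 misses already costs 128% -- one unlucky multi-file prompt
# recomputed from scratch can blow the whole limit.
```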
- Why the sudden peak hours. The probability of a cache miss increases with load: if everybody needs compute right now, the likelihood that the server holding your cache is busy skyrockets naturally. Why do the peak hours stop at exactly 7pm GMT? Because it is not peak hours, it is a pricing policy they switch on and off via "cron".
- Why the x2 March promo. The core premise of the pricing model is that some users are flexible and can move their load off-peak. But they didn't know how many such users there are. So they nudged people to go off-peak with the x2 promo and counted who was able to – they didn't need all of them to move, they only needed enough to estimate.
- The randomness of the effect across the board. Two points. First, it is inherently random and stochastic. One day you are smirking at the losers complaining on Reddit. The next day a cache-miss streak drains your Max x20 5-hour limit in a single prompt. While in the evening of the same bad day, your other, Free-tier account lets you actually complete that same prompt and then some (all cache hits succeed). Second, as this is just a pricing model, it is trivial to A/B test and roll out gradually on a per-account/region/tier/usage-pattern basis.
- Their peak-hour language of "faster than before" – not x2 or x10 to speed, but "it's stochastic, nobody can say how fast, but damn sure faster than before".
- Their general lack of communication and transparency. Humans are notoriously bad with probabilities. They just can't. Go ahead and try to ELI5 this to any non-specialist in a way that hostile media won't spin to vilify you (even if Amodei flipped all the birds to all the guys he knows).
- Why March? This needs planning, modelling, and fine-tuning; this "innovative" pricing model has to have been in the works for months. Rolling it out was likely a Q1 OKR agreed upon with investors, and March is the last month of 26Q1.
- The days of overload errors, and the days of slow responses (tens of minutes). These may as well be experiments to fine-tune and optimize the pricing model. OK, you have a cache-miss situation at hand – the princess is in the tower with your cache, but she is busy and can't see you now. Two options. Radical candor: report the server holding your cache as unavailable, Overloaded (red error) – truthful but unactionable (wait? retry? the cache is lost, so no hope in retrying?). Fair: if it's busy, we wait in an orderly queue, for minutes upon minutes if necessary – unsustainable when humans are involved ("ah, it must be stuck!", ESC, ESC, "I'll try again", then "yo Reddit, why is it so slow today?!").
- Their focus on fixing cache errors. It is generosity – not adding slop to the injury for users already in pain. Palliatives and painkillers, not a remedy, as there is just no illness to cure.
- Why "Added per-model and cache-hit breakdown to /cost for subscription users" suddenly appears in the changelog for 2.1.92 – they know they need to give people at least some information to control their quotas.
- Why they were going postal on third-party use of Claude Code subscriptions, OpenCode and the likes. They insured and covered those too, for no apparent reason – and these are advanced users, where "advanced" means they advanced away from the marketing funnel, away from switching to API use.

The consequences
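The "probability of cache misses increases with load" claim from the section above is easy to see with a toy model. Assume your cache lives on exactly one server that is busy with probability rho (the fleet utilization) and that calls are independent – a deliberate oversimplification, no real fleet works like this, but it shows how miss streaks explode at peak:

```python
# Toy model of the load/miss claim above: the cache lives on one
# specific server that is busy with probability rho; calls are
# treated as independent. Grossly simplified on purpose.
def p_miss_streak(rho: float, k: int) -> float:
    """Probability that k consecutive calls all miss the cache."""
    return rho ** k

# At 30% utilization a 5-call miss streak is ~0.2% likely;
# at 90% utilization (peak) it jumps to ~59%.
```

That superlinear jump between off-peak and peak is the whole "peak hours" story in one expression.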

- (good) Currently, at the API level we have two modes: caching with full insurance, and no caching. We may end up getting a third one – caching on a best-effort basis, allowing for a lower token cost but load-sensitive – which would naturally push everyone who can to off-peak times, moderating the compute load for everyone else. A win for all.
- (good) This model could be sustainable, or at least more sustainable, while still usable for subscribers.
- (bad) It is not a "bug" to fix. Downgrading to an earlier CC version is about as relevant as upgrading to any other version, or... just taking a nap: some minutes later you will likely find yourself in a different load situation and may roll into a cache-hit streak.
- (bad) They may announce an Ultra subscription tier – with the usual API insurance – priced in the thousands, not hundreds, targeting SMBs, not hobbyists or partisan employees.
- (ugly) If they pull this off, all the others will follow, simply because of the laws of economics, and you can't fight those.

Source: I was sitting in my huge leather armchair – thinking. No LLMs were used to hallucinate or write or spell-check this.

Originally posted by u/the_rigo on r/ClaudeCode