Original Reddit post

Like many of you, I saw the announcement about claude -p moving to a separate Agent SDK credit starting June 15: $20/mo on Pro, $100 on Max 5x, $200 on Max 20x, billed at full API rates instead of the subsidized flat rate we've been getting.

This problem isn't new for me. I've been running multiple Max subscriptions and still hitting limits. When you're building across several codebases and running automated pipelines, even Max plans run dry. That's actually why I started building Maggy in the first place - not because of this announcement, but because I was burning through tokens faster than any single subscription could handle.

Maggy started as a Claude Code bootstrap - a collection of skills, hooks, and rules - basically my development workflow. But the token problem kept nagging me, so I kept building. It's now an open source multi-model routing system that decides which AI model should handle each task based on complexity. It's still early days, but here's what I've done so far and what I've learned.

THE CORE IDEA: STOP SENDING EVERYTHING TO CLAUDE

Most coding tasks don't need a $0.03/1K-token premium model. Fixing a typo? Adding a log statement? Simple CRUD? A free local model or a cheap API handles that just fine.

Maggy assigns a "blast score" (1-10 complexity) to every task using semantic classification, then routes it to the right model tier:

- Blast 1-3: Ollama local (qwen3-coder) or Kimi — free to $0.001/1K tokens
- Blast 4-6: Codex or Kimi — $0.001 to $0.01/1K tokens
- Blast 7-10: Claude — $0.03/1K tokens
- Security tasks: always Claude, regardless of blast score

In a 6-task benchmark sprint, this routing reduced Claude usage from 100% of tasks down to 17%. That's an 83% reduction in premium model burn.
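To make that concrete, here's a simplified sketch of the routing decision (not Maggy's actual code - the scorer below is a toy stand-in for the semantic classification, and the model names and prices are just placeholders taken from the list above):

```python
# Simplified sketch of blast-score routing - illustrative, not Maggy's actual code.
# The tier table mirrors the list above; classify_blast() is a toy stand-in for
# the semantic classification step.

MODEL_TIERS = [
    # (blast ceiling, model, approx $ per 1K tokens)
    (3, "qwen3-coder-local", 0.000),
    (6, "kimi-or-codex", 0.001),
    (10, "claude", 0.030),
]

def classify_blast(task: str) -> int:
    """Toy complexity scorer (1-10). Maggy uses semantic classification instead."""
    score = 2
    for hint in ("refactor", "architecture", "migrate", "design", "concurrency"):
        if hint in task.lower():
            score += 3
    return min(score + len(task) // 200, 10)

def route(task: str, is_security: bool = False) -> str:
    """Send the task to the cheapest tier whose blast ceiling covers it."""
    if is_security:
        return "claude"  # security work always goes to the premium model
    score = classify_blast(task)
    for ceiling, model, _cost_per_1k in MODEL_TIERS:
        if score <= ceiling:
            return model
    return "claude"

print(route("add a debug log statement to the login handler"))     # cheap/local tier
print(route("design a migration to a multi-region architecture"))  # premium tier
```

Because the tiers are just data, swapping what sits in the cheap slot (Kimi, DeepSeek, or a local model) is a one-line change - more on that in the hardware section.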

BENCHMARK RESULTS

I've run Maggy across multiple projects and here are some results (I'm still working on it).

Speed (Maggy vs Claude Code):

- Spec generation (blast 2): 50.4s vs 48.6s
- DB schema (blast 3): 86.6s vs 67.2s
- CRUD endpoints (blast 5): 147.1s vs 160.6s — Maggy wins this one
- API + summary (blast 5): 133.9s vs 130.8s
- Frontend (blast 6): 280.1s vs 121.9s
- Security review (blast 8): 209.5s vs 151.9s

Claude Code is faster overall (about 33%), which makes sense — one model, no routing overhead, no fallback chains. But the point isn't speed, it's sustainability: Maggy only hit Claude for 1 out of 6 tasks.

Quality: Maggy scored 7.4/10 vs Claude's 7.8/10 on a weighted average — essentially tied. Maggy actually scored higher on security (10/10 vs 7/10) because it runs a dedicated security review pass. Claude won on test generation and spec writing.

Where each task actually ran:
- Local Ollama: 1 task (17%) — $0.00
- Kimi: 1 task (17%) — basically free
- Codex: 3 tasks (50%) — cheap
- Claude: 1 task (17%) — premium

That distribution is the whole point. Your expensive model only fires when it actually needs to.

SELF-LEARNING BLUEPRINTS

This is the part I'm most excited about. Maggy captures tool sequences from successful tasks and stores them as "blueprints." After 3 successful runs of a similar task pattern, it automatically routes that type of work to the cheapest model that got it done.

Real example: I generate benchmark reports for 16+ companies. After 3 successful runs with Claude, Maggy learned the pattern and now routes all of them to the local model. That went from $0.03/1K tokens to $0.00.

The blueprint system needs at least 3 successes and 70% confidence before it trusts a cheaper model, so it doesn't downgrade prematurely.
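To give a feel for the promotion rule, here's a simplified sketch (again, illustrative - the structure, field names, and helper are not Maggy's actual internals, just one way to express "3 successes and 70% confidence"):

```python
# Simplified sketch of blueprint promotion - illustrative structure and thresholds,
# not the actual Maggy internals. The idea: a cheaper model earns the right to own
# a task pattern only after repeated, high-confidence success.
from dataclasses import dataclass

MIN_SUCCESSES = 3      # at least 3 successful runs of the pattern
MIN_CONFIDENCE = 0.70  # and a 70% success rate before downgrading the route

@dataclass
class Blueprint:
    pattern: str               # e.g. "benchmark-report-generation"
    cheap_model: str           # cheapest model that has completed this pattern
    successes: int = 0
    attempts: int = 0

    def record(self, succeeded: bool) -> None:
        self.attempts += 1
        if succeeded:
            self.successes += 1

    @property
    def confidence(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0

    def trusted(self) -> bool:
        """True once the cheap model can take over this pattern by default."""
        return self.successes >= MIN_SUCCESSES and self.confidence >= MIN_CONFIDENCE

bp = Blueprint(pattern="benchmark-report-generation", cheap_model="qwen3-coder-local")
for outcome in (True, True, True):   # three successful runs
    bp.record(outcome)
print(bp.trusted())  # True -> future runs of this pattern skip the premium model
```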
HARDWARE AND MODEL RECOMMENDATIONS

I run this on a Mac Studio M4 Max with 128GB RAM. Qwen3-Coder 30B (MoE architecture, Q8_0 quantization) runs at 75.7 tokens/sec locally. That's 3.4x faster than the previous Qwen2.5-Coder and 2x faster than Claude's API throughput. The MoE design means only 3.3B parameters are active per inference, so it's surprisingly fast.

But not everyone has 128GB of RAM sitting around, and that's fine. If you don't have the hardware for local models, use Kimi (which is what I'm using) or DeepSeek (deepseek-v3) as your cheap tier instead. Both have 128K context windows and cost $0.001-0.002 per 1K tokens. They handle blast 1-5 tasks well. The routing logic doesn't care what sits in each tier slot — you just swap which model is your "cheap" option.

WHAT THE JUNE 15 CHANGE ACTUALLY MEANS

For anyone building on claude -p, here's the practical impact. Previously, claude -p was subsidized roughly 25x on subscription plans: you'd consume $500 worth of API tokens but it just counted against your flat subscription. That subsidy is now gone for programmatic usage. On a Pro plan, $20/month at full API rates gets you roughly 88 chat messages with context before you're done for the month. On Max 5x, $100 goes further but still runs out under heavy use.

Everything that uses claude -p is affected: your scripts, CI pipelines, wrapper tools, GitHub Actions, third-party apps built on the Agent SDK. All of it draws from this new capped credit. Your interactive Claude Code usage (the REPL you type into) stays on your subscription, unchanged. It's only the programmatic stuff that moves.

WHAT YOU CAN DO RIGHT NOW

1. Audit your claude -p usage. If you have anything calling claude -p automatically, know your volume before June 15.
2. Route by complexity. This is the big one. Not every task needs Claude. Build a routing layer, or use one that exists.
3. Track spend per model. Know what each model costs you. Even a simple counter helps you see where the money goes (a minimal sketch is at the end of this post).
4. If you're on Pro ($20/mo credit), seriously consider Max 5x ($100/mo). $20 at full API rates is gone in a day of real work.
5. Claim your credit. Anthropic sends an email on June 8. You have to opt in - nothing is granted automatically.
6. Look into local models if you have the hardware. Even a Mac Mini with 32GB can run smaller quantized models for simple tasks.

The announcement caught a lot of people off guard, but the writing was on the wall: Anthropic can't subsidize unlimited programmatic usage forever. The answer isn't to fight the billing model - it's to stop treating Claude as the default for everything and start routing based on what the task actually needs.

Maggy is open source if you want to try it or build on it: github.com/alinaqi/maggy

Happy to answer questions about the routing setup, benchmark methodology, or model selection. There's a lot of detail in Maggy beyond the routing, from memory organization to knowledge-base building to reward-based learning.
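On item 3, here's the kind of simple counter I mean - a minimal sketch with placeholder prices; swap in your providers' real rates:

```python
# Minimal per-model spend counter. Prices are placeholders, not official rates -
# plug in what your providers actually charge.
from collections import defaultdict

PRICE_PER_1K_TOKENS = {
    "qwen3-coder-local": 0.000,
    "kimi": 0.001,
    "codex": 0.005,
    "claude": 0.030,
}

spend = defaultdict(float)
calls = defaultdict(int)

def record_call(model: str, tokens: int) -> None:
    """Accumulate cost and call count per model."""
    spend[model] += tokens / 1000 * PRICE_PER_1K_TOKENS[model]
    calls[model] += 1

# Log every routed task as it completes, then review where the money goes.
record_call("qwen3-coder-local", 12_000)
record_call("codex", 8_500)
record_call("claude", 20_000)

for model, cost in spend.items():
    print(f"{model:20s} {calls[model]:3d} calls  ${cost:.3f}")
```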

Originally posted by u/naxmax2019 on r/ClaudeCode