Original Reddit post

(Follow-up to my previous post, "It’s time to move / free our tokens".) The last thread was mostly speculation, so I spent yesterday actually benchmarking the Claude Agent SDK with Anthropic models against a local Qwen3.6-35B-A3B on my RTX 3090, to back the claim with real numbers. Thanks to u/gdraper99 for the suggestion that pushed me to actually run this. Three questions, three verdicts:

  1. Is the Haiku tier replaceable by Qwen? → Yes. 9/10 on the Opus-judge for verify, parity with the Anthropic ceiling, and verify is the dominant workload at ~1,300 calls per run. It even scores one point above Anthropic Haiku on importance, and runs ~5× faster end-to-end on this tier.
  2. Is the Sonnet tier replaceable by Qwen? → No. With thinking ON on the rewrite step, Qwen lands at 6/10 vs the 9/10 ceiling. The real failure mode isn’t the reasoning (the fixes themselves get applied correctly), it’s instruction-following on output format: parasitic prose preambles (“Let me analyse…”), missing edit tags, inconsistent inline citations. So those 8 calls per run stay on Anthropic in production.
  3. Is the Opus tier replaceable by Qwen? → Not even attempted. Opus stays on Anthropic for anything where I can’t accept a regression (final verification, user-facing summaries). Qwen already plateaus 3 points below ceiling on the Sonnet tier, so betting on it for Opus-tier work would be reckless.

End-to-end impact on my fact-check pipeline:

  - Runtime: ~4h → ~59 min
  - Anthropic API calls per run: 1,696 → 8

Bottom line: move the volume (Haiku tier, ~99% of calls) to local Qwen, keep the stakes (Sonnet/Opus) on Anthropic.

Full bench, 5 providers × 4 workloads × N=5, Opus-as-judge with an Anthropic-vs-Anthropic ceiling for calibration: 👉 https://anatoly.cloud/research/local-llm-claude-agent-sdk-turboquant

Happy to dig into the setup or the judging methodology in the comments.
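
For anyone who wants to sanity-check the "~99% of calls" and runtime figures, here is a rough Python sketch of the routing rule and the arithmetic behind it. The call counts and runtimes are the ones quoted above; the `ROUTES` table, the backend labels, and the function names are hypothetical illustrations, not the pipeline's actual code.

```python
# Figures quoted in the post.
CALLS_BEFORE = 1696          # Anthropic API calls per run before the move
CALLS_AFTER = 8              # Anthropic API calls per run after the move
RUNTIME_BEFORE_MIN = 4 * 60  # "~4h", taken as 240 minutes
RUNTIME_AFTER_MIN = 59       # "~59 min"

# Hypothetical tier -> backend table mirroring the three verdicts.
ROUTES = {
    "haiku": "local-qwen",   # high-volume tier moves to the local model
    "sonnet": "anthropic",   # rewrite step stays on Anthropic
    "opus": "anthropic",     # no-regression work stays on Anthropic
}


def backend_for(tier: str) -> str:
    """Return the backend label for a model tier (case-insensitive)."""
    try:
        return ROUTES[tier.lower()]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None


def offloaded_share(calls_before: int, calls_after: int) -> float:
    """Fraction of Anthropic calls that no longer hit the API."""
    return (calls_before - calls_after) / calls_before


def speedup(before_min: float, after_min: float) -> float:
    """End-to-end runtime ratio, before divided by after."""
    return before_min / after_min


if __name__ == "__main__":
    share = offloaded_share(CALLS_BEFORE, CALLS_AFTER)
    ratio = speedup(RUNTIME_BEFORE_MIN, RUNTIME_AFTER_MIN)
    print(f"calls moved off the API: {share:.1%}")
    print(f"runtime speedup: {ratio:.1f}x")
```

With the quoted numbers, 1,688 of 1,696 calls leave the API, which is where the "~99%" comes from, and 240 min over 59 min works out to roughly a 4× end-to-end speedup.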
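
The Sonnet-tier failure mode (prose preambles and missing edit tags) is the kind of thing a cheap format check can catch before an output is accepted. A minimal sketch, assuming the rewrite step is expected to answer with one or more `<edit>…</edit>` blocks and nothing before the first one; the tag name and the function `format_problems` are assumptions made for illustration, since the post does not show the real output format. Citation consistency is not checked here.

```python
import re

# Hypothetical edit-tag format; the real pipeline's tags are not shown in the post.
EDIT_BLOCK = re.compile(r"<edit>.*?</edit>", re.DOTALL)


def format_problems(output: str) -> list[str]:
    """Return a list of format problems found in a model's rewrite output."""
    problems = []
    text = output.strip()
    if not EDIT_BLOCK.search(text):
        # No complete <edit>...</edit> block anywhere in the output.
        problems.append("missing edit tags")
    elif not text.startswith("<edit>"):
        # Something (e.g. "Let me analyse...") precedes the first edit block.
        problems.append("prose preamble before first edit")
    return problems


if __name__ == "__main__":
    samples = [
        "<edit>Corrected the date.</edit>",
        "Let me analyse the claim.\n<edit>Corrected the date.</edit>",
        "I corrected the date.",
    ]
    for sample in samples:
        print(format_problems(sample))
```

A check like this only flags the format problem; it does not repair the output or say anything about whether the fix itself is right.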

Originally posted by u/Apprehensive_Row9873 on r/ClaudeCode