AI subscriptions keep getting more expensive. GitHub just moved Copilot from request-based to consumption-based pricing, and most of the others are heading the same way. Meanwhile, I kept hearing that local models got good enough to run on a laptop. So I figured it was time to actually try it and see where things stand.

I run Qwen3.6-35B on a MacBook Pro M2 Max with 64GB unified RAM. Nothing exotic. No rack, no begging NVIDIA for expensive GPUs. Just a (yes, kind of expensive) MacBook Pro I already owned for work at Aiven.

In the last month I’ve:

- One-shotted full landing pages from short briefs
- Built several frontend + backend features
- Fixed a nasty backend race condition bug

A year ago I would have called that fantasy on this hardware. Now it’s a Sunday morning.

To be fully honest, not all of it made it to production. A lot of it was evaluation work, as Qwen isn’t part of my actual day-to-day stack yet. But for me, this is the first real step toward considering it, and I wanted to share the findings with my colleagues and the community.

## The honest cons, because it’s not all roses

- **It’s slower than Opus.** A landing page that Opus generates in 3-4 minutes takes Qwen 8-9 minutes on my M2 Max. Not unreasonable, but still meaningfully slower than the competition. If you’re benchmarking against Sonnet/Opus latency, you’ll be a bit disappointed (for now).
- **Context blows up fast in agentic loops.** Even with 256K, you burn through it faster than you’d expect from a (nearly) state-of-the-art model. There’s a lot of room for improvement here. And if you’re driving Qwen3.6 from an agent like Claude Code, it fills even faster, as other users in this sub have reported (example thread).
- **Quality variance by task.** Models like Opus one-shot most tasks these days. Qwen3.6 hits around 75% for me. The other 25% it gets close, but needs a couple of iterations to land.

## The pros, because they’re real

- **No rate limits, no usage anxiety.** Counting tokens is no longer a thing. You can focus completely on building instead of saving tokens or thinking about cost.
- **The hardware floor keeps dropping.** A year ago this needed an A100. Today it runs on a (yes, powerful) MacBook M2 Max 64GB laptop at roughly 27 tokens per second.
- **Tool calling actually works.** This used to be an important missing piece. A year ago, local models would hallucinate tool names or get stuck in loops. With Qwen3.6, tool calling just works. That’s the real unlock for agentic work (see the sketch after this list).
- **Privacy is built-in.** Sensitive code, internal repos, half-formed ideas you don’t want training the next frontier model. None of it leaves the laptop. You can be confident that your personal or business code stays with you, and isn’t sitting on some third-party server that could be hacked.
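If you want to verify the tool-calling claim yourself, here’s a minimal smoke test. It assumes you’re serving the model through something with an OpenAI-compatible endpoint, like Ollama’s `/v1` on its default port; the model tag `qwen3.6` and the `run_tests` tool are illustrative placeholders, not anything from my actual setup.

```python
# Minimal tool-calling smoke test against a local OpenAI-compatible server.
# Assumptions: Ollama-style /v1 endpoint on the default port; "qwen3.6" and
# run_tests are illustrative names -- adjust to whatever you run locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool, for illustration only
        "description": "Run the project's test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test file or directory"},
            },
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.6",  # assumption: whatever tag you pulled locally
    messages=[{"role": "user", "content": "Run the backend tests in tests/api."}],
    tools=tools,
)

# A model with working tool calling returns a structured call here,
# not a tool name hallucinated into prose.
print(resp.choices[0].message.tool_calls)
```

If that prints a structured call with the right arguments instead of a made-up tool name in the message body, the agentic loop has something solid to build on.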
## Why 12-24 months, not “now” and not “5 years”

Latency and context limits are still a bit rough. If your job is shipping production code on a deadline, Opus and Sonnet are still the move for most of your day. I’d be lying if I said otherwise.

But saying it’s 5+ years away misses what’s already shipped. Look at the delta over the last 12 months:

- It runs on a reasonably priced MacBook Pro, which is a one-time cost
- It’s fast enough (though it can still get faster)
- Quality has improved significantly for real-world use cases (with more headroom to grow)

That curve doesn’t stop. It compounds. 12 months from now, the 27B/35B-class models will be where 70B is today, and the runtimes will be 2x faster on the same silicon. 24 months from now, the question won’t be “can I run a useful model locally?” It’ll be “why am I still paying for tokens I could generate for free, and with 100% privacy?”

## What I’d tell someone on the fence

Don’t cancel your Claude Code subscription yet. Run a local model in parallel for 60 days. Use Opus/Sonnet for the latency-critical, deep-reasoning work. Use Qwen3.6 for everything you’d have done overnight or on the weekend, everything experimental, and every “just try it” task where the cost of waiting a few minutes is zero.

Over time, the usage ratio might flip. You’ll use the local model more and more. When the next Qwen drops (3.7? 4?), who knows what the ratio will look like.

The local LLM takeover isn’t a moment in time. It’s a slope. And the slope already started.

## What’s next

- Integrate Qwen3.6 with the tools I use day-to-day at my work, like Cursor and Claude Code. They offer a much better dev experience than more basic, non-agentic tools like Ollama.
- Try out other local models, like Google’s Gemma 4. Curious to see how it stacks up (a rough throughput harness for that comparison is sketched below).
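Since tokens per second is the number I’d want to reproduce when comparing models, here’s a rough harness for it. Same assumptions as before: an Ollama-style OpenAI-compatible endpoint on the default port, and the model tags (`qwen3.6`, `gemma4`) are placeholders for whatever you’ve actually pulled. It measures end-to-end wall-clock time including prompt processing, so treat the numbers as approximate.

```python
# Rough tokens-per-second harness for comparing local models behind an
# OpenAI-compatible endpoint. Assumptions: Ollama-style server on the default
# port; the model tags below are placeholders for whatever you have pulled.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def tokens_per_second(model: str, prompt: str) -> float:
    """End-to-end generation speed; includes prompt processing, so it's rough."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    # Most OpenAI-compatible servers report usage; fall back to a crude
    # whitespace token count if this one doesn't.
    if resp.usage is not None:
        n = resp.usage.completion_tokens
    else:
        n = len(resp.choices[0].message.content.split())
    return n / elapsed

prompt = "Write a Python function that deduplicates a list while preserving order."
for model in ("qwen3.6", "gemma4"):  # hypothetical local tags
    print(f"{model}: {tokens_per_second(model, prompt):.1f} tok/s")
```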
Originally posted by u/sh_tomer on r/ClaudeCode
