Original Reddit post

We’re a small team (Codeflash; we build a Python code optimization tool) and we’ve been using Claude Code heavily for feature development. It’s been genuinely great for productivity. Recently we shipped two big features, Java language support (~52K lines) and React framework support (~24K lines), both built primarily with Claude Code. The features worked. Tests passed. We were happy. Then we ran our own tool on the PRs.

The results: across just these two PRs (#1199 and #1561), we found 118 functions that were performing significantly worse than they needed to. You can see the Codeflash bot comments on both PRs; there are a lot of them.

What the slow code actually looked like: the patterns were really consistent. Here’s a concrete example. Claude Code wrote this to convert byte offsets to character positions:

```python
# Called for every AST node in the file
start_char = len(content_bytes[:start_byte].decode("utf-8"))
end_char = len(content_bytes[:end_byte].decode("utf-8"))
```
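For contrast, here is a minimal sketch of the precompute-once, binary-search approach the post describes as the fix: record the byte offset of each character start a single time, then map any byte offset to a character position with `bisect`. The names (`build_char_starts`, `byte_to_char`) are ours for illustration, not from PR #1597.

```python
from bisect import bisect_left

def build_char_starts(content_bytes: bytes) -> list[int]:
    # Byte offsets where each UTF-8 character begins. Continuation
    # bytes have the form 0b10xxxxxx and are skipped. Built once
    # per file: O(n).
    return [i for i, b in enumerate(content_bytes) if (b & 0xC0) != 0x80]

def byte_to_char(char_starts: list[int], byte_offset: int) -> int:
    # Count characters that start before byte_offset: O(log n) per
    # lookup, same result as len(content_bytes[:byte_offset].decode("utf-8"))
    # for offsets on character boundaries.
    return bisect_left(char_starts, byte_offset)

content_bytes = "héllo wörld".encode("utf-8")
starts = build_char_starts(content_bytes)
assert byte_to_char(starts, 3) == len(content_bytes[:3].decode("utf-8"))
```

The one-time O(n) scan replaces an O(n) decode on every call, which is where the ~19x speedup comes from on files with many AST nodes.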

It re-decodes the entire byte prefix from scratch on every single call: O(n) per lookup, called hundreds of times per file. The fix was to build a cumulative byte table once and binary-search it, 19x faster for the exact same result (PR #1597).

Other patterns we saw over and over:

- Naive algorithms where efficient ones exist: a type extraction function was 446x slower because it used string scanning instead of tree-sitter
- Redundant computation: an import inserter was 36x slower from redundant tree traversals
- Zero caching: a type extractor was 16x slower because it recomputed everything from scratch on repeated calls
- Wrong data structures: a brace-balancing parser was 3x slower from using lists where sets would work

All of this was correct code. All of it passed tests. None of it would have been caught in a normal code review. That’s what makes it tricky.

Why this happens (our take): this isn’t a Claude Code-specific issue; it’s structural to how LLMs generate code.

- LLMs optimize for correctness, not performance. The simplest correct solution is what you get.
- Optimization is an exploration problem. You can’t tell code is slow by reading it; you have to benchmark it, try alternatives, and measure again. LLMs do single-pass generation.
- Nobody prompts for performance. When you say “add Java support,” the implicit target is working code, fast, not optimally-performing code.
- Performance problems are invisible. No failing test, no error, no red flag. The cost shows up in your cloud bill months later.

The SWE-fficiency benchmark tested 11 frontier LLMs, including Claude 4.6 Opus, on real optimization tasks; the best achieved less than 0.23x the speedup of human experts. Better models aren’t closing this gap because the problem isn’t model intelligence, it’s the mismatch between single-pass generation and iterative optimization.

Not bashing Claude Code. We use it daily and it’s incredible for productivity. But we think people should be aware of this tradeoff.
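The "zero caching" pattern in particular has a nearly free fix in Python's standard library. This is a hedged sketch with a hypothetical `return_type` helper, not code from the actual PRs; the point is that memoizing a pure function is one decorator.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def return_type(signature: str) -> str:
    # Hypothetical stand-in for an expensive type-extraction parse:
    # take everything after "->" and strip surrounding whitespace.
    _, _, ret = signature.partition("->")
    return ret.strip()

# First call computes; repeated calls with the same argument
# are served from the cache instead of re-parsing.
print(return_type("def f(x: int) -> list[str]"))  # list[str]
```

`lru_cache` only helps when the function is pure and its arguments are hashable, but that describes many of the recompute-from-scratch cases the post lists.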
The code ships fast, but it runs slow, and nobody notices until it’s in production.

Full writeup with all the details and more PR links: BLOG LINK

Curious if anyone else has noticed this with their Claude Code output. Have you ever benchmarked the code it generates?

Originally posted by u/ml_guy1 on r/ClaudeCode