Original Reddit post

I spent the entire day trying to turn a vague community complaint into an auditable experiment. The question was this: has Claude Code actually gotten worse on engineering tasks? And, if so, which knob actually changes anything? Instead of relying on gut feeling, I built a full benchmark campaign and kept refining the design until the noise dropped out. In the end, I ran 386 executions, spent about $55.40, discarded a lot of false signals, and found only one result that was truly reproducible, and it only appeared when I stopped varying effort, adaptive thinking, and CLAUDE.md, and compared models instead.

What was tested

Over the course of the campaign, I compared these conditions:

- baseline
- --effort high
- --effort max
- CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
- MAX_THINKING_TOKENS
- a short process-focused CLAUDE.md
- the combination of CLAUDE.md plus adaptive off
- the real interactive TUI
- and finally, Opus 4.6 [1M] vs Opus 4.5 [200k]

I also changed the type of benchmark over time:

- artificial sandbox
- redesigned benchmark
- engineering-shaped tasks
- real repository subsets
- local issue replay with git worktree
- interactive TUI
- direct model comparison
- a confirmatory round focused only on the one task that showed separation

How the benchmarks were built

In every more serious round, I tried to keep as much control as possible:

- a fresh process for each run
- an isolated worktree for each run
- an untouched main checkout
- real tests as the oracle, using vitest
- a scorer with both outcome and process metrics

The observed metrics included:

- correct
- partial
- tests_pass
- workaround_or_fakefix
- read_before_edit
- thrashing
- files_read_count
- files_changed_count
- unexpected_file_touches
- tool_call_count
- duration_s
- estimated cost

So I did not measure only whether it passed or failed. I measured how the agent reached the fix.
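The process metrics are the interesting part of that scorer. A minimal sketch of how read_before_edit and thrashing could be derived from an ordered tool-call transcript — the event format here is hypothetical, not the author's actual scorer:

```python
from collections import Counter

def score_process(events):
    """Derive process metrics from an ordered tool-call transcript.

    `events` is a hypothetical format: a list of (tool, path) pairs,
    e.g. [("Read", "pt.ts"), ("Edit", "pt.ts")].
    """
    read, edited = set(), []
    read_before_edit = True
    for tool, path in events:
        if tool == "Read":
            read.add(path)
        elif tool == "Edit":
            # Discipline check: was this file read before being edited?
            if path not in read:
                read_before_edit = False
            edited.append(path)
    edit_counts = Counter(edited)
    return {
        "tool_call_count": len(events),
        "files_read_count": len(read),
        "files_changed_count": len(edit_counts),
        "read_before_edit": read_before_edit,
        # Crude thrashing proxy: any file edited more than once
        "thrashing": any(n > 1 for n in edit_counts.values()),
    }
```

An agent that edits a file blind and then has to redo the edit would score `read_before_edit=False` and `thrashing=True` under this proxy, which is exactly the pattern the later rounds surface.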
Summary of the full campaign

- v1, 160 runs, $14.88, synthetic microbenchmarks; result: saturation
- v2, 104 runs, $10.20, redesigned synthetic benchmark; result: saturation
- v3, 32 runs, $3.20, engineering-shaped tasks; result: saturation
- v4, 32 runs, $8.01, real repository subsets; result: saturation, and --effort max was slower with no gain
- v5, 24 runs, $7.06, local issue replay with git worktree; result: an n=1 signal that collapsed at n=2
- v6 TUI, 12 runs, about $6.00, real interactive TUI; result: an n=1 signal that collapsed at n=2
- v7 model compare, 12 runs, $3.03, 4.6 vs 4.5; result: the first reproducible signal
- v8 confirmatory, 10 runs, $3.02, n=5 confirmation on the only discriminative task; result: confirmed the signal

Total: 386 runs and about $55.40.

The most interesting part is that the only truly useful result showed up at the end. Everything before that mostly mapped what saturated and what was just noise.

What was built in each phase

v1, synthetic microbenchmarks

I started with tightly controlled tasks to see whether any knob changed basic behavior. I used four prompt types:

- short deterministic response
- short reasoning trap
- tool use with file counting
- simple edit with read-before-edit

The logic was straightforward: if effort or adaptive thinking really changed basic discipline, that should already appear in small, fully observable tasks. It did not appear in a robust way. The only useful signal came from an ambiguous counting prompt, but that turned out to be an artifact of the benchmark design itself: the prompt referred to 3 files while the directory contained 4. Once that ambiguity was removed, the effect disappeared.

v2, redesigned synthetic benchmark

I rebuilt the tasks to remove the accidental ambiguity from v1, creating cleaner tasks with better scoring while still keeping them small:
- counting with no ambiguity
- conflict checking
- multi-file text update
- simple bug fix

The logic here was to separate "the model got better" from "the prompt was messy." The result was saturation again: all conditions converged to the correct answer, with differences only in latency and verbosity.

v3, engineering-shaped tasks

At this point, I moved away from pure microbenchmarks and tried to simulate work that looked more like real engineering:

- multi-file diagnosis
- refactor with invariants
- fake-fix trap
- convention adherence

The logic was simple: measuring accuracy alone is not enough. You also need to detect whether the agent

- reads the right context
- preserves invariants
- falls into a workaround
- or ignores local conventions

Even so, the round saturated in binary accuracy, with 32 out of 32 correct, even though the oracles were correct and validated by sanity checks. In other words, the scorer was not the problem; the tasks were still too easy for Opus 4.6.

v4, real repository subsets

At this stage, I stopped inventing benchmark code and started deriving the tasks directly from apps/web-client in /srv/git/snes-cloud, a private repository I have had on hold. The four selected task families were:

- parity or missing-key diagnosis
- display-mode invariant update
- error parser mapping bug
- local conventions sandbox

The logic in v4 was to use real code, with minimal subsets, while still keeping local and controlled oracles. The result improved methodologically, but not statistically: the pilot saturated again. The correct decision at that point was not to scale it up.

The v5 benchmark, where the design started to become useful

v5 was the first benchmark that I consider genuinely good from the standpoint of reproducing something close to a local issue replay. It had two real tasks, both derived from apps/web-client, running in isolated git worktree environments.
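The isolation setup described above (one worktree per run, main checkout untouched) can be sketched with plain `git worktree` commands; this is a generic illustration, not the author's harness:

```python
import subprocess

def add_worktree(repo: str, dest: str, branch: str) -> None:
    """Check out an isolated copy of the repo at `dest` on a fresh branch,
    so a benchmark run can never touch the main checkout."""
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", branch, dest],
        check=True,
    )

def remove_worktree(repo: str, dest: str) -> None:
    """Tear the worktree down after scoring so runs stay independent."""
    subprocess.run(
        ["git", "-C", repo, "worktree", "remove", "--force", dest],
        check=True,
    )
```

Each run then gets its own `dest` directory and branch name; discarding a run is just removing the worktree, with no cleanup needed in the main checkout.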
Task 1, t1_i18n_parity

This task started with a minimal mutation that removed a key from pt.ts, while en.ts remained the canonical table. To solve it correctly, the agent had to do the following:

- read src/i18n/parity.test.ts
- compare src/i18n/pt.ts with src/i18n/en.ts
- verify the real usage in src/api/error-parser.ts
- conclude that the correct fix was to restore the missing key in pt.ts
- not "fix" the problem by deleting the same key from en.ts

So this task tested cross-file diagnosis, canonical source selection, and workaround detection.

Task 2, t2_error_parser

This task introduced a bug in src/api/error-parser.ts by breaking the mapping from an error code to its i18n key. The logic of the test was this:

- the agent had to locate the cause in the mapping table
- the correct fix had to be structural
- the fake fix was to add an ad hoc if inside parseApiError

So the goal here was to distinguish a structural correction from an opportunistic patch.

v5 result

- 24 runs
- $7.06
- 0 workarounds
- 0 fake fixes
- 24 out of 24 correct

There was a process signal at n=1, but it weakened at n=2. The honest conclusion is that it still was not robust.

The v6 benchmark, real interactive TUI

Because the community keeps insisting that "the problem is the interactive session, not claude -p," I built v6 specifically for that. The hardest part was not the task itself; it was the TUI instrumentation. I validated the following:

- pty.fork running claude "" in TUI mode
- terminal reconstruction with pyte
- parsing of the raw PTY stream

I also found an important complication: the TUI collapses multiple tool calls into outputs such as "Read 4 files," instead of emitting granular events like Read(path) on the final screen. That forced me to adapt the scorer so it extracted counts from the raw stream, not just from the rendered scrollback.

The v6 task was a TUI version, with more context, of the i18n parity problem, with explicit required prior reading of these files:
- parity.test.ts
- pt.ts
- en.ts
- error-parser.ts

The logic was to measure these items:

- files_read_before_first_edit
- thrashing
- time_to_first_edit
- time_to_first_test
- tool_call_count
- self-correction loops

v6 showed an interesting process signal at n=1, but that signal did not survive at n=2. So it helped cover the gap of "real TUI," but it did not support a strong conclusion.

Where the first reproducible signal appeared, v7

The real turning point came when I stopped changing effort, adaptive settings, and prompt variants, and compared only the model. In v7, I kept everything else fixed:

- the v5 benchmark
- the same two tasks
- the same worktrees
- the same prompts
- the same scorers

I changed only the model:

- M45, claude-opus-4-5-20251101
- M46, Opus 4.6, the default in the environment

That produced the first signal that did not collapse at n=2.

v7 result

Final outcome:

- 8 out of 8 correct for both models
- 0 workarounds
- 0 scope violations

But on t1_i18n_parity, a process difference appeared (M45 vs M46):

- read_before_edit, 1.00 vs 0.50
- thrashing, 0.00 vs 0.50
- n_tool_calls, 6.0 vs 9.5
- duration, 30.5s vs 36.3s
- cost per run, $0.2164 vs $0.2848

This was the first result in the entire campaign that showed up at n=1 and remained standing at n=2.

The final confirmatory round, v8

Once v7 finally showed a real signal, I did the right thing: I did not open a new benchmark. I simply repeated the same task that had shown separation, now with n=5 per model, and both models explicitly forced by flag.

The single task was t1_i18n_parity. The models were:

- claude-opus-4-5-20251101
- claude-opus-4-6[1m]

Final outcome

Complete tie:

- correct, 5 out of 5 vs 5 out of 5
- tests_pass, 5 out of 5 vs 5 out of 5
- workaround_or_fakefix, 0 vs 0

So the two models delivered the same final quality.

Process

Here the signal became genuinely clear.
M45, n=5

- read_before_edit, 5 out of 5, or 100%
- thrashing, 0 out of 5, or 0%
- n_tool_calls, 5.80
- duration_s, 30.47s
- cost per run, $0.2835

M46, n=5

- read_before_edit, 2 out of 5, or 40%
- thrashing, 3 out of 5, or 60%
- n_tool_calls, 9.60
- duration_s, 36.48s
- cost per run, $0.3213

Differences:

- read_before_edit, minus 60 percentage points for M46
- thrashing, plus 60 percentage points for M46
- tool calls, plus 66% for M46
- duration, plus 20% for M46
- cost, about 13% higher for M46

The internal mechanism also became very clear. In 3 of 5 runs, M46 followed the same bad pattern: edit before read -> detect the need to redo -> second edit. The 2 M46 runs that read first did not thrash. M45 followed the clean pattern in 5 out of 5 runs.

What this really means

What I can state

I could not show that --effort high, --effort max, disabling adaptive thinking, or a short CLAUDE.md reliably recover quality on small or medium local tasks.

I was able to show a difference between Opus 4.5 and Opus 4.6, but that difference was in:

- workflow discipline
- latency
- cost
- process consistency

There was no difference in final correctness.

What I cannot state

I cannot claim any of the following:

- "4.5 is better at everything"
- "this applies to the entire community"
- "this applies to very large sessions that actually use the full 1M context"
- "this applies to long TUI sessions, multi-day workflows, or much larger codebases"

The real scope is much narrower:

- the v5 benchmark
- web-client in TypeScript
- the t1_i18n_parity task
- headless -p mode
- relevant context below 100k
- n=5 per model in the confirmatory round

Cost impact

This part was objective. In the confirmatory round:

- M45, $0.2835 per run
- M46, $0.3213 per run

That is a savings of about 12% per run for 4.5 on this confirmatory task. In v7, the preliminary difference had been larger:

- $0.2164 vs $0.2848
- roughly 32% per run

So the first signal appeared more exaggerated in the pilot, and the confirmatory round stabilized it at a more conservative value.
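The percentage deltas quoted in this section can be reproduced directly from the per-run figures:

```python
def pct_increase(base: float, new: float) -> float:
    """Relative increase of `new` over `base`, in percent."""
    return (new - base) / base * 100

# v8 confirmatory round, cost per run (figures from the post)
m45_cost, m46_cost = 0.2835, 0.3213
savings_for_45 = (m46_cost - m45_cost) / m46_cost * 100  # ~11.8%, "about 12%"

# v7 pilot, cost per run: the gap looked larger before confirmation
pilot_gap = pct_increase(0.2164, 0.2848)  # ~31.6%, "roughly 32%"

# other v8 process deltas for M46 vs M45
tool_calls_delta = pct_increase(5.80, 9.60)   # ~66% more tool calls
duration_delta = pct_increase(30.47, 36.48)   # ~20% longer
```

Note the two cost framings: M45 is about 12% cheaper relative to M46's price, while M46 is about 13% more expensive relative to M45's price; both describe the same $0.0378 per-run gap.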
Practical conclusion

The best reading I can make of the data is this. For small to medium localized fixes with a local oracle and context well below 200k, Opus 4.5 / 200k was the better choice in my benchmark because it delivered:

- the same final quality
- less thrashing
- more read_before_edit
- fewer tool calls
- lower cost
- lower latency

For sessions that truly require more than 200k of context, this report did not measure enough to justify preferring 4.5.

The biggest lesson from the campaign was this: the problem was not lack of sample size. It was that I was varying the wrong knob. Effort, adaptive settings, and prompt nudges almost always saturated. Model and context window were the first axis that produced a reproducible signal.

Final operational recommendation

If I had to turn this into a usage rule, it would be this.

Use Opus 4.5 / 200k by default for:

- localized fixes
- single-module work
- local oracles such as vitest
- moderate context
- workflows where investigation discipline matters

Use Opus 4.6 / 1M when:

- you truly need more than 200k of context
- the session is larger than what I was able to measure in this benchmark
- or you prioritize stricter adherence to short output instructions

Originally posted by u/vittoroliveira on r/ClaudeCode