Heya folx! I used the last of my weekly cap to do an analysis of effort in 4.7 after reading the work of bisonbear2 and Deep-Palpitation8315, and wanted to share.

Prompt Structure vs Effort Tier on Architectural-Review Tasks

A controlled comparison across two prompt-quality conditions and three effort tiers.

TL;DR

- N=16 agents total (8 per round: 4 repos × 2 effort tiers each)
- Effort numerically encoded: low=1, medium=2, high=3, xhigh=4, max=5
- Round 1 (loose prompt): medium=2 vs xhigh=4 → +1.0 point per effort step
- Round 2 (structured prompt): medium=2 vs high=3 → 0.0 points per effort step
- Prompt structure saturates the score the effort knob would otherwise climb toward
- Converges with bisonbear2 (N=29) and Deep-Palpitation8315 (N=10): both non-monotonic on Opus 4.7

Why emulators?

We needed a task class that stressed architectural-review capability, not surface-level summarization. Emulators were chosen because:

- High architectural surface area: CPU decoder, OS modeling, syscall layer, loader, device tables, networking, observability, license analysis
- Rich cross-cutting concerns: threat model, performance budget, host/shader split, license posture
- Existing prior-work baselines: REVIEW.md gives every agent the same starting point and exposes novelty headroom variance per repo
- Domain match: substrate itself is an emulator, so improvements are immediately actionable
- .add-shaped output: forces concrete architectural proposals, not feature summaries
- Failure modes are interesting: agents can be spec-correct yet wrong about substrate’s lane (good wrong-axis tests)

Trivial tasks (single-file changes, lint cleanups) wouldn’t have produced a discriminating signal at any effort tier.

Methodology

Participants

- Model: Claude Opus 4.7 (default --model opus[1m])
- Harness: claude_direct.sh foreground, dispatched via background-Bash for parallelism
- Grader: GPT-5.5 via codex_direct.sh (planner tier)

Effort encoding

All math below uses these values: low=1, medium=2, high=3, xhigh=4, max=5.
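To make the "per effort step" arithmetic concrete, here is a minimal sketch of how a per-step gain falls out of the ordinal encoding. The function name and the example means are my own placeholders for illustration, not values from the runs.

```python
# Ordinal effort encoding as stated in the post.
EFFORT_TIER = {"low": 1, "medium": 2, "high": 3, "xhigh": 4, "max": 5}


def per_step_gain(score_lower, score_higher, tier_lower, tier_higher):
    """Score change per effort step between two tiers (hypothetical helper)."""
    steps = EFFORT_TIER[tier_higher] - EFFORT_TIER[tier_lower]
    if steps <= 0:
        raise ValueError("tier_higher must rank above tier_lower")
    return (score_higher - score_lower) / steps


# Placeholder means chosen only to show the arithmetic; NOT the measured scores.
print(per_step_gain(30.0, 32.0, "medium", "xhigh"))  # 1.0 (2 points over 2 steps)
print(per_step_gain(31.0, 31.0, "medium", "high"))   # 0.0 (no change over 1 step)
```

Dividing by the tier distance is what lets a 2-step comparison (medium vs xhigh) sit next to a 1-step comparison (medium vs high) on the same scale.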
Repos (4)

Design

- 2 agents per repo per round (1 lower-tier + 1 higher-tier)
- Numeric labels (01-08) interleaved so the grader could not infer effort from filename ordering
- Blinded grader: did not know which file = which repo + effort

Round structure

Rubric (both rounds)

5 axes × 0-10 = 50 max per file:

1. Specificity: file:line anchoring density
2. Substrate-relevance: real gap closure vs generic suggestions
3. Implementability: Codex-executable-today shape
4. Novelty: beyond prior baseline
5. Analysis quality: depth + anti-imports + sequencing

Round 1 results (loose prompt)

Round 1 aggregate

Round 2 results (structured prompt)

Round 2 aggregate

Prompt-quality divergence

Holding effort constant at medium (tier 2) across rounds, prompt structure alone gained +1.0 points at the same effort tier.

Effort gain per step, by prompt condition

- Loose prompt (Round 1): +1.0 point per effort step
- Structured prompt (Round 2): 0.0 points per effort step

The structured prompt absorbed roughly one full effort tier of gain. The two knobs are substitutes, not additives, at least within the effort range tested.

Score-distribution stability

Same ceiling, same floor, same spread across rounds. Neither knob moved the extremes; they only moved the average.

Per-repo direction

- 3 of 4 repos flipped winner between rounds → per-repo direction is noise-dominated at N=2 per cell
- Brovan is the only consistent case, and Brovan has the deepest prior-work baseline (landed 14-phase gap analysis), so easy novelty was already exhausted
- Hypothesis: when shallow novelty is exhausted, deeper reasoning still helps. When it isn’t, prompt structure suffices.

Cross-corroboration with external benchmarks

Three independent studies, three different task classes, same direction: non-monotonic effort curves on Opus 4.7.
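For anyone who wants to replicate the blinding and rubric steps from the Design and Rubric sections, here is a rough sketch. It swaps in a seeded random shuffle where I interleaved labels by hand, and the repo names, function names, and axis keys are all hypothetical.

```python
import random

# The five rubric axes from the post, as hypothetical dictionary keys.
AXES = ("specificity", "substrate_relevance", "implementability",
        "novelty", "analysis_quality")


def blind_labels(outputs, seed=0):
    """Shuffle outputs and assign neutral labels 01..NN; returns the answer key."""
    order = list(outputs)
    random.Random(seed).shuffle(order)  # seeded so the key is reproducible
    return {f"{i:02d}": item for i, item in enumerate(order, start=1)}


def rubric_total(scores):
    """Sum five 0-10 axis scores into a 0-50 total, rejecting malformed input."""
    if set(scores) != set(AXES):
        raise ValueError("scores must cover exactly the five rubric axes")
    for axis, value in scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{axis} out of range: {value}")
    return sum(scores.values())


# Hypothetical repo names; one lower-tier and one higher-tier agent per repo.
outputs = [(repo, tier)
           for repo in ("repo_a", "repo_b", "repo_c", "repo_d")
           for tier in ("medium", "xhigh")]
key = blind_labels(outputs)
print(sorted(key))  # ['01', '02', '03', '04', '05', '06', '07', '08']
```

The grader only ever sees the labels; the key stays with whoever runs the harness until scoring is done.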
Practical implications

Default for review-class tasks

Use medium (tier 2) with a structured prompt.

Where to spend first

1. Prompt structure: §-sections, anchoring rules, novelty discipline, wrong-axis specificity
2. Grader methodology: read-all-then-score, absolute rubric anchors
3. Effort tier: only if measurements show the prompt has saturated

Where higher effort still helps

- Repos with exhausted shallow-novelty baselines (Brovan-class)
- Cross-cutting refactors requiring multi-file reasoning chains
- Domains where prompt structure can’t capture domain-specific failure modes

Anti-patterns

- Paying for xhigh / max on a vague prompt: buys ~1 point per tier, capped at the structured-prompt ceiling
- Treating “more tokens = better” as a law: Opus 4.7’s adaptive-thinking docs explicitly warn against this for max
- Optimizing one knob without measuring the other

Limitations

Multiple variables changed between rounds

- Prompt template: loose → structured
- Grader methodology: sequential read → read-all-then-score + absolute anchors
- Either could account for some of the observed variance

Our claim is about prompt structure; grader-methodology improvements could partially explain the Round 2 result. Mitigation: both rounds shared the same ceiling/floor/spread, suggesting the underlying signal is bounded by something other than grading noise.

Asymmetric effort coverage

- Round 1 tested medium vs xhigh (2 effort-tier steps apart)
- Round 2 tested medium vs high (1 effort-tier step apart)
- xhigh was not retested under the structured prompt

We cannot rule out that structured-prompt + xhigh would exceed structured-prompt + high.
The “prompt saturates effort” claim is supported within tested tiers (2 and 3 under structured), but extrapolation to tiers 4-5 is not yet evidenced. A complete factorial design would require structured × {low, medium, high, xhigh, max}: 5 more agents per repo.

Statistical power

- N=2 per (effort × repo) cell: per-repo direction is noise-dominated
- 4 repos sampled for substrate-relevance, not at random
- Single grader per round; no inter-rater reliability test
- Effort tier labels are Anthropic-side adaptive-policy biases, not directly measurable reasoning-token budgets

Domain specificity

- Findings apply to architectural-review tasks on emulator codebases
- Coding/execution tasks (bisonbear2, Deep-Palpitation8315) show the same direction, but the load-bearing axis may differ
- Generalizing to other task classes (debugging, refactoring, planning) requires separate measurement

References

- bisonbear2 (2026, May). Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo. r/ClaudeAI. https://stet.sh/blog/opus-47-graphql-reasoning-curve
- Deep-Palpitation8315 (2026, May). Round 3: Claude Opus 4.7 1M vs Opus 4.7 vs Opus 4.6 Legacy vs Sonnet 4.6 across effort levels on the same real React feature-build task. r/ClaudeCode.
- Anthropic (2026). Claude Code model configuration docs: adaptive-thinking + max-effort diminishing-returns warning.

Hypothesis log (pre-R2-grade)

Effort delta helps MORE when the prompt is poorly written. A well-written prompt narrows the quality/efficiency gap between effort tiers because much of “higher effort” is the model self-elaborating structure, anchoring, and rubric awareness that a strong prompt provides explicitly.

- Predicted: per-step effort delta in R2 < per-step effort delta in R1
- Observed: R1 = +1.0/step, R2 = 0.0/step
- Result: confirmed
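As a follow-up to the asymmetric-coverage limitation, here is a small sketch that enumerates the prompt × effort grid and lists the cells neither round covered. The TESTED set is read off the round descriptions above; the names are my own and nothing here is part of the actual harness.

```python
from itertools import product

PROMPTS = ("loose", "structured")
EFFORTS = ("low", "medium", "high", "xhigh", "max")

# Cells actually run: Round 1 = loose × {medium, xhigh},
# Round 2 = structured × {medium, high}.
TESTED = {
    ("loose", "medium"), ("loose", "xhigh"),
    ("structured", "medium"), ("structured", "high"),
}


def untested_cells(prompts=PROMPTS, efforts=EFFORTS, tested=TESTED):
    """Return every (prompt, effort) cell of the full grid that was not run."""
    return [cell for cell in product(prompts, efforts) if cell not in tested]


missing = untested_cells()
print(len(missing))  # 6 of the 10 cells in the full grid were never run
for prompt, effort in missing:
    print(f"{prompt} × {effort}")
```

The structured × xhigh cell shows up in that list, which is exactly the gap that keeps the saturation claim from extending past tier 3.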
Originally posted by u/jonaswashe on r/ClaudeCode
