Original Reddit post

Turns out, Opus 4.6 can hold the full trace in context and reason about internal consistency across steps, rather than evaluating each step in isolation. It also catches failure modes we never explicitly programmed checks for. (Trace examples: https://futuresearch.ai/blog/llm-trace-analysis/)

We had tried this before with Sonnet 3.7, so this time we gave Opus 4.6 a Claude Code skill with examples of common failure modes and instructions for forming and testing hypotheses. With Sonnet, a general prompt like “find issues with this trace” wouldn’t work because the model was too trusting: when the agent said “ok, I found the right answer,” Sonnet took that at face value no matter how skeptical we made the prompt. We ended up splitting the analysis across dozens of narrow prompts applied to every individual ReAct step, which improved accuracy but was prohibitively expensive.

Are you still writing specialized check-by-check prompts for trace analysis, or has the jump to Opus made that unnecessary for you too?
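For readers curious about the cost difference between the two approaches, here is a rough sketch. All names are hypothetical (not from the post or any real API), and `run_prompt` is a trivial stand-in for a model call so the example is runnable; the point is only that the per-step strategy scales as prompts × steps while the holistic pass is a single call.

```python
# Sketch of the two trace-analysis strategies described above.
# Hypothetical names throughout; a real system would send each
# prompt to an LLM instead of calling this stand-in.

NARROW_PROMPTS = [
    "Does the step's action match its stated reasoning?",
    "Does the observation actually support the conclusion?",
    # ...dozens more specialized checks in the real setup
]

def run_prompt(prompt: str, context: str) -> int:
    """Stand-in for one model call; returns a token-ish cost."""
    return len(prompt) + len(context)

def per_step_analysis(trace: list[str]) -> int:
    """Old approach: every narrow prompt against every ReAct step.
    Number of calls scales as len(NARROW_PROMPTS) * len(trace)."""
    return sum(run_prompt(p, step) for step in trace for p in NARROW_PROMPTS)

def full_trace_analysis(trace: list[str]) -> int:
    """New approach: one holistic pass over the whole trace, so
    cross-step inconsistencies are visible in a single context."""
    return run_prompt("find issues, forming and testing hypotheses",
                      "\n".join(trace))

trace = [f"step {i}: thought/action/observation" for i in range(20)]
calls_old = len(NARROW_PROMPTS) * len(trace)  # 40 calls with just 2 checks
calls_new = 1                                 # one big-context call
```

With dozens of checks instead of two, the per-step call count (and bill) grows accordingly, which matches the "prohibitively expensive" experience described above.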

Originally posted by u/MathematicianBig2071 on r/ClaudeCode