eifachposte


I ran the same legal contract review task on both Opus 4.6 and 4.8: a 5-year framework agreement with an attached 39-clause General Purchase Conditions document, using identical prompts and a detailed rule system (format constraints, cross-reference verification, density limits, output structure requirements).

Key findings:

Cross-reference verification: The contract contains 6 known citation errors (clauses referencing the wrong articles). 4.6 caught all 6. 4.8 caught only 2-3; it literally mentioned the wrong article numbers in its output but didn’t realize they pointed to the wrong content. Same blind spot as 4.7.

Density discipline: The rules require risk descriptions to stay within ~3-4 lines. 4.6 stayed within spec throughout. 4.8 systematically exceeded the limit (5-8 lines per item), merging multiple clauses into long analytical blocks.

Format boundaries: Both models respected the “no draft contract language in recommendations” rule. That is an improvement for 4.8 over 4.7, which systematically drafted English contract clauses where it shouldn’t have.

TL;DR: 4.6 is the more reliable rule-follower; 4.8 is a stronger analyst who doesn’t follow instructions.

4.8’s own self-diagnosis after being confronted: “I followed the hard gates (process blockers) but ignored every soft rule (text-based guidelines with no enforcement mechanism).” That’s an honest and accurate summary of what happened.

Now I don’t know what to do when they one day decide to discontinue 4.6.
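One way to read the "hard gates vs. soft rules" point is that the density limit could be turned into a mechanical check on the output instead of a text-only guideline. The sketch below is only an illustration of that idea, not the rule system used in the test: the function name `density_violations`, the blank-line item separator, and the 100-character wrap width are all assumptions, and only the 4-line ceiling comes from the "~3-4 lines" rule described above.

```python
import textwrap

MAX_LINES = 4      # taken from the "~3-4 lines" rule in the post
WRAP_WIDTH = 100   # assumed display width; not stated in the post


def density_violations(report: str, max_lines: int = MAX_LINES, width: int = WRAP_WIDTH):
    """Return (item_number, line_count) for every item longer than max_lines.

    Hypothetical helper: items are assumed to be separated by blank lines.
    """
    violations = []
    items = [block.strip() for block in report.split("\n\n") if block.strip()]
    for number, item in enumerate(items, start=1):
        # Collapse the item's own line breaks, then re-wrap at a fixed width
        # so the count does not depend on where the model broke its lines.
        normalized = " ".join(item.split())
        line_count = len(textwrap.wrap(normalized, width=width))
        if line_count > max_lines:
            violations.append((number, line_count))
    return violations
```

A check like this could run on each model response, with any non-empty result sent back as a rejection, which would make the density rule a process blocker rather than a guideline the model can skip.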

Originally posted by u/SeaFlounder7336 on r/ClaudeCode

Opus 4.8 vs (<) 4.6: Rule compliance benchmark on a real contract review task
