Opus 4.8 High answered four (Monday, Wednesday, Thursday, Sunday). Opus 4.6 running with extended thinking (via Antigravity IDE) got it correct immediately, noting the universal “-day” suffix on the first reasoning step. Here’s what I think is going on technically:
- The Failure Mode: Character-Level Scanning Without Pattern Abstraction This is a textbook example of a model doing element-wise enumeration when it should be doing pattern-level deduction . The optimal reasoning path is: Recognize that all English weekday names share the morpheme “-day” Note that “day” contains “d” Conclude: 7/7 Instead, Opus 4.8 took the brute-force path: list all seven days, then scan each string individually for the character “d.” This approach is fragile because tokenization means the model isn’t operating on raw characters — it’s operating on tokens. BPE tokenization chunks words into subword units that don’t necessarily align with individual characters. When a model “scans for a letter,” it’s essentially simulating character-level operations on top of a token-level representation, which is inherently lossy and error-prone. The interesting part: the model found “d” in Monday, Wednesday, Thursday, and Sunday, but missed Tuesday, Friday, and Saturday. If you look at where the “d” sits in each word: The “d” is in the exact same position in every word (the “-day” suffix), yet the model inconsistently detected it. This is a strong signal that the model was not reasoning about the structural pattern and was instead doing unreliable token-level character simulation.
- Why Extended Thinking (CoT) Catches This Opus 4.6 with extended thinking uses a dedicated chain-of-thought scratchpad before producing the final answer. This architectural difference matters here because: Pattern recognition over enumeration: The thinking trace has space to step back and notice “wait — they all end in -day” before committing to a brute-force character scan. Opus 4.8 at “High” effort likely allocated some internal reasoning, but not enough to catch the meta-pattern. Self-verification: Extended thinking models tend to produce an answer, then verify it within the thinking trace. Even if the model initially started scanning character-by-character, the verification pass would catch the inconsistency: “If Monday has ‘d’ because of ‘-day,’ then so does every other day.” The effort slider matters: “High” is not the maximum effort level on claude.ai. Opus 4.8 offers effort tiers (High, xHigh, Max). It’s plausible that at “Max” effort, 4.8 would have caught this.
- Newer ≠ Universally Better: Benchmark Regression Is Real Opus 4.8 scored 69.2% on SWE-Bench Pro — a genuine improvement over 4.7 on complex, multi-file software engineering tasks. But benchmark gains on hard tasks don’t guarantee preservation of performance on easy ones. This is a known phenomenon in the ML literature: Catastrophic forgetting during fine-tuning can degrade performance on previously-solved task categories Alignment tuning trade-offs — RLHF/RLAIF optimization toward honesty, helpfulness, and safety can shift the model’s probability mass in ways that introduce regressions on simple pattern-matching Capability vs. reliability — A model can gain the ability to solve harder problems while becoming less reliable on easier ones, especially when those easy problems have deceptive surface structure (looks like it needs enumeration, actually needs abstraction) submitted by /u/PlefkowQuatir-41
Originally posted by u/PlefkowQuatir-41 on r/ArtificialInteligence
You must log in or # to comment.
