I ran a small experiment comparing planning quality across models/effort levels on a non-trivial software design task. The task was not a CRUD feature. The design document described a refactor of a large-scale import/ingestion pipeline involving distributed workers, async message processing, eventual consistency, stale worker handling, versioned writes, retryable artifact processing, absence cleanup, aggregate/canonical state updates, and outbox/materialization recovery. The core architectural problem was introducing a two-fence correctness model:
- a per-source generation counter to determine whether work from a given import is still authoritative
- a per-owner write sequence to provide last-write-wins ordering across multiple feeds that can write the same logical entity

The important constraint was that the first implementation PR had to be small and reviewable. It should introduce the new generation/write-sequence plumbing while preserving existing behavior: no renaming major models, no deleting completion tracking, no redesigning image/artifact storage, no rewriting the outbox system, etc.

I gave the same detailed prompt and design document to four agents:
- Agent 1: Opus 4.7, max effort
- Agent 2: Opus 4.6, high effort
- Agent 3: Opus 4.6, max effort
- Agent 4: Opus 4.7, xhigh effort

The ranking ended up being:

1. Agent 4: best overall
2. Agent 1: very close second
3. Agent 2: usable, but thinner
4. Agent 3: clearly worst

Agent 4 produced the strongest plan. It understood that Phase 1 needed to stay narrow: allocation, message plumbing, and the primary product write fence only. It also handled important edge cases like nullable write sequences for non-product imports, stale insert prevention, in-session tie-breaking, and deferring brand/image/canonical rewrites to later phases.

Agent 1 was also strong and very detailed, but it slightly overreached in Phase 1 and made a few schema assumptions that would need correction.

Agent 2 had some good tactical ideas, especially around nullable write sequences and stale insert prevention, but its later phases were much less complete.

Agent 3 was interesting because it used higher effort than Agent 2 on the same base model, yet produced a worse plan. It missed key requirements like storing the write sequence on the import/session row, allocating the sequence at session creation, and using atomic database predicates instead of mostly in-memory checks.

The main takeaway for me was that effort level helped, but it did not dominate. The newer model at high/xhigh effort was clearly better, but simply increasing effort on the older model did not guarantee better planning. For this kind of architectural planning task, the difference showed up less in prose quality and more in whether the model preserved phase boundaries and caught subtle correctness constraints.
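For readers unfamiliar with the two-fence model described above, here is a minimal in-memory sketch of the idea. All names (`Store`, `start_import`, `apply_write`, the feed/record IDs) are hypothetical, not from the actual design document, and the in-memory checks stand in for what a real system would do with atomic database predicates (e.g. an `UPDATE ... WHERE write_seq < :new_seq`):

```python
class Store:
    """Toy model of the two correctness fences, not a real implementation."""

    def __init__(self):
        self.generation = {}  # source_id -> current import generation
        self.records = {}     # record_id -> (owner_write_seq, payload)

    def start_import(self, source_id):
        # Fence 1: bump the per-source generation counter. Workers still
        # holding the old generation are now stale and their writes are
        # no longer authoritative.
        gen = self.generation.get(source_id, 0) + 1
        self.generation[source_id] = gen
        return gen

    def apply_write(self, source_id, gen, record_id, write_seq, payload):
        # Fence 1: reject work from a superseded import generation.
        if gen != self.generation.get(source_id):
            return False
        # Fence 2: per-owner write sequence gives last-write-wins ordering
        # across multiple feeds that can write the same logical entity.
        # In a real database this check and the write below would be a
        # single atomic predicate, not two in-memory steps.
        current = self.records.get(record_id)
        if current is not None and current[0] >= write_seq:
            return False  # stale insert/update prevented
        self.records[record_id] = (write_seq, payload)
        return True
```

A quick walkthrough: a stale worker that still holds an old generation is fenced even if its write sequence is newer, and a lagging feed with an older write sequence is fenced even inside the current generation.

```python
store = Store()
g1 = store.start_import("feed-a")
store.apply_write("feed-a", g1, "sku-1", 10, "v1")   # accepted
g2 = store.start_import("feed-a")                    # new import supersedes g1
store.apply_write("feed-a", g1, "sku-1", 11, "v2")   # rejected: generation fence
store.apply_write("feed-a", g2, "sku-1", 9, "v0")    # rejected: write-seq fence
store.apply_write("feed-a", g2, "sku-1", 12, "v2")   # accepted
```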
Originally posted by u/silveroff on r/ClaudeCode
