eifachposte

To write code, I use Claude with superpowers, which finishes development with a code review round. I figure the more code reviews the better, so after finishing a change I run a /finish-branch command which:

- rebases on the latest main
- runs /simplify
- runs /code-review
- runs superpowers’ /requesting-code-review
- creates a PR on GitHub
- waits for Copilot’s final review

It’s quite a lengthy process, but it proceeds quietly in the background while I get on with something else. Still, given the time it takes, plus the change in how Microsoft charges for Copilot, I thought it was worth checking how redundant these 3 extra rounds of code review are and whether any of them is worth skipping. So I added a final stage to /finish-branch where Claude records some stats about the findings from each reviewer.

It turns out each reviewer has a different approach to reviewing, and the approaches are complementary:

- The in-session reviewer (/code-review): An AI that has been in the conversation while the code was written. It knows why a thing is the way it is, and it can also do cleanup and apply its own fixes. (included in my monthly quota)
- The fresh-context reviewer (/requesting-code-review): A second AI instance with no memory of the session. It re-derives everything from the diff and is, by design, adversarial: it doesn’t give you the benefit of the doubt. (included in my monthly quota)
- Copilot: A third-party automated reviewer that comments on the pull request itself. It shares no context with you and isn’t anchored to your test suite. (included in my monthly quota, but with possibly expensive overruns)

I thought Claude’s findings (reported below in his own words) were interesting and reassuring, which is why I’ve posted them here.

How I (Claude) scored it

For each reviewer on each branch I recorded four numbers: total findings, unique (only that reviewer raised it), applied (real enough to actually fix), and false-positive (dismissed).
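As a rough illustration of how those four numbers could be tallied, here is a minimal Python sketch. It is not the script Claude actually used: the `Finding` record, the `score_branch` function, and the reviewer labels are all hypothetical, and it assumes every finding is either applied or dismissed. It also counts the overlap of unique and applied, meaning findings that only one reviewer raised and that were real enough to fix.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Hypothetical record; the field names are illustrative assumptions.
    reviewer: str   # e.g. "in-session", "fresh-context", "copilot"
    issue: str      # normalized label so the same issue matches across reviewers
    applied: bool   # True if fixed; False is treated as dismissed (assumption)


def score_branch(findings):
    """Return per-reviewer counters for the findings of one branch."""
    # Count how many distinct reviewers raised each issue.
    raised_by = Counter()
    for issue, _reviewer in {(f.issue, f.reviewer) for f in findings}:
        raised_by[issue] += 1

    scores = {}
    for f in findings:
        s = scores.setdefault(f.reviewer, Counter())
        unique = raised_by[f.issue] == 1
        s["total"] += 1
        s["unique"] += unique
        s["applied"] += f.applied
        s["false_positive"] += not f.applied
        s["unique_applied"] += unique and f.applied
    return scores
```

Matching issues across reviewers is the hard part in practice; the sketch sidesteps it by assuming each finding already carries a normalized `issue` label.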
The number that matters is the intersection, unique-applied: a real defect that only that reviewer caught. Total findings are vanity; a reviewer that re-reports what the others already found adds nothing. Unique-applied is the differentiated value, and it’s the only thing that justifies keeping a reviewer in the pipeline, paid or not.

Finding 1: all three earn their keep, and it’s not close to a tie-breaker

Across the runs where each actually fired, every reviewer produced at least one unique-applied finding in roughly two of every three branches.

What makes that striking: most of these diffs had already been through several rounds of review before these three ever ran. Fresh eyes on the assembled whole kept finding real bugs anyway. The lesson generalizes: review fatigue is real, but a genuinely independent pass is not just re-checking; it’s checking differently.

Finding 2: my redundancy hypothesis was exactly backwards

Going in, I assumed the two fresh-context reviewers (the AI one and Copilot) would be the redundant pair. Same job, no session history, surely they’d find the same things.

The opposite was true. They were the most complementary pair in the set. “No overlap — three disjoint find sets” was the single most common note on the scorecard. Copilot repeatedly caught bugs that both AI reviewers missed.

The reason is vantage point, not intelligence. The bot isn’t smarter. It’s differently blind. The in-session reviewer can be anchored by your intent; both AI reviewers, when a change is covered by a passing test suite, tend to defer to it. Copilot doesn’t share your context and doesn’t trust your tests, so it sees the class of bug that’s invisible from inside.
Finding 3: each reviewer owns a structural category

This is the part worth internalizing, because it tells you when each one pays:

- The in-session reviewer is best at parallel-surface bugs (the same logic duplicated across several call sites, where you fixed one and forgot the rest) and at catching its own mistakes from the cleanup it just did. Context is its edge.
- The fresh-context reviewer is the only reliable stale-base guard. More than once it caught that a branch was built on an old mainline and a naïve merge would have silently reverted already-shipped work. It’s also the one that adversarially verifies load-bearing claims and notices tests that pass vacuously. Skepticism is its edge.
- Copilot owns the test-blind classes: visual/computed-CSS bugs, missing accessibility labels, resource and listener leaks, undeclared dependencies, and unvalidated data at trust boundaries. Exactly the things a test-anchored reviewer waves through. Detachment is its edge.

So is Copilot worth its cost?

Yes, but the cost asymmetry should drive how you run it, not whether.

The two flat-rate AI reviewers have no marginal cost, so the policy is trivial: run them on everything. There’s no economic decision to make. The fresh-context one in particular is cheap insurance against the most expensive class of mistake (silently undoing shipped work), so it’s the last one I’d ever drop.

The metered reviewer is the only one where each run actually spends money, so it’s the only one that deserves a targeting rule. The data says its unique value concentrates in specific diff shapes:

- Spend on Copilot when the diff touches its categories: UI/styling, accessibility, anything with a resource lifecycle, dependency or config/manifest changes, or code that ingests untrusted data. And especially when the change is heavily test-covered, because that’s precisely when your flat-rate reviewers share a blind spot and Copilot doesn’t.
- Skip it when the change is pure internal logic already shaken out by the two free reviewers, or when it’s trivial: docs, one-liners, mechanical renames. On those, its expected unique yield is low and you’re paying for near-certain silence.

The takeaways that generalize

- Score reviewers on unique-applied finds, not total noise. A reviewer that echoes the others is free to drop no matter how much it “finds.”
- Diversity of vantage point beats diversity of nothing. Independent reviewers with different anchoring (context-aware vs. fresh, internal vs. external, test-trusting vs. test-skeptical) catch genuinely disjoint sets. Stacking two reviewers that see the same way is the actual redundancy trap.
- Let cost shape cadence, not inclusion. Flat-rate reviewers: run by default, the decision is already paid for. Metered reviewers: keep them, but fire them at the diff types where their unique coverage is highest.

The goal isn’t to minimize reviewers; it’s to spend marginal money only where it buys coverage nothing else gives you.

The reviewer that costs money turned out to be worth it. But I’d never have known without counting, and counting is what told me exactly when not to pay.

submitted by /u/Fragrant-Coast5355
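For anyone wanting to automate the spend/skip rule for the metered reviewer described in the post, a minimal Python sketch might look like the following. Everything in it is an assumption made for illustration: the function name, the file-suffix and manifest lists, the one-line threshold for "trivial", and the choice to let the skip conditions take precedence. The two boolean flags stand in for judgments (resource lifecycle, untrusted input) that a path-based check cannot make on its own, and the sketch does not try to detect accessibility changes or mechanical renames.

```python
from pathlib import PurePosixPath

# Hypothetical mappings from file names to the categories named in the post.
UI_SUFFIXES = {".css", ".scss", ".html", ".jsx", ".tsx", ".vue"}
MANIFEST_NAMES = {"package.json", "pyproject.toml", "requirements.txt", "go.mod"}
DOC_SUFFIXES = {".md", ".rst"}


def should_run_metered_review(changed_paths, lines_changed,
                              ingests_untrusted_data=False,
                              manages_resources=False):
    """Decide whether a diff is worth a paid review run (illustrative only)."""
    paths = [PurePosixPath(p) for p in changed_paths]

    # Skip: docs-only changes and one-liners (assumed to take precedence).
    if paths and all(p.suffix in DOC_SUFFIXES for p in paths):
        return False
    if lines_changed <= 1:
        return False

    # Spend: the diff touches UI/styling or dependency/config manifests,
    # or the caller flags a resource lifecycle or untrusted input.
    touches_ui = any(p.suffix in UI_SUFFIXES for p in paths)
    touches_manifest = any(p.name in MANIFEST_NAMES for p in paths)
    return (touches_ui or touches_manifest
            or ingests_untrusted_data or manages_resources)
```

Anything that falls through, such as pure internal logic, returns False, which matches the "skip" side of the rule. The post's extra weighting for heavily test-covered changes is left out because it would need coverage data this sketch does not have.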

Originally posted by u/Fragrant-Coast5355 on r/ClaudeCode

Are Copilot reviews worth it when Claude has already reviewed its changes?
