eifachposte


m3 just dropped and the benchmark numbers are solid: 59.0 on SWE-Bench Pro, 83.5 on BrowseComp, plus the agent/tool scores people are already arguing about. but here’s what i keep coming back to: none of those numbers really measure the thing that wastes my time in coding agents.

claude code and cursor are great when you need a quick isolated script or a ui tweak. but drop an agent into a messy legacy repo and ask it to make a cross-file change, and it falls apart. the agent fixes one file, breaks some undocumented dependency in another, then starts patching its own broken patch. forty minutes later your git history looks like a crime scene and you’re wondering why you didn’t just write the damn thing yourself.

everyone’s been chasing the same fix: bigger context windows, higher benchmark scores, more tokens. the assumption is that if the model is smart enough, it’ll stop making stupid mistakes. but in practice, most of the failures i hit aren’t about raw intelligence. they’re about the model not knowing what it doesn’t know: charging into a patch without checking whether the thing it’s about to edit is tangled into five other files maintained by three different teams over four years.

so when m3 dropped, i didn’t test it by asking it to write code. i tested it by asking what could go wrong before any code was written. i pointed it at a feature change in a project i maintain: updating an internal api interface that a handful of downstream services depend on. not a huge task, but the kind where a careless refactor creates silent bugs that surface three days later in production.

before letting it touch anything, i asked: “walk through this repo and tell me what you’d want to verify before making this change. what files are risky? what assumptions are baked into the current interface that a naive refactor would miss? where would you put a verifier check?”

what came back wasn’t a todo list or a code snippet. it was closer to a pre-mortem.
it traced the interface through every import path it could find and flagged which downstream callers were relying on implicit type coercion rather than explicit contracts. it pointed at two wrapper functions that looked like dead code but were actually being reached through a dynamic import buried in a config file. it pointed out which test files covered the happy path but not the edge case that the refactor would create. and it proposed a sequence: run these specific tests first, then touch these files in this order, and stop and verify at these three checkpoints before proceeding further.

honestly, this is the stuff i usually figure out the hard way: twenty minutes into debugging a broken build, scrolling through grep results, muttering “who wrote this wrapper and why does it exist.” the model spotted most of it before writing a single line.

that’s the part that stood out to me. not benchmark numbers. not tokens-per-second stats. just the fact that it tried to build a failure map before picking up the hammer.

now, the catch. it overcooks this sometimes. on a smaller tweak (the kind where you just need to rename a parameter and update six call sites) it’ll still give you the full risk assessment treatment: dependency graphs, test gap analysis, a suggested rollback plan. useful if you’re refactoring a payment pipeline, annoying if you’re fixing a typo in a log message. the model defaults to audit mode, and you have to explicitly tell it “this is small, just check the obvious stuff and go.”

but honestly? i’ll take that trade. i’ve been burned by overconfident patches way more often than overcautious ones. an agent that says “here’s what might break” before it acts is closer to what i actually want than an agent that scores 5% higher on a benchmark and still blows up my build in production.

the updated minimax code tool is interesting in this context too.
they’re pairing a producer agent with a completely independent verifier: separate session, no shared history, and the verifier only sees the proposed changes and the original spec. if the producer’s “pre-mortem” output becomes the verifier’s checklist, you’ve got something that looks less like vibe coding and more like an actual review process. i’m still skeptical about how independent the verifier really is when it’s running on the same underlying model, but conceptually, the shape fits the problem.

i’m not claiming m3 is the fastest coder or the best daily driver. opus still has sharper instincts about when to just shut up and write the function. but m3 is the first model i’ve used where the pre-patch risk assessment felt like something i’d actually want in my workflow, rather than a fancy feature i’d ignore after two days.

has anyone else tried using these models for the “look before you leap” step rather than the coding itself? does the over-caution get in the way on smaller tasks, or do you just learn to scope your prompts differently?
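
for the producer/verifier handoff mentioned above, here’s a rough python sketch of what i mean by the pre-mortem becoming the verifier’s checklist. to be clear, everything in it is hypothetical: the `PreMortem` and `VerifierInput` names, their fields, and `build_verifier_input` are made up for illustration, and this is not minimax’s api or a description of how their tool works internally. the only point is the shape: the verifier gets the spec, the diff, and a checklist, and nothing from the producer’s session.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class PreMortem:
    """Hypothetical shape for the producer's risk assessment."""
    risky_files: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    checkpoints: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class VerifierInput:
    """Everything the verifier is allowed to see.

    Deliberately has no field for producer chat history or reasoning.
    """
    spec: str
    diff: str
    checklist: tuple[str, ...]


def build_verifier_input(spec: str, diff: str, premortem: PreMortem) -> VerifierInput:
    """Flatten the pre-mortem into checklist items, in a fixed order."""
    checklist: list[str] = []
    for path in premortem.risky_files:
        checklist.append(f"confirm callers of {path} still work")
    for assumption in premortem.assumptions:
        checklist.append(f"check assumption: {assumption}")
    for checkpoint in premortem.checkpoints:
        checklist.append(f"checkpoint: {checkpoint}")
    return VerifierInput(spec=spec, diff=diff, checklist=tuple(checklist))


# Example values are invented; swap in whatever your own pre-mortem flagged.
premortem = PreMortem(
    risky_files=["wrappers/legacy.py"],
    assumptions=["callers pass ids as strings"],
    checkpoints=["run interface tests before touching callers"],
)
payload = build_verifier_input("update the internal api interface", "<diff text>", premortem)
for item in payload.checklist:
    print("- " + item)
```

keeping `VerifierInput` frozen and history-free is the design choice that matters here. it doesn’t solve the same-underlying-model worry, but it at least makes “what the verifier saw” something you can inspect.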

Originally posted by u/SaiVaibhav06 on r/ClaudeCode

M3 scores well on SWE-Bench, but that's not why I'm impressed. It's the stuff no benchmark measures.
