Original Reddit post

I built a harness that drives Claude Code agents to ship production code autonomously, with four mandatory review gates between every generated artifact and every release. After 97 days and 5,109 classified quality checks, the error patterns were not what I expected.

Hallucinations were not my top problem. Half of all issues were omissions: the agent simply forgot to do things, or left stubs with // TODO. The rest were systemic: it made the same wrong choice consistently. That means the failures have a pattern, and I exploited it.

The biggest finding was about decomposition. If you let a single agent reason too long, it starts contradicting itself. But if you break the work into bounded tasks with fresh contexts, the error profile changes: the smaller context makes the agent forget rather than write incoherent code. Forgetting is easier to catch. Lint, a "does it compile" check, even a regex for "// TODO" catches a surprising share of issues.

The agents are terrible at revising, though. After a gate rejection, they burn ridiculous amounts of time and tokens going in circles. I'm still figuring out the right balance between rerolling and revising.

I wrote up the full data and a framework for thinking about verification pipeline design: https://michael.roth.rocks/research/trust-topology/ Happy to discuss the setup, methodology, or where it falls apart.
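The "regex for // TODO" style of gate can be sketched in a few lines. This is a hypothetical illustration, not the author's actual harness: the pattern list and function name are made up, and a real gate would also run lint and a compile check.

```python
import re

# Illustrative stub markers a cheap pre-review gate might scan for.
# These patterns are assumptions, not taken from the post's harness.
STUB_PATTERNS = [
    re.compile(r"//\s*TODO"),              # C-style stub comments
    re.compile(r"#\s*TODO"),               # Python/shell stub comments
    re.compile(r"raise NotImplementedError"),
]

def find_stubs(source: str) -> list[int]:
    """Return 1-based line numbers that look like unfinished stubs."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(p.search(line) for p in STUB_PATTERNS):
            hits.append(lineno)
    return hits

if __name__ == "__main__":
    artifact = "int f() {\n    // TODO: implement\n    return 0;\n}\n"
    print(find_stubs(artifact))  # -> [2]; a non-empty list rejects the artifact
```

A check this dumb only works because of the finding above: bounded tasks push the failure mode toward omissions, which leave exactly these kinds of visible markers behind.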

Originally posted by u/mrothro on r/ClaudeCode