A few weeks ago, I was evaluating an agent I’d built, and it kept giving me different answers on the same task. I thought I was doing something wrong. Turns out I wasn’t. The agent just… disagrees with itself. That annoyed me enough to actually study it.

We ran 3,000 experiments — same tasks, same prompts, same inputs — across Claude, GPT-4o, and Llama.

Key findings:

- Consistent agents hit 80–92% accuracy. Inconsistent ones: 25–60%. That’s a 32–55 point gap.
- 69% of divergence happens at the very first tool call — the initial search query. Get that right and all downstream runs converge. Get it wrong and runs scatter.
- Path length is a cheap signal: agents taking 8 steps on a 3-step task are usually lost, not thorough.

Practical takeaway: run your agent 3–5x in parallel. If trajectories agree, trust it. If they scatter, don’t ship it.

Paper: https://arxiv.org/abs/2602.11619

Writeup: https://amcortex.substack.com/p/run-your-agent-10-times-you-wont

Hope this helps!
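The parallel-run check in the takeaway can be sketched in a few lines of Python. This is a minimal sketch, not anything from the paper: `run_agent` is a hypothetical stand-in for whatever calls your agent and returns its final answer, and the 60% agreement threshold is an illustrative choice, not a number from the study.

```python
from collections import Counter

def consistency_check(run_agent, task, n_runs=5, threshold=0.6):
    """Run the agent n_runs times on the same task and measure agreement.

    run_agent: hypothetical callable that takes a task and returns the
    agent's final answer. Returns (majority_answer, agreement, trust),
    where trust is True when the most common answer shows up in at
    least `threshold` of the runs.
    """
    answers = [run_agent(task) for _ in range(n_runs)]
    majority_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / n_runs
    return majority_answer, agreement, agreement >= threshold

# Toy example: a deterministic "agent" always agrees with itself.
answer, agreement, trust = consistency_check(lambda task: "42", "What is 6 * 7?")
```

In practice you’d compare more than the final answer (the post’s point is that whole trajectories diverge), but majority-vote on the final answer is the cheapest version of the same idea.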
Originally posted by u/Aggravating_Bed_349 on r/ArtificialInteligence
