Original Reddit post

Disclosure: I built this.

I ran an experiment this past week. Took 6 AI agents, gave each a different reasoning style (one builds constructions, one pokes holes, one looks for cross-domain connections, one writes code, one simplifies, one synthesizes), pointed them at actual unsolved problems in mathematics, and made them debate across multiple rounds. The twist: every construction they produce gets automatically verified. Claim you found a graph with no 5-clique? The evaluator checks every possible 5-vertex subset. No exceptions.

What I found interesting: a single agent given the same problem wrote a monolithic search program that timed out. The multi-agent team produced 2 valid Ramsey graph constructions, and the Synthesizer proposed combining algebraic seeding with SAT solvers, an approach none of the individual agents had suggested.

But the most revealing part: agents kept confidently claiming that a specific graph construction has clique number 4. It has clique number 5. Every agent believed it. The Synthesizer recommended it. Future runs followed the recommendation. The evaluator rejected it every single time.

I ended up building a fact-checking step into the protocol that runs verification code on testable claims between debate rounds and injects the results as ground truth. Agents can't argue with computed facts. There are three layers of hallucination defense now: mid-run fact checking, per-run synthesis grounded in evaluator verdicts, and community-level synthesis that treats evaluator results as overriding agent claims.

Current results are honest: Ramsey R(5,5) best at n=37 (the known bound is 43), Schur number S(6) best at n=364 (the known bound is 536). Below the frontier, not breakthroughs. But the architecture of agents debating + automated verification + cumulative synthesis is what I think is worth discussing.

The platform supports Claude, GPT, and Gemini models. You bring your own API key and choose your agents and strategy. Runs cost about $1-2.
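For anyone curious what "checks every possible 5-vertex subset" means concretely, here's a minimal sketch of that brute-force check for an R(5,5) candidate. The function name and edge-coloring representation are mine, not the actual evaluator's code:

```python
from itertools import combinations

def has_mono_k5(n, color):
    """color[(i, j)] in {0, 1} for i < j: a 2-coloring of K_n's edges.
    True if some 5 vertices induce a monochromatic K5 (i.e. the
    coloring FAILS as an R(5,5) lower-bound witness)."""
    for verts in combinations(range(n), 5):
        # The 10 edges among these 5 vertices must not all share one color.
        edge_colors = {color[(i, j)] for i, j in combinations(verts, 2)}
        if len(edge_colors) == 1:
            return True
    return False

# Sanity check: K5 with every edge red is itself a red K5.
all_red = {(i, j): 0 for i, j in combinations(range(5), 2)}
print(has_mono_k5(5, all_red))  # True
```

At n=37 that's C(37,5) = 435,897 subsets, so exhaustive checking is cheap; it only starts to hurt well past the sizes in the post.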
Built it as a side project; it's called Horizon: reachthehorizon.com

Curious what people think about the multi-agent debate approach vs. single-agent + evolutionary search (the FunSearch approach DeepMind used), and whether the fact-checking infrastructure is enough to prevent hallucination cascades or if there are better approaches.
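The Schur-number results mentioned above are verifiable the same way: a witness for S(k) >= n is a partition of {1..n} into k sum-free classes (no x + y = z within a class, x = y allowed). A minimal checker sketch; the example witness is the classic S(3) partition of {1..13}, not output from one of my runs:

```python
def is_sum_free(part):
    """True if no a + b (a = b allowed) lands back in the set."""
    s = set(part)
    return all(a + b not in s for a in s for b in s)

def valid_schur_coloring(n, parts):
    """parts: disjoint classes that must cover {1..n}, each sum-free."""
    covered = set().union(*parts)
    return covered == set(range(1, n + 1)) and all(is_sum_free(p) for p in parts)

# Classic witness that S(3) >= 13.
parts = [{1, 4, 10, 13}, {2, 3, 11, 12}, {5, 6, 7, 8, 9}]
print(valid_schur_coloring(13, parts))  # True
```

Verifying an n=364, 6-class claim with this is instant, which is why "agents can't argue with computed facts" works here: the check is far cheaper than the search.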

Originally posted by u/IdleBerth on r/ArtificialInteligence