
The citation problem in AI agents turns out not to be hallucination in the usual sense. A new benchmark paper, OpenBioRQ, covers 12,553 unsolved biomedical research questions across 12 domains and finds that agents rarely fabricate citations: over 99% of cited URLs resolve correctly. The failure is subtler. Approximately 15.9% of those citations link to papers that do not actually support the claim being made.

That distinction matters a lot for how you build and evaluate agents. If your benchmark only checks whether URLs resolve, you will score a system as nearly perfect on citation fidelity while missing a failure that affects roughly one in six citations in biomedical contexts. The benchmark deliberately uses open, unsolved questions as a faithfulness-and-abstention probe, because questions without known answers prevent models from simply reproducing expected sources.

The performance picture across current frontier systems is also sobering. Gemini-3-Pro, Opus-4.7, and GPT-5.5 scored between 29% and 60% on the hardest question subset, while open-weight models solved only about 17% of those questions. The paper also observes that on difficult questions, agents tend to stop using their retrieval tools entirely, a behavioral collapse that compounds the citation accuracy problem.
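To make the difference between "the URL resolves" and "the source supports the claim" concrete, here is a minimal Python sketch that reports the two as separate metrics. This is not the paper's evaluation code. The `CitationCheck` record, the `citation_metrics` function, and the `supports_claim` label (which would have to come from a human reviewer or a judge model) are all hypothetical names made up for illustration, and the example URLs and labels are toy data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationCheck:
    """One judged citation (hypothetical record, not from the paper)."""
    url: str
    resolves: bool         # did the URL load?
    supports_claim: bool   # assumed label from a human or judge model


def citation_metrics(checks):
    """Return resolve rate and unsupported rate as separate numbers."""
    if not checks:
        raise ValueError("need at least one citation check")
    resolved = [c for c in checks if c.resolves]
    unsupported = [c for c in resolved if not c.supports_claim]
    return {
        # share of all citations whose URL resolves
        "resolve_rate": len(resolved) / len(checks),
        # share of resolving citations that do not back the claim
        "unsupported_rate": len(unsupported) / len(resolved) if resolved else 0.0,
    }


# Toy data: every URL resolves, but one of six does not support its claim.
checks = [
    CitationCheck(f"https://example.org/paper/{i}", True, i != 0)
    for i in range(6)
]
metrics = citation_metrics(checks)
print(metrics)
```

On this toy data a resolve-only check would report a perfect score, while the second metric shows that one citation in six fails to support its claim. That is the kind of gap the post describes.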

Originally posted by u/Justgototheeffinmoon on r/ArtificialInteligence

OpenBioRQ: AI Agents Cite Wrong Papers 15.9% of the Time