Original Reddit post

https://deepswe.datacurve.ai/blog

Its actual score should have been 86.7%. There were similar errors in other benchmarks too, including:

MMLU: https://arxiv.org/abs/2406.04127
ARC-AGI: https://www.reddit.com/r/singularity/comments/1hjjj5c/comment/m37bw8p/
SpatialBench: https://x.com/YafahEdelman/status/2031178437243916509
HLE: https://www.futurehouse.org/research-announcements/hle-exam
SWE-bench Verified: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
GPQA: https://epochai.substack.com/p/gpqa-diamond-whats-left
FrontierMath: Tiers 1-4 (which was found by LLMs): https://epoch.ai/frontiermath/tiers-1-4?view=graph&tab=release-date&tier=Core+(Tiers+1-3%25

Looks like even expert human benchmark creators hallucinate. I guess that means humans are incapable of reasoning or consciousness 😔

I wonder how long until LLMs become so good that we don’t know how to measure them accurately?

submitted by /u/Tolopono

Originally posted by u/Tolopono on r/ArtificialInteligence