I’m trying to get a better sense of which AI benchmarks people actually trust right now. There are so many of them at this point: METR time horizons, SWE-bench, RE-Bench, GAIA, ARC-AGI, OSWorld, WebArena, Humanity’s Last Exam, and probably a bunch I’m missing. They all seem to measure different things: coding, web agents, long-horizon tasks, reasoning, tool use, research engineering, etc.

One thing I’m struggling with is how much weight to give the big, widely cited benchmarks. On one hand, there is obviously a lot of marketing around benchmarks. On the other hand, I don’t think that means the major benchmarks are useless. My guess is that some of them became popular because they do track something real, or because they were designed around tasks that people already believed were meaningful.

But that also makes it harder to judge them. If a benchmark was built or selected because it matched what researchers already thought mattered, how do we tell whether it really predicts broader real-world capability, rather than just reflecting the current consensus?

For people who follow this more closely:
- Which benchmarks do you actually pay attention to?
- Which ones do you think have held up well?
- Which ones look good on leaderboards but don’t tell you much in practice?

Have a nice day!

Originally posted by u/DemonLaplacien on r/ArtificialInteligence
