As artificial intelligence systems began scoring extremely high on long-used academic benchmarks, researchers noticed a growing problem: the tests that once challenged machines were no longer difficult enough. Well-known evaluations such as the Massive Multitask Language Understanding (MMLU) benchmark, previously considered demanding, now fail to properly measure the capabilities of today's advanced AI models. To address this, a worldwide group of nearly 1,000 researchers, including a professor from Texas A&M University, developed a new kind of test. Their goal was to build an exam that is broad, difficult, and grounded in expert human knowledge in ways that current AI systems still struggle to handle.
Originally posted by u/PixeledPathogen on r/ArtificialInteligence

