Original Reddit post

Thought this was an interesting benchmark looking at LLMs' long-horizon research abilities: https://x.com/intology/status/2056764236668493868 . The authors ran coding agents on Karpathy's NanoGPT speedrun competition and compared their results to human progress. They found the agents were only able to recover 9.3% of the 5-month human progress and that they generally struggled to implement research ideas. I'm not exactly sure what part of their evaluation led to this: the specific prompting they used for the models, prompting that was part of the harness itself, the basic model priors, etc. I'm also not sure why they got such different results compared to https://x.com/intology/status/2056764236668493868 . It does seem like measurements like these could be the next step for benchmarks though. Coding agents powered by modern LLMs plus custom harnesses are pretty good at editing code and iterating against test cases, but the next big hurdle the AI industry is going to come up against is how to apply them to this sort of long-horizon research problem.

Blog: https://www.intology.ai/blog/nanogpt-bench

Originally posted by u/Icy_Goat on r/ArtificialInteligence