Original Reddit post

Hi! I’m looking for a benchmark or live evaluation framework that tracks how well popular AI models and agents work right now, under real product conditions. I’m not looking for a static leaderboard frozen at the moment a model was released. What I care about is what we’re actually getting in practice.

The reason is that, in my experience, large cloud services like ChatGPT and Gemini seem to change over time: limits, reasoning modes, response speed, and the amount of work an agent is allowed to do in a single request or session. For example, during the GPT-5.4 period, ChatGPT worked much better for my tasks. After the move from GPT-5.4 to GPT-5.5, however, the overall usefulness dropped for me, seemingly because the available reasoning time became much more constrained.

So I’m not merely asking “which model is smarter in the abstract.” I’m looking for benchmarks or evaluation protocols that track the current balance between model capability and the resources the cloud product actually allows the model or agent to use. In other words, I’m looking for a benchmark of practical, consumer-facing intelligence under provider-imposed constraints.

Ideally, such a benchmark would be updated frequently (at least weekly?), since providers can quietly change settings and the real performance of the same named model or product can change quite dramatically over time.

Originally posted by u/FireFireFunFunFun on r/ArtificialInteligence