Original Reddit post

What are LLM benchmarks even for today, when companies compress models and split them into 50 different sub-models of the same system? We have zero visibility into these sub-models; they all carry the same name from our perspective. So how are we supposed to tell them apart, and more importantly, how can we tell whether sub-model A is actually better than sub-model B? It’s impossible, and it makes me question what value these benchmarks even have for us as users. Are they just selling us a dream that’s only valid during the first week after a model is released?

What we really need are continuous benchmarks. Today, Opus is just a shadow of its former self; it probably has little left in common with the numbers reported at launch. Navigating this landscape is becoming a nightmare.
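The "continuous benchmark" idea the post calls for could look something like the sketch below: re-run a fixed eval suite on a schedule and flag when recent scores drift from the launch baseline. Everything here (the `detect_drift` function, the tolerance, the sample scores) is hypothetical illustration, not any real benchmark's implementation.

```python
import statistics

# Hypothetical continuous-benchmark drift check (illustrative only):
# compare the average of the most recent runs against the score
# reported at launch, and flag a regression beyond a tolerance.

def detect_drift(baseline: float, history: list[float],
                 tolerance: float = 0.05) -> bool:
    """Return True if recent scores fall below the launch baseline
    by more than `tolerance` (absolute score points)."""
    recent = statistics.mean(history[-3:])  # mean of the last few runs
    return (baseline - recent) > tolerance

# Example: launch score 0.82, then weekly re-runs slipping over time.
weekly_scores = [0.81, 0.80, 0.74, 0.73, 0.72]
print(detect_drift(0.82, weekly_scores))  # → True (drift flagged)
```

In practice each history entry would come from re-running the same prompts against the live endpoint, which is exactly the visibility the post argues a one-time launch benchmark cannot provide.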

Originally posted by u/_SSSylaS on r/ClaudeCode