eifachposte


This is more analysis than a new benchmark run. I used public CursorBench + DeepSWE numbers and combined them into a simple cost/performance view for AI coding model routing.

The reason I did this: CursorBench feels closer to real coding sessions with messy/underspecified prompts, while DeepSWE is harder and more controlled with hand-written SWE tasks. They rank models differently, so looking at one alone didn’t answer the question I cared about: How much coding correctness am I getting for the cost?

I used a flat average of correctness and put it next to mean cost per task. Not claiming this is the universal “best model” ranking. The weighting is debatable, but it was useful for practical routing.

A few takeaways:

- GPT-5.5 Medium looks like the best default for everyday coding because the cost/output ratio is strong.
- GPT-5.5 High or Extra High makes more sense for planning big or ambiguous tasks.
- Claude Opus 4.8 is expensive, but I still like it for reviewing plans and agentic/ops-style debugging where the model has to trace logs, infra, and messy real-world flows.

The biggest pattern: maxing out reasoning effort rarely pays off. Correctness improves, but cost usually rises faster.

Full table + methodology: https://www.javascripthacker.com/blog/combined-ai-coding-leaderboard-cursorbench-deepswe

Curious how others are choosing models. Are you routing by task type, or just using one model for everything?

submitted by /u/yum72
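
For anyone who wants to try the flat-average idea on their own numbers, here is a minimal Python sketch. It is not the code behind the linked table. The `ModelResult` class, the `rank_by_value` helper, the model names, and every number below are hypothetical placeholders, and "combined correctness points per dollar" is just one assumed way to put correctness next to cost.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    cursorbench: float    # correctness on CursorBench, 0-100 (placeholder value)
    deepswe: float        # correctness on DeepSWE, 0-100 (placeholder value)
    cost_per_task: float  # mean cost per task in USD (placeholder value)

    @property
    def combined(self) -> float:
        # Flat (unweighted) average of the two correctness scores.
        return (self.cursorbench + self.deepswe) / 2

    @property
    def points_per_dollar(self) -> float:
        # Assumed value metric: combined correctness per dollar of mean task cost.
        return self.combined / self.cost_per_task


def rank_by_value(results):
    """Sort results from best to worst correctness-per-dollar."""
    return sorted(results, key=lambda r: r.points_per_dollar, reverse=True)


# Made-up numbers purely for illustration; swap in the public benchmark figures.
RESULTS = [
    ModelResult("model-a-medium", 60.0, 40.0, 0.50),
    ModelResult("model-a-max", 66.0, 46.0, 2.00),
    ModelResult("model-b", 64.0, 44.0, 1.50),
]

for r in rank_by_value(RESULTS):
    print(
        f"{r.name:16s} combined={r.combined:5.1f} "
        f"cost=${r.cost_per_task:.2f} points/$={r.points_per_dollar:6.1f}"
    )
```

With these placeholder values, the max-effort entry gains a few correctness points but costs four times as much, so it lands last on points per dollar. That is the shape of the "cost rises faster than correctness" pattern, not a measured result.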

Originally posted by u/yum72 on r/ArtificialInteligence

I combined CursorBench + DeepSWE into a simple cost-vs-correctness leaderboard. Here’s what I found.
