been doing a deep dive on model selection for production inference and pulled together some numbers from whatllm.org's january 2026 report… thought it was worth sharing because the trajectory is moving faster than i expected

quick context on the scoring: they use a quality index (QI) derived from Artificial Analysis benchmarks, normalized 0-100. it covers AIME 2025, LiveCodeBench, GPQA Diamond, MMLU-Pro, and τ²-Bench for agentic tasks (rough sketch of how that kind of composite usually gets computed after the post)

where things stand right now:

open source top 5:

- GLM-4.7 ~ 68 QI / 96% τ²-Bench / 89% LiveCodeBench
- Kimi K2 Thinking ~ 67 QI / 95% AIME / 256K context
- MiMo-V2-Flash ~ 66 QI / 96% AIME (best math in open weights)
- DeepSeek V3.2 ~ 66 QI / $0.30/M via deepinfra
- MiniMax-M2.1 ~ 64 QI / 88% MMLU-Pro

proprietary top 5:

- Gemini 3 Pro Preview ~ 73 QI / 91% GPQA Diamond / 1M context
- GPT-5.2 ~ 73 QI / 99% AIME
- Gemini 3 Flash ~ 71 QI / 97% AIME / 1M context
- Claude Opus 4.5 ~ 70 QI / 90% τ²-Bench
- GPT-5.1 ~ 70 QI / balanced across all benchmarks

numbers are in the image above, but the τ²-Bench flip is the one worth paying attention to: open source now leads on agentic tool use (GLM-4.7 at 96% vs Claude Opus 4.5 at 90%)

where proprietary still holds: GPQA Diamond (+5 pts), deep reasoning chains, and anything needing 1M+ context (Gemini). GPT-5.2's 99% AIME is still untouched on the open source side

cost picture is where it gets interesting:

open source via inference providers:

- Qwen3 235B via Fireworks ~ $0.10/M
- MiMo-V2-Flash via Xiaomi ~ $0.15/M
- GLM-4.7 via Z AI ~ $0.18/M
- DeepSeek V3.2 via deepinfra ~ $0.30/M
- Kimi K2 via Moonshot ~ $0.60/M

proprietary:

- Gemini 3 Flash ~ $0.40/M
- GPT-5.1 ~ $3.50/M
- Gemini 3 Pro ~ $4.50/M
- GPT-5.2 ~ $5.00/M
- Claude Opus 4.5 ~ $30.00/M

the cost delta at roughly comparable quality is the headline… DeepSeek V3.2 at $0.30/M vs GPT-5.1 at $3.50/M for a 4 point QI difference (66 vs 70). that's a ~91% cost reduction (1 − 0.30/3.50 ≈ 0.91) for use cases where the reasoning ceiling isn't the bottleneck (second sketch below turns this into a trivial cost-quality routing rule)

the gap was 12 points in early 2025… it's 5 now. and on agentic tasks specifically, open source is already ahead

i'd be curious what people are seeing in production: does the benchmark gap actually translate to noticeable output quality differences at that range, or is it mostly negligible for real workloads?
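the QI sketch i mentioned: to be clear, whatllm.org doesn't publish its exact aggregation (that i know of), so this just assumes the common recipe of min-max normalizing each benchmark across the model pool onto 0-100 and taking an unweighted average. the model scores in the example are placeholders, not numbers from the report:

```python
# assumed composite: min-max normalize each benchmark across the pool
# to 0-100, then average. NOT whatllm.org's published formula -- their
# exact aggregation and weights aren't public as far as i know.
from typing import Dict

Scores = Dict[str, float]  # benchmark name -> raw accuracy in %

def quality_index(pool: Dict[str, Scores]) -> Dict[str, float]:
    """Per-model 0-100 composite over the benchmarks the pool shares."""
    benchmarks = list(next(iter(pool.values())))
    totals = {model: 0.0 for model in pool}
    for b in benchmarks:
        lo = min(s[b] for s in pool.values())
        hi = max(s[b] for s in pool.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero if every model ties
        for model, s in pool.items():
            totals[model] += 100.0 * (s[b] - lo) / span
    return {m: t / len(benchmarks) for m, t in totals.items()}

# placeholder accuracies, purely illustrative
pool = {
    "model-a": {"AIME 2025": 95, "LiveCodeBench": 89, "GPQA Diamond": 84},
    "model-b": {"AIME 2025": 92, "LiveCodeBench": 80, "GPQA Diamond": 88},
    "model-c": {"AIME 2025": 81, "LiveCodeBench": 72, "GPQA Diamond": 70},
}
print(quality_index(pool))  # model-a tops this toy pool at ~92.6
```

one consequence of the min-max step: a model's QI is relative to whoever else is in the pool that month, which is part of why these indexes drift between report editions even when raw benchmark scores don't move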
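and the routing rule from the cost-delta paragraph: pick the cheapest model that clears a QI floor. the QI and $/M prices below are copied from the lists in the post; the selection rule itself is just my default, not anything the report prescribes:

```python
# cheapest-model-above-a-QI-floor selection. QI and $/M-token prices
# are copied from the post's tables; the routing rule is just one
# reasonable default, not something whatllm.org recommends.
CANDIDATES = [
    # (model, QI, price in $ per M tokens)
    ("Qwen3 235B",       None, 0.10),  # priced in the report but no QI listed
    ("GLM-4.7",          68,   0.18),
    ("DeepSeek V3.2",    66,   0.30),
    ("Kimi K2 Thinking", 67,   0.60),
    ("Gemini 3 Flash",   71,   0.40),
    ("GPT-5.1",          70,   3.50),
    ("GPT-5.2",          73,   5.00),
]

def pick_model(min_qi: float) -> tuple:
    """Cheapest candidate whose QI clears the floor."""
    eligible = [c for c in CANDIDATES if c[1] is not None and c[1] >= min_qi]
    if not eligible:
        raise ValueError(f"no candidate meets QI >= {min_qi}")
    return min(eligible, key=lambda c: c[2])

print(pick_model(66))  # ('GLM-4.7', 68, 0.18) -- ~95% cheaper than GPT-5.1
print(pick_model(70))  # ('Gemini 3 Flash', 71, 0.4)
print(pick_model(72))  # ('GPT-5.2', 73, 5.0) -- the ceiling still costs real money
```

in practice you'd gate on task type too (the τ²-Bench numbers argue for routing agentic work differently than deep reasoning), but even the dumb version above makes the 66-vs-70 QI tradeoff concrete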
Originally posted by u/ashersullivan on r/ArtificialInteligence
