I’ve spent the last while using a wide range of coding models and coding-agent environments to build and maintain real products and codebases in production. This is my personal, experience-based tier list. It is not a formal benchmark.

My rough ranking is based on practical coding-agent performance across:

- ability to understand an existing codebase
- multi-file refactors
- debugging accuracy
- maintaining architectural consistency
- ability to complete longer tasks without drifting
- quality of implementation decisions
- frequency of hallucinated files, APIs, or assumptions
- how often I had to intervene
- how production-ready the output was

My current experience:

S Tier: Claude Opus 4.7
A Tier: ChatGPT 5.5, GLM 5.1
B Tier: Qwen3.7 Plus, Kimi K2.6, DeepSeek V4 Pro, Claude Sonnet 4.6
C Tier: Qwen3.6 Plus, DeepSeek V4 Flash
D Tier: Grok 4.3, Gemini 3.1 Pro, Gemini 3.5 Flash, Nemotron, MiniMax 2.7
F Tier: Mistral Medium, MiMo V2.5 Pro

A few notes:

Claude Opus 4.7 has been the strongest for large codebase work, especially when the task requires maintaining context across multiple files and making sound implementation decisions.

ChatGPT 5.5 has been very strong and is arguably pushing into S-Tier territory, but it seems to miss one or two things every time. I would place it close to S-Tier, but in my experience Opus has been slightly more reliable for large, messy, production codebases.

GLM 5.1 surprised me. It has been much better than I expected for agentic coding work, and honestly, if it had reliable providers and good business practices behind it, it could probably move into S-Tier.

Qwen, Kimi, and DeepSeek have been capable, especially for contained tasks, bug fixes, and fast iteration. I still find they require more supervision on architecture and edge cases.

Gemini has been bad for me. It can be useful, but I have seen more drift, more incomplete implementations, and more cases where I needed to re-check the work carefully.

Mistral Medium and MiMo are straight up ass bags.
Curious where other people would rank these, especially people who have used them inside real repos rather than isolated prompts.
Originally posted by u/Cute_Dragonfruit4738 on r/ArtificialInteligence
