Original Reddit post

Hi all,

Most of us are used to LLMs working like blazing-fast typewriters: the model predicts one token, then the next, and so on (autoregression). This approach gave us ChatGPT and Claude, but it also trapped us under a "glass ceiling" of latency and cost. Mercury 2 from Inception Labs just launched, and it looks like that ceiling has cracked.

1000+ tokens per second isn't "optimization" – it's a different league. For comparison: GPT-5 mini and Claude Haiku both pull bursts of 70-90 t/s; Mercury 2 is over 10 times faster. Importantly, they achieved this not through better chips or quantization, but by changing the fundamentals. Instead of writing word by word, the model uses diffusion.

**Writing vs. Sculpting**

Imagine the difference:

- Traditional LLM: it writes a letter line by line. If it makes a logical error halfway through, it has to keep going or start over.
- Mercury 2 (diffusion): it's more like sculpting in clay or developing a photo. The model generates "noise" the length of the entire response and sharpens it over several parallel steps. The whole response – from the headline to the Python code – takes shape simultaneously.

**The end of "cascading hallucinations"?**

The most interesting property of text diffusion is its native error correction. In autoregression, an error at the beginning of a sentence corrupts everything after it (a domino effect). In Mercury 2, the model can correct the beginning of a sentence in the fourth or fifth iteration, because it already knows what the end should look like. This is why the model scores >90% on math benchmarks (AIME) despite being so absurdly fast.

**Why will this save us from "AI lag"?**

We all want AI agents that plan and act. The problem is that current agentic workflows take forever, because each reasoning step means waiting several seconds. Mercury 2 cuts that to a fraction of a second. A latency of 1.7 seconds for complex tasks means interacting with AI is no longer "send a query and wait" – it becomes a real-time conversation.
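To make the "sculpting" idea concrete, here is a toy sketch of diffusion-style decoding: start from a fully masked sequence and unmask scattered positions in a few parallel refinement steps, rather than strictly left to right. This is not Mercury 2's actual algorithm (which is proprietary); the "model" here is simulated by a known target string, so only the refinement schedule is illustrated, not learning or real denoising.

```python
import math

def toy_diffusion_decode(target, steps=3, mask="[MASK]"):
    """Toy masked-diffusion decoding schedule: begin with every
    position masked, then fill in a spread of positions per step,
    so the whole sequence takes shape in parallel rather than
    token by token. Returns the sequence state after each step."""
    seq = [mask] * len(target)
    history = [list(seq)]
    for step in range(steps):
        remaining = [i for i, t in enumerate(seq) if t == mask]
        # Unmask roughly an equal share of what's left each step.
        k = math.ceil(len(remaining) / (steps - step))
        # A real model would unmask its most *confident* positions;
        # here we just spread the picks across the sequence.
        picks = remaining[::max(1, len(remaining) // k)][:k]
        for i in picks:
            seq[i] = target[i]
        history.append(list(seq))
    return history

for state in toy_diffusion_decode("the cat sat on the mat".split(), steps=3):
    print(" ".join(state))
```

Note how intermediate states contain committed tokens in the middle and at the end of the sequence while earlier positions are still masked – the opposite of autoregression, and the reason a diffusion model can revise a beginning to fit an already-decided ending.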
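The latency claim is easy to sanity-check with back-of-the-envelope arithmetic. Below, the throughput figures (80 t/s vs. 1000 t/s) come from the post; the 300-token step size and 0.2 s per-call overhead are my own illustrative assumptions, not measured values.

```python
def agent_latency_s(steps, tokens_per_step, tok_per_s, overhead_s=0.2):
    """Rough end-to-end latency of an agent loop: each reasoning
    step generates tokens at the model's throughput, plus a fixed
    per-call overhead (network round trip, prefill). Illustrative only."""
    return steps * (tokens_per_step / tok_per_s + overhead_s)

# Throughputs from the post; step size and overhead are assumed.
for name, tps in [("autoregressive @ 80 t/s", 80), ("Mercury 2 @ 1000 t/s", 1000)]:
    print(f"{name}: {agent_latency_s(10, 300, tps):.1f} s for a 10-step workflow")
```

Under these assumptions a 10-step workflow drops from roughly 40 seconds to about 5 – which is the qualitative point the post makes about agentic "AI lag", even if the exact numbers differ in practice.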
**Verdict**

Inception Labs (a team with FlashAttention on its résumé, so they know what they're doing) has shown that diffusion isn't just for Midjourney and image generation. This could be a new architecture for text, one that gets past the scaling limits faced by giants like OpenAI and Google.

What are your thoughts? Will we see a mass migration from autoregressive Transformers to diffusion architectures, as happened in the world of AI graphics?

submitted by /u/TeachingNo4435

Originally posted by u/TeachingNo4435 on r/ArtificialInteligence