Google DeepMind just dropped an experimental open-weights model that flips standard LLM architecture on its head. It’s called DiffusionGemma (released under Apache 2.0), and instead of generating text sequentially, token by token, like almost every other LLM on the market, it uses a text diffusion head.

How it works

It starts from a 256-token “canvas” of random placeholder noise, then uses Uniform State Diffusion to iteratively refine and denoise the entire block of text at once. Because every token can attend to every other token simultaneously (bidirectional context), highly confident tokens naturally snap adjacent tokens into focus over multiple passes. It even features Error Correction via Re-Noising: if its confidence drops mid-generation, it reintroduces noise so it can correct its own mistakes in real time.

The Speed is Insane

Because it processes entire blocks at once, it shifts the local inference bottleneck away from memory bandwidth and onto raw compute.

1,000+ tokens per second on a single NVIDIA H100.
700+ tokens per second locally on an RTX 5090.

Hardware footprint: It’s a 26B Mixture of Experts (MoE) built on the Gemma 4 architecture, but it only activates 3.8B parameters during inference. When quantized, it comfortably fits inside an 18GB VRAM footprint, making it very accessible for local PC workflows.
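To make the "How it works" part a bit more concrete, here is a toy sketch of a confidence-gated refinement loop with re-noising. This is not DiffusionGemma's actual algorithm or API. `toy_denoiser`, `generate`, `TARGET`, and both thresholds are hypothetical names and values made up for illustration, and the "model" here simply cheats by knowing the target string. It only shows the general shape: start from noise, update every position in the same pass, lock in confident tokens, and scramble low-confidence ones again.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")
TARGET = "diffusion refines the whole block at once"  # toy stand-in for "what the model wants to say"

def toy_denoiser(canvas, rng):
    """Hypothetical stand-in for the model: one (guess, confidence) pair per position."""
    proposals = []
    for i in range(len(canvas)):
        # Bidirectional context: look at the neighbours on BOTH sides.
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        settled = sum(canvas[j] == TARGET[j] for j in neighbours)
        # Confidence grows when neighbouring tokens are already in place.
        conf = min(1.0, 0.25 + 0.3 * settled + 0.3 * rng.random())
        guess = TARGET[i] if rng.random() < conf else rng.choice(VOCAB)
        proposals.append((guess, conf))
    return proposals

def generate(steps=40, accept=0.5, renoise=0.3, seed=0):
    rng = random.Random(seed)
    # Start from a canvas of pure noise.
    canvas = [rng.choice(VOCAB) for _ in range(len(TARGET))]
    for _ in range(steps):
        # All proposals come from the same snapshot, so the block updates as a whole.
        for i, (guess, conf) in enumerate(toy_denoiser(canvas, rng)):
            if conf >= accept:
                canvas[i] = guess              # confident: write the proposal
            elif conf < renoise:
                canvas[i] = rng.choice(VOCAB)  # low confidence: re-noise this slot
            # in between: leave the token alone for now
    return "".join(canvas)

if __name__ == "__main__":
    print(generate(seed=0))
```

Running it with different `seed` or `steps` values shows how tokens next to already-correct ones tend to settle first, which is roughly the "snap into focus" effect described above.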
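As a rough sanity check on the 18GB figure, here is some back-of-the-envelope math. The post doesn't say which quantization is used, so the 4-bit case is an assumption, and `weight_gb` is just a helper name for this sketch. Keep in mind that with a MoE all 26B weights still have to sit in memory, while the 3.8B active parameters are what drive the per-token compute.

```python
TOTAL_PARAMS = 26_000_000_000   # 26B total, from the post
ACTIVE_PARAMS = 3_800_000_000   # 3.8B active per token, from the post

def weight_gb(params, bits_per_weight):
    """Size of the raw weights in GB (1 GB = 1e9 bytes), ignoring any overhead."""
    return params * bits_per_weight / 8 / 1e9

if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed precisions, not stated in the post
        print(f"{bits}-bit weights: {weight_gb(TOTAL_PARAMS, bits):.1f} GB")
    print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under that 4-bit assumption the weights alone come to about 13 GB, which leaves around 5 GB of an 18GB budget for everything else (activations, caches, runtime overhead).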
Originally posted by u/beasthunterr69 on r/ArtificialInteligence

