Original Reddit post

I’ve been digging into the math behind model collapse (the “Ouroboros” effect where AI trains on AI-generated data). The core issue seems to be variance reduction. Since LLMs are designed to output probable tokens, they naturally “smooth out” the distribution of human language. Train a new model on that smoothed output and you lose the “tails” of the distribution: the creativity, edge cases, and nuance. It’s effectively a photocopy of a photocopy.

I visualized how this “data degeneracy” loop works in a short breakdown here: https://youtu.be/kLf8_66R9Fs

Discussion: Do you think we can statistically “re-inject” variance into synthetic data, or is the training corpus already permanently polluted?
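
To make the mechanism concrete, here is a minimal toy sketch of the loop. It assumes a 1-D Gaussian stands in for the data distribution and that cutting off samples beyond 2 sigma stands in for “only output the probable tokens” (i.e. top-p / low-temperature sampling); the sample size and cutoff are arbitrary illustrative choices. Each generation fits a Gaussian to the current corpus, resamples from it with the tails removed, and repeats:

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 10_000      # size of each generation's "training corpus"
n_generations = 10
truncation = 2.0        # keep only samples within 2 sigma: a stand-in for
                        # "output only the probable tokens"

# Generation 0: the original "human" data.
data = rng.normal(0.0, 1.0, n_samples)

for gen in range(1, n_generations + 1):
    # "Train" the next model: fit a Gaussian to the current corpus.
    mu_hat, sigma_hat = data.mean(), data.std()

    # Build the next corpus by sampling from that model, but with the tails
    # cut off, mimicking a sampler that favours high-probability outputs.
    samples = rng.normal(mu_hat, sigma_hat, n_samples * 2)
    keep = np.abs(samples - mu_hat) < truncation * sigma_hat
    data = samples[keep][:n_samples]

    print(f"generation {gen:2d}: fitted std = {data.std():.3f}")
```

Run as written, the fitted standard deviation drops from 1.0 to roughly 0.3 within ten generations. The exact numbers depend on the cutoff and sample size, but the direction is the point: every pass through the loop shaves a bit more off the tails.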

Originally posted by u/firehmre on r/ArtificialInteligence