Original Reddit post

arXiv (paper): https://arxiv.org/abs/2603.12288/
GitHub (R simulation): https://github.com/tjleestjohn/from-garbage-to-gold/

I'm Terry, the first author. This paper is the result of 2.5 years of work trying to explain something I kept seeing in industry that lacked a good theoretical explanation.

A modern paradox: Models trained on vast, incredibly dirty, uncurated datasets (the kind of data everyone says you can't model without cleaning first) were sometimes outperforming carefully built models trained on clean, curated data. This completely defies the "Garbage In, Garbage Out" mantra that drives enormous amounts of enterprise investment in data cleaning. I couldn't find a satisfying formal explanation for why this kept happening, so I spent 2.5 years building one. The paper is long because the GIGO paradigm is deeply entrenched: the mathematical arguments that challenge it required connecting several theoretical traditions that don't normally talk to each other, and I wanted the paper to be comprehensive.

The short version of the paper: The GIGO paradigm treats data quality as a property of individual variables: make each one as clean and precise as possible before modeling. That is often the right instinct, but it misses something fundamental. For data generated by complex systems (medical patients, financial markets, industrial processes, sensor networks), there are underlying latent states that drive everything you can observe. Your observable variables are imperfect proxies of those underlying states. The question isn't just "how clean is each proxy?" It's "do your proxies collectively provide complete coverage of the underlying states?" Even perfectly cleaned proxies, if there aren't enough of them, leave you with irreducible ambiguity about the underlying states. I call this "Structural Uncertainty," and no amount of cleaning can fix it. The only fix is more diverse proxies, even imperfect ones. The paper gives the formal proof of when and why GIGO fails, and the conditions under which it fails often describe complex enterprise data environments.

The practical implication: In domains where these conditions hold, data quality is better understood as a portfolio-level architectural property than as an item-level cleanliness property. The question shifts from "how do I make each variable cleaner?" to "does my predictor set provide complete and redundant coverage of the underlying latent drivers?" These are genuinely different questions with genuinely different answers.

The real-world example: This isn't just theory. The core finding was demonstrated at scale at Cleveland Clinic Abu Dhabi: predicting stroke and heart attack using data from more than 558,000 patients, over 3.4 million patient-months, and thousands of uncurated variables from a real-world electronic health record system, with no manual cleaning. We achieved an AUC of 0.909, substantially beating the clinical risk models that cardiologists currently use as the standard of care. Published and peer-reviewed in PLOS Digital Health: "Towards artificial intelligence-based disease prediction algorithms that comprehensively leverage and continuously learn from real-world clinical tabular data systems."

The honest caveat: This doesn't work everywhere. The framework requires data generated by complex systems with underlying latent structure. Medical, financial, sensor, and industrial data typically fit; simple, flat data-generating processes don't. The paper explains how to assess whether your data meets the conditions.

The simulation: There's a fully annotated R simulation in the GitHub repo demonstrating the core mechanism, namely how adding dirty features systematically outperforms cleaning a fixed feature set across varying noise conditions. Run it yourself; a minimal sketch of the same idea appears below.

Questions? Criticisms? Happy to engage with questions or pushback, including on the scope conditions, which are the most important thing to get right.
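
For anyone who wants the gist of the mechanism before cloning the repo, here is a minimal, self-contained R sketch. It is not the repo's code, and the number of latent drivers, the noise levels, and the logistic outcome model are illustrative choices of mine rather than the paper's actual setup:

```r
# Minimal sketch of coverage vs. cleanliness (illustrative only, not the repo's simulation).
set.seed(42)

n <- 5000          # observations
k <- 6             # latent drivers of the system (unobserved in practice)
Z <- matrix(rnorm(n * k), n, k)                       # latent states
y <- rbinom(n, 1, plogis(Z %*% rep(1, k) / sqrt(k)))  # outcome driven by ALL latents

# Scenario A: a few heavily "cleaned" proxies (low measurement noise)
# that only cover 2 of the 6 latent drivers.
clean <- sapply(1:2, function(j) Z[, j] + rnorm(n, sd = 0.1))

# Scenario B: many "dirty" proxies (high measurement noise)
# that collectively cover all 6 latent drivers, three noisy copies each.
dirty <- sapply(rep(1:k, each = 3), function(j) Z[, j] + rnorm(n, sd = 1.0))

# Rank-based (Mann-Whitney) AUC, no extra packages needed.
auc <- function(score, label) {
  r <- rank(score); n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Fit a logistic regression on a proxy set and report in-sample AUC
# (in-sample keeps the sketch short; the repo's simulation is more careful).
fit_auc <- function(X) {
  d <- data.frame(y = y, X)
  m <- glm(y ~ ., data = d, family = binomial)
  auc(predict(m, type = "response"), y)
}

cat("Clean-but-narrow proxies: AUC =", round(fit_auc(clean), 3), "\n")
cat("Dirty-but-broad proxies:  AUC =", round(fit_auc(dirty), 3), "\n")
```

The point of the sketch: the two narrow proxies are nearly noise-free but see only part of the latent state, while the broad set is individually noisy yet collectively covers every driver. That coverage-versus-cleanliness trade-off is the mechanism the paper formalizes.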

Originally posted by u/Chocolate_Milk_Son on r/ArtificialInteligence