Original Reddit post

Your training data is the easiest thing to steal. Here’s why I think so.

The industry invests a ton of resources into protecting model weights and chip access. But the data that fuels these models has been left out in the open. And that’s the toughest part to replace. For a long time, AI training data didn’t seem valuable enough to steal, so the focus was on optimizing for speed, treating security like a box to check off and forget about.

But here’s what went down this spring. Mercor, a provider of training data for major labs, was breached through LiteLLM. It’s an open-source library thousands of companies pull into their stack without a second thought. Someone managed to sneak malicious code into one of its versions, and that was all it took. Years of specialized work ended up being auctioned off to the highest bidder. Months later, we’re still in the dark about which datasets and methods were compromised.

Y Combinator’s Garry Tan called it “a major national security issue,” pointing to how much frontier training data is now within reach of rivals. I’m not here to point fingers at a single company. Having spent 6+ years in the training-data space, I can say this breach revealed a systemic issue in how we’ve built our systems.

But there’s more to it. The library at the heart of this had security certifications, and the startup that issued them was accused of faking the audits. The paperwork claimed everything was secure, but it turned out to be just for show.

Expert AI training data now distinguishes a frontier model from a run-of-the-mill one. You can’t just scrape it or distill it, and it’s the reason a model gets good at medicine or law instead of staying generically smart. That’s why I believe the data has become such a tempting target now. The pipeline that produces it is one of the most valuable yet least protected assets in AI.
The same scrutiny the industry applies to weights and chips needs to extend to contractors, tooling, open-source dependencies, and the people who can touch the raw labels. And don’t get me wrong, this isn’t a call to trust no one or to build everything from scratch. Most ML teams simply can’t and shouldn’t do that. Instead, I think it’s important to verify what a certificate actually guarantees about your data security and treat your data layer as a potential target.

Having worked with more than 200 AI teams, I can tell you exactly why the data layer is the weakest link: it’s the one part that runs on people and tools outside your own walls. Mercor just proved the hard way what that costs.

Originally posted by u/karyna-labelyourdata on r/ArtificialInteligence