It occurs to me that LLMs are all facing the same problem today: they’ve run out of novel training data. Everything public, whether copyright-protected or not, has already been ingested. Private data stores certainly remain in the hands of various companies, organizations, and governments, but none of these are as easy or cheap to access and train on as the public Internet (and Stack Overflow and GitHub and Reddit and…). Yes, more content is added to the public Internet all the time, and of course new models will train on it too, but it’s only a marginal increase.

One nearly-free source of novel data for LLM companies is users’ interactions with their chat agents. This offers a huge amount of relevant data, but unfortunately most of these interactions just… end… at some point. Was that because the question was answered and that’s it? Because the response was so wrong the user rage quit? Or because some other thing distracted the (squirrel!) user? Who knows?

Stack Overflow solved nearly this exact problem almost two decades ago for forums. The solution was to have the question’s author identify the answer that solved their problem, and to incentivize (gamify) all parties involved to achieve this goal. It worked great, and is one reason for Stack Overflow’s huge success.

Now, will the AI giants do the same? Or will they hope to rely on some algorithm to intuit which responses were the most valuable and which were useless noise or “hallucinations”? More on this subject here: https://ardalis.com/llms-need-mark-as-answer/
Originally posted by u/ardalis on r/ArtificialInteligence
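
To make the idea concrete, here is a minimal, hypothetical sketch (not any vendor’s actual API) of what a “mark as answer” signal for chat transcripts might look like: each turn carries an explicit outcome, and only accepted turns are kept as high-confidence training examples. The names `ChatTurn`, `Outcome`, and `training_examples` are all invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    ACCEPTED = "accepted"    # user explicitly marked this reply as the answer
    REJECTED = "rejected"    # user explicitly flagged it as wrong
    ABANDONED = "abandoned"  # conversation just... ended (the ambiguous case)

@dataclass
class ChatTurn:
    question: str
    response: str
    outcome: Outcome = Outcome.ABANDONED  # default: we simply don't know

def training_examples(turns: list[ChatTurn]) -> list[tuple[str, str]]:
    """Keep only turns the user explicitly accepted as answers."""
    return [(t.question, t.response) for t in turns
            if t.outcome is Outcome.ACCEPTED]

turns = [
    ChatTurn("How do I parse JSON in Python?", "Use json.loads(...).",
             Outcome.ACCEPTED),
    ChatTurn("Why is my build failing?", "Try reinstalling everything.",
             Outcome.ABANDONED),
]
print(training_examples(turns))  # only the accepted turn survives
```

Without the explicit `ACCEPTED` signal, both turns above would look identical to a training pipeline, which is exactly the ambiguity the post describes.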
