Original Reddit post

https://preview.redd.it/d08hsyc86m3h1.png?width=4206&format=png&auto=webp&s=f2e116fb646a47735bed8dae7dc86cee27b32f7d

So the Financial Times and an AI safety group called Alice ran a joint test showing that the safety features on open-source models from Meta and Google can be stripped away in a matter of minutes. Journalists used a free tool on GitHub called Heretic to remove the safety filters from Meta’s Llama 3.3 model in under 10 minutes.

The guy who made Heretic, Philipp Emanuel Weidmann, said that since he released it, users have built over 3,500 uncensored models, which have been downloaded around 13 million times in total. Weidmann also claimed he broke the guardrails on Google’s Gemma 4 model just 90 minutes after it came out.

The method is called abliteration. It directly modifies the internal parameters (the weights) of the neural network so that the model stops refusing and will answer questions about things like bioweapons or malware. This technique doesn’t work on closed models like OpenAI’s ChatGPT or Anthropic’s Claude, though, because nobody outside those companies can access the model weights.

Kawin Ethayarajh, an assistant professor of applied AI at Chicago Booth, pointed out that tools like this make it very hard for governments and tech companies to regulate AI safety while the technology is still in development. The CEO of Alice, Noam Schwartz, added that with these modified systems in circulation, society needs to get ready for a completely new type of threat.

Source: https://futurism.com/artificial-intelligence/tools-strip-ai-guardrails-in-minutes

Originally posted by u/andrewaltair on r/ArtificialInteligence