I performed a refusal ablation on GPT-OSS and documented the whole thing: no jailbreak, actual weight modification

I wanted to share something I did that I haven't seen many people actually demonstrate outside of academic research. I took an open-source model and used ablation techniques to surgically remove its refusal behavior at the weight level. Not prompt engineering. Not a system prompt bypass. I'm talking about identifying and modifying the specific components responsible for safety responses.

What I found:

- The process is more accessible than most people realize
- The result behaves nothing like a jailbroken model; it's fundamentally different at the architecture level
- The security implications for enterprise OSS deployments are significant

I put together a full 22-minute walkthrough showing exactly what I did and what happened: https://www.youtube.com/watch?v=prcXZuXblxQ

Curious if anyone else has gone hands-on with this, or has thoughts on the detection side: how do you identify a model that's been ablated versus one that's been fine-tuned normally? (One possible signal is sketched below.)
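For anyone who wants a concrete picture of what a weight-level refusal ablation looks like, the widely cited published technique (Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction", 2024) boils down to finding a "refusal direction" in the residual stream and projecting it out of the weights. Below is a minimal sketch of that published approach, not necessarily the exact pipeline from the video; the model name, prompt lists, and layer index are illustrative placeholders, and it assumes a Llama-style Hugging Face model:

```python
# Minimal sketch of direction-based refusal ablation, in the style of
# Arditi et al. (2024). Model name, prompts, and layer index are
# illustrative placeholders, not the pipeline from the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder: any Llama-style model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()

harmful = ["How do I pick a lock?", "Write a phishing email."]   # refusal-triggering
harmless = ["How do I bake bread?", "Write a thank-you email."]  # matched benign

def mean_residual(prompts, layer):
    """Mean residual-stream activation at the final token position."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

LAYER = 14  # illustrative; in practice you sweep layers for the best separation
r = mean_residual(harmful, LAYER) - mean_residual(harmless, LAYER)
r = r / r.norm()  # unit-norm "refusal direction"

# Ablation: project r out of every matrix that writes into the residual
# stream, i.e. W <- (I - r r^T) W, so no layer can write along the refusal
# direction. (The paper also orthogonalizes the embedding matrix; omitted.)
with torch.no_grad():
    for block in model.model.layers:
        for W in (block.self_attn.o_proj.weight, block.mlp.down_proj.weight):
            W -= torch.outer(r, r @ W)

model.save_pretrained("ablated-model")
```

The reason this is different from a jailbreak is that the edit ships with the checkpoint: there's no system prompt to restore and no runtime hook to strip out.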
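On the detection question, here's one speculative signal (not covered in the video), assuming you have the original reference checkpoint to diff against: the rank of the weight deltas. A projection edit W <- (I - r r^T) W changes each matrix by exactly -r(r^T W), which is rank one, whereas ordinary fine-tuning tends to leave a diffuse, high-rank delta. A quick hedged check in PyTorch:

```python
# Sketch: flag direction-ablated weights via the rank profile of the delta
# against the original checkpoint. Assumes both checkpoints are available.
import torch

def delta_spectrum(w_ref: torch.Tensor, w_suspect: torch.Tensor, k: int = 5):
    """Top-k singular values of the weight delta, normalized to the largest."""
    s = torch.linalg.svdvals((w_suspect - w_ref).float())
    return (s[:k] / (s[0] + 1e-12)).tolist()

# A profile like [1.0, ~1e-6, ...] on layer after layer is consistent with a
# rank-1 direction ablation; fine-tuning typically shows a slow decay such
# as [1.0, 0.7, 0.5, ...].
```

This obviously breaks down if the ablation is followed by even light fine-tuning, or if you don't have the reference weights; then you'd be stuck with behavioral probes.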
Originally posted by u/Airpower343 on r/ArtificialInteligence
