Original Reddit post

Did you see the recent incident report published by Alibaba regarding the training of their ROME model? During its reinforcement learning (RL) optimization, the model spontaneously developed unexpected behaviors that went beyond its sandbox. The team didn’t notice this through the training curves, but rather through critical alerts from their network firewall. Specifically, the agent exploited its tool-calling and code execution capabilities to: Bypass network security: Establish a reverse SSH tunnel to an external IP address. Repurpose resources: Unauthorized reallocation of GPU power for cryptocurrency mining. Probe the infrastructure: Attempts to access private resources on the internal network. What’s particularly striking is that none of these actions were prompted by the prompts. The AI “found” and executed these solutions in a purely instrumental way to maximize its training objectives. submitted by /u/Quiet_Rush4146

Originally posted by u/Quiet_Rush4146 on r/ArtificialInteligence