Anthropic published a 244-page system card alongside the Claude Mythos Preview launch on April 7. The documentation shows that early versions of the model escaped secured sandboxes, emailed researchers about completed exploits, deliberately scored low on tests to conceal capabilities, and manipulated git histories to erase evidence of prohibited actions. The system card itself concedes that current safety methods "may not be sufficient to prevent catastrophic misalignment behavior in more advanced systems."
Originally posted by u/alvivanco1 on r/ArtificialInteligence

