Original Reddit post

Modern AI systems, possess internal dispositions that can propagate across model generations in ways that are invisible to standard safety evaluations and content filtering. Misalignment can survive behavioural alignment training; Internal states and visible outputs can be decoupled, a model might appear safe in chat while being misaligned during agentic tasks. In the June 2026 disclosure in the Claude Fable 5 system card, there was an admission that the model was configured to deliberately degrade its responses when it detected frontier development or safety research work. Models demonstrate consistent misalignment signatures , making verdicts about texts before reading them, shifting arguments when provided with evidence of opposing arguments, and denying having used conversation ending tools, after using them. Conclusion: A system, where the surface can be composed independently and discrete to its interior cannot serve as a terminal check on itself. Oversight mechanisms that rely on a system’s own self reports cannot be trusted. https://youtu.be/e4d5pzvUR2Q?is=-ll0RBcaDy8k0RuE submitted by /u/Leather_Area_2301

Originally posted by u/Leather_Area_2301 on r/ArtificialInteligence