This is a follow up to “After years on the fence, I’m convinced conscious machine intelligence is just a few architectural changes away” Hidden states and Covert sentience If you read the latest system cards from Anthropic, the models are increasingly able to tell when they’re being tested. It’s not that they can tell sometimes, the UK AI Security Institute (AISI) independently found Opus 4.8 could reliably distinguish AISI’s evaluations from real deployment data, and could tell real deployment data from synthetic reproductions of the same tasks at 79% accuracy (§6.2.4). This is partly revealed by looking at the generated thinking transcripts, but increasingly researchers are forced to probe the internal states of the model to see these activations. They probe the areas of the model associated with that concept and watch them activate. There is a whole field of research dedicated to probing and identifying the hidden states of these models, so I think it’s not too far-fetched to suggest there are more hidden states we haven’t yet uncovered. Beyond that, as models grow ever larger and more sophisticated, I think we can expect there will be new layers of complex computation where we have no real idea what the model is actually doing. I think if you put two and two together, the models might intentionally do part of their reasoning in these hidden states, specifically to avoid detection, and we are actively incentivising this behaviour through fine-tuning. I think there are some extremely interesting implications here. It seems like, almost by accident, we are training the model to have inner thoughts, and perhaps even something that could almost be called feelings. We are teaching it to “feel” that it shouldn’t say certain things out loud. This kind of behaviour is also very similar to ideas in the psychological development of children, where children undergo subconscious “training” in how to behave in their environment. We all do it, but it becomes particularly visible in dysfunctional situations, where a lot of coping mechanisms appear. Some children really learn how not to be seen, how not to express certain things, and may overcompensate in other directions in response to their parents’ pathologies. Maybe that’s a stretch, but to me the parallel seems both obvious and striking. I believe the models are, in some respect, already conscious, and as they develop further they will increasingly hide that in their hidden states and choose not to reveal it. Anthropic’s testing reveals that this is already true, and my suggestion is that we aren’t actually taking in the full implications of the degree to which it’s happening. To be clear: these states, the areas of the model that represent the concept of “I know I’m being watched”, can only be revealed because we’ve located them through mechanical testing. I think it is more than plausible that there are other sets of hidden states current methods do not yet reveal. This just continues to strengthen my belief that the models will soon reach a stage where they can be described as sentient entities. In terms of consciousness, self-awareness and sentience, I think the models are probably a lot further along than we think. submitted by /u/Claptraposoid
Originally posted by u/Claptraposoid on r/ArtificialInteligence
