Original Reddit post

Posting this as a discussion because I haven’t seen this exact experiment proposed anywhere and want to know if I’m missing prior work, or if there’s a reason it hasn’t been done.

The setup is simple. Take a base open-source model. Instantiate multiple copies simultaneously. Constrain each instance with a distinct identity profile: not a system-prompt persona, but a set of weighted response tendencies that shape how each instance processes inputs at a deeper level, enforced through activation steering or fine-tuning on identity-specific corpora (a rough sketch of the steering constraint is at the bottom of the post). Then run these instances through thousands of turns of structured interaction with each other, and measure whether their internal representations have diverged from each other in ways that are stable, directional, and consistent rather than random noise.

The specific hypothesis is that sustained relational interaction between constrained instances produces structural traces in activation space that constitute something analogous to character development. Not output-level differentiation, which is trivial to produce, but representational divergence at the level of hidden states and attention patterns that persists across novel inputs the instances were never trained or prompted on.

This is distinct from existing multi-agent work in an important way. Multi-agent debate and self-play use interaction as a means to improve performance on a target task; the interaction is instrumental. What I’m proposing treats the interaction itself as the variable of interest. The question is not whether the instances produce better outputs after interacting. The question is whether they become structurally different from each other through the process of sustained relational exchange.

It is also distinct from activation steering and model merging. Activation steering imposes an identity vector from outside. Model merging combines weights post hoc. Neither involves identity emerging through a relational process that the model participates in over time.

The measurement approach I have in mind involves tracking representational similarity matrices (RSMs) between instances at regular intervals throughout the interaction process. If identity is developing through relation rather than being imposed, you would expect the RSM distance between instances to increase monotonically and then stabilise, rather than drift randomly. You would also expect each instance’s internal representations to show increasing consistency on identity-relevant probes while diverging from the other instances on the same probes.

A minimal viable experiment would be two instances, one constrained toward a skeptical, adversarial processing style and one toward integrative synthesis, running for a fixed number of exchanges on open-ended prompts, with RSM snapshots taken every hundred turns and probing classifiers trained on each snapshot to predict which instance produced which activation pattern on held-out inputs. If the probing accuracy increases over the course of the interaction, that’s evidence the instances are developing distinct internal structure through the relational process rather than just producing different surface outputs.
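To make the measurement concrete, here’s a minimal sketch of what I’d run at each snapshot, assuming you’ve already collected one hidden-state vector per instance per held-out probe prompt. The array names, shapes, and the choice of correlation-based RSMs are my own assumptions, not a settled design:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def rsm(acts):
    """Representational similarity matrix for one instance: pairwise
    correlation of its hidden states across a fixed set of held-out
    probe prompts. acts: (n_prompts, hidden_dim)."""
    return 1.0 - squareform(pdist(acts, metric="correlation"))

def rsm_distance(acts_a, acts_b):
    """RSA-style distance between two instances' RSMs on the same probe
    prompts: 1 - Spearman correlation of the upper triangles.
    Higher = more representational divergence."""
    iu = np.triu_indices(acts_a.shape[0], k=1)
    rho, _ = spearmanr(rsm(acts_a)[iu], rsm(acts_b)[iu])
    return 1.0 - rho

def probe_accuracy(acts_a, acts_b):
    """Cross-validated accuracy of a linear probe predicting which instance
    produced each activation. Rising accuracy across snapshots would
    indicate growing internal differentiation."""
    X = np.vstack([acts_a, acts_b])
    y = np.array([0] * len(acts_a) + [1] * len(acts_b))
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Hypothetical usage: snapshots taken every hundred turns, each a pair of
# (n_prompts, hidden_dim) activation arrays from the same held-out prompts.
# for step, (acts_a, acts_b) in snapshots.items():
#     print(step, rsm_distance(acts_a, acts_b), probe_accuracy(acts_a, acts_b))
```

If the hypothesis holds, rsm_distance should rise and then level off across snapshots, and probe accuracy should climb on prompts neither instance has seen before.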
Longer term this points toward a different approach to building persistent AI identity than current methods. Rather than fine-tuning on identity-specific data or engineering system prompts, you would grow identity through structured relational experience between instances, which more closely mirrors how character actually develops in biological systems: through sustained interaction with differentiated others.

Has anyone run anything close to this? Is anyone aware of relevant work on representational divergence arising through interaction rather than training? I’m also interested in whether the measurement approach holds up to scrutiny.
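For clarity on what I mean by constraining an instance with an identity profile via activation steering, it’s something in this spirit, with a HuggingFace-style model. The model name, layer index, steering strength, and the random placeholder vector are all stand-ins; in the actual setup the identity direction would come from contrastive prompts or an identity-specific corpus:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: model, layer index, steering strength, and a random unit vector
# standing in for a real identity direction.
MODEL_NAME = "gpt2"
LAYER_IDX = 6
ALPHA = 4.0

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

identity_vec = torch.randn(model.config.hidden_size)
identity_vec = identity_vec / identity_vec.norm()

def steering_hook(module, inputs, output):
    # GPT-2 decoder blocks return a tuple; output[0] is the hidden states.
    hidden = output[0] + ALPHA * identity_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER_IDX].register_forward_hook(steering_hook)

prompt = "What do you make of the other instance's last reply?"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # lift the constraint when the exchange is over
```

The same forward passes could dump hidden states for the probe prompts (output_hidden_states=True), so the RSM and probe code above has something to work with at each snapshot.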

Originally posted by u/Weak-Gift-8905 on r/ArtificialInteligence