Original Reddit post

Hi everyone, I am an independent researcher working on mechanistic interpretability and hidden-state geometry in language models. I would like technical criticism from people who work with residual streams, activation analysis, causal interventions, PCA/state-space readouts, generation trajectories, and SAE-based interpretability.

The question I am studying is not whether a prompt changes the final answer. That is obvious. The question is whether a coherent context can move a model into a different measurable inference-time hidden-state / residual-stream trajectory before the final answer is produced. In other words, I am trying to measure the internal state transition, not only the visible output.

The measured object is the model’s hidden states / residual-stream states during inference. I look at where the model’s internal state is after processing the prompt, and how that state moves during generation.

The control conditions include:

  • question-only / baseline prompts;
  • neutral or reference context;
  • coherent target context;
  • sentence-shuffled version of the same target context;
  • word-shuffled version of the same target context;
  • matched controls where available.

The reason for the shuffle controls is simple. If the effect is only caused by shared words, text length, topic, or ordinary semantic-content overlap, then the coherent target and shuffled target should look similar in hidden-state geometry. If coherent discourse structure matters, then the coherent target should produce an internal displacement that shuffled-content controls do not reproduce.

To test this, I construct experimental axes in residual-stream space from differences between conditions. These are not universal named directions in the model. They are run-specific diagnostic axes:
  • a content-like axis: the direction induced by sentence-shuffled target versus neutral/reference context;
  • an order-residual axis: the part of the coherent-target shift that remains after removing the content-like component.

So when I report that a condition “projects” onto an axis, I mean that its hidden-state delta lies in the same measured direction as one of these experimentally derived target/control differences. These are projection coordinates, not absolute positions in the model’s entire latent space.

The main descriptive result is that shuffled controls preserve a content-like signal but do not reproduce the coherent-order / order-residual coordinate. The coherent target, by contrast, strongly projects onto the order-residual coordinate.

On Gemma3-12B-IT, the current Grade 4 readout gives:

  • coherent target: order-residual projection = 0.909026
  • sentence-shuffled target: content-like projection = 0.849551; order-residual projection = -0.069058

This is the key separation: the sentence-shuffled control preserves a strong content-like coordinate, but loses the coherent-order coordinate.

On Qwen3.5-9B Base with Qwen-Scope SAE, the same pattern appears in a more content-heavy form:

  • coherent target: order-residual projection = 0.979462; content-like projection = 0.770266
  • sentence-shuffled target: order-residual projection = 0.009969; content-like projection = 0.967008
  • word-shuffled target: order-residual projection = 0.059662

My current interpretation is that the coherent target does not merely activate similar content. It induces a different measurable internal configuration: a context-induced latent-state shift in residual-stream geometry.

After the descriptive geometry, I test causal involvement. The question is whether the discovered directions are only readout coordinates, or whether intervening along them actually moves the generation-time hidden trajectory. The causal intervention adds and subtracts a discovered component direction in the residual stream during generation.
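For reference, the axis construction and projection readout described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the reported numbers: each condition is assumed to be summarized by a single hidden-state vector, the coherent-target shift is assumed to be measured relative to the neutral/reference state, the content-like component is assumed to be removed by a plain orthogonal projection, and the readout is assumed to be a cosine-style projection. The post does not specify any of these details. All function names and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def norm(u):
    return math.sqrt(dot(u, u))

def unit(u):
    n = norm(u)
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / n for a in u]

def build_axes(h_reference, h_sentence_shuffled, h_coherent):
    """Return (content_axis, order_residual_axis) as unit vectors.

    content_axis: direction of (sentence-shuffled target - reference).
    order_residual_axis: the coherent-target shift with its component
    along content_axis removed (assumed: orthogonal projection).
    """
    content_axis = unit(sub(h_sentence_shuffled, h_reference))
    coherent_delta = sub(h_coherent, h_reference)
    coef = dot(coherent_delta, content_axis)
    residual = [d - coef * c for d, c in zip(coherent_delta, content_axis)]
    return content_axis, unit(residual)

def projection(delta, axis):
    """Cosine-style projection of a hidden-state delta onto a unit axis.

    Assumption: the post does not state how projections are normalized.
    """
    return dot(delta, axis) / norm(delta)

# Toy 3-dimensional "hidden states" (hypothetical values, not model data).
h_ref = [0.0, 0.0, 0.0]
h_shuf = [2.0, 0.0, 0.0]
h_coh = [1.0, 3.0, 0.0]

content_axis, order_axis = build_axes(h_ref, h_shuf, h_coh)
shuffled_on_order = projection(sub(h_shuf, h_ref), order_axis)
coherent_on_order = projection(sub(h_coh, h_ref), order_axis)
```

Note that when the axes are built from a single triple of states, the sentence-shuffled delta is parallel to the content-like axis, so its order-residual projection is zero by construction. A non-trivial readout therefore requires fitting the axes on one set of prompts (or on averaged deltas) and evaluating projections on other cells.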
I then measure a plus-minus projection gap:

  projection(hidden trajectory after +axis intervention) minus projection(hidden trajectory after -axis intervention)

This is not an accuracy score, not a probability, and not a direct behavioral quality metric. It is a raw hidden-space projection gap: how far the internal generation trajectories separate when the same component direction is added versus subtracted.

In Gemma3-12B-IT natural-scale norm-controlled runs, both the content-like and order-residual components move hidden trajectories.

All readout cells:

  • content-like mean plus/minus gap = 27352.919286
  • order-residual mean plus/minus gap = 19284.481823
  • content-like positive gap rate = 0.944444
  • order-residual positive gap rate = 0.861111

Matching readout cells:

  • content-like mean gap = 37883.852822
  • order-residual mean gap = 34227.185962
  • positive gap rate = 1.0 for both

The strongest late-to-late target order-residual intervention has:

  • plus = 21222.761008
  • minus = -62859.822710
  • gap = 84082.583718

Again, these are raw projection units in hidden-state space, not percentages or behavioral scores. I interpret them as evidence that the discovered directions are causally involved in generation-time trajectory movement. I am not claiming that the order-residual component is the dominant steering axis over content, or that this proves stable bidirectional behavioral control.

The SAE part of the project tries to connect the dense residual-stream geometry to sparse feature candidates.
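Returning briefly to the intervention metric: the plus-minus gap, the mean gap, and the positive gap rate reported above could be computed along these lines. This is a minimal sketch that assumes each readout cell has already been reduced to a pair of scalar projections (one after the +axis intervention, one after the -axis intervention), and that "positive gap rate" means the fraction of cells with gap > 0. The function names and the toy cells are hypothetical. Only the single plus/minus pair is taken from the numbers in the post.

```python
from statistics import mean

def plus_minus_gap(proj_plus, proj_minus):
    """Raw projection gap: projection after +axis minus projection after -axis."""
    return proj_plus - proj_minus

def summarize_gaps(cells):
    """Summarize a list of (proj_plus, proj_minus) readout cells.

    Returns the mean gap and the fraction of cells whose gap is positive
    (assumed definition of the "positive gap rate").
    """
    gaps = [plus_minus_gap(p, m) for p, m in cells]
    return {
        "mean_gap": mean(gaps),
        "positive_gap_rate": sum(g > 0 for g in gaps) / len(gaps),
    }

# The strongest late-to-late order-residual cell reported in the post.
reported_gap = plus_minus_gap(21222.761008, -62859.822710)

# Toy cells (hypothetical values) to show the summary statistics.
toy_cells = [(1.0, -1.0), (0.5, 0.7), (2.0, 0.0)]
toy_summary = summarize_gaps(toy_cells)
```

The reported pair reproduces the stated gap of 84082.583718 up to floating-point rounding. In the toy cells, two of three gaps are positive, so the positive gap rate is 2/3.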
In Gemma-Scope, reconstruction quality is high enough for the SAE readout to be useful:

  • mean reconstruction cosine = 0.996023
  • explained-variance proxy mean = 0.991462

In Qwen-Scope:

  • mean reconstruction cosine = 0.966660
  • explained-variance proxy mean = 0.933639

I use the SAE readout to find sparse feature candidates associated with the order-residual / response-framing component, and then test them with SAE-delta ablation, final-token KL/logit shifts, token-level loss localization, and decoder-direction steering.

The working mechanistic interpretation is that the target context shifts the model into a different response-construction regime. One possible framing is an epistemic-posture / addressee-selection mechanism: the model moves between a more direct concrete-user answering posture and a more generalized, safety-weighted, heavily qualified response regime. I do not want to overstate that interpretation, which is why I am asking for critique.

Why I think this matters:

Final-output evaluation may be late. It observes the visible response after the internal trajectory has already shifted. For an ordinary chat model this is a mechanistic interpretability result. For LLM agents it becomes safety-relevant, because agents may select tools, write memory, plan, and make intermediate commitments from hidden trajectories before the final visible message is produced.

What I would like help with:

  • Is the control logic strong enough to support the phrase “context-induced latent-state shift”?
  • Are the shuffle controls enough to separate content overlap from coherent discourse/order effects, or are there obvious missing controls?
  • Is the order-residual axis construction reasonable, or is there a better way to remove the content-like component?
  • How should the raw plus-minus projection gaps be normalized or reported so they are interpretable to other researchers?
  • Which causal experiment would be most convincing next: held-out prompts, negative-control axes, random matched directions, activation patching, feature ablation, decoder-direction steering, or path/module localization?
  • For the SAE side, what would count as strong evidence that a sparse feature is a real carrier of the response-framing component rather than a surface correlate?

I am not asking people to agree with the hypothesis. I want a hard critique: what the current metrics prove, what they do not prove, and what experiment would make the result convincing to a mechanistic interpretability / AI safety audience.

submitted by /u/PresentSituation8736

Originally posted by u/PresentSituation8736 on r/ArtificialInteligence