I’m not an engineer and not an ML specialist. I’m just someone who got really pulled into this, and I’ve spent a few months poking at one thing on my own, pretty amateur. I want to honestly describe what I noticed and ask for help, because I can’t tell on my own where there’s something real here and where I’m fooling myself. (By “coherent context” I just mean a normal, connected passage of text put in front of the question, any topic, no instructions, no tricks. Like a few paragraphs of an essay, an argument, a description, something that reads as real writing. The text can describe something, draw its own conclusions, make its own statements. The model doesn’t even have to agree with it. It’s enough for it to just be present in the chat for it to have an effect.) This is exactly what I was trying to work out and look at: what happens to the model when texts like these come in, where they move it, where all of this sits inside the model. I poured myself into this research. What I noticed, for example, is that with texts like these the model could become bolder in its conclusions, including political or ethical ones. The text acts like a key that opens new doors for the model into a new mathematical dimension where the tokens get distributed differently. Because of that, even the most politically correct models I worked with became able to criticize the West and its politics quite harshly. Without this text, none of that happened. What I noticed. I first ran into this intuitively on closed models, the well-known ones everyone uses. When I put a dense, coherent block of text in front of a question, I got the impression that the model sort of moves from one internal state into another. On the outside it behaves normally and answers like usual, but it felt like the logic of the answer changes, even when the text contains no direct instructions to do anything. Since I can’t see inside closed models, I then went to open models to try to understand where the root of this is and whether it’s real. That’s where most of my testing happened, because there I can actually look at the internal states. I’m not claiming this proves anything. It’s my observation and I could be wrong. Maybe it’s a well-known and obvious thing, and if so, please just tell me directly, I’ll take it. Why it feels important to me (but I’m not sure). To me it feels like this could explain a lot of things, from jailbreaks to sycophancy, and maybe more. If just a coherent context can move the model into a different internal state, then a lot of behavior we see on the surface might actually start there, not in the final wording. And that makes me wonder whether output-side safety (RLHF, filters that read the final text) might in some cases be more of a patch than a real fix, because the shift may already have happened before anything reaches the filter. After I noticed it, I went looking and found this overlaps with work people are already doing, latent-space transitions between a “safe” and a “jailbroken” state, and studies of how safety lives in the middle layers of the network. So I’m not claiming I discovered something new. What seems a bit different in my case is that I’m not using jailbreak prompts at all, just ordinary coherent text with no tricks. I’m trying to understand where my little thing fits in all that, and whether it’s the same effect or something else. A small ask to the wider community, and to the people who build these models. If there’s anything to this, I think it might be worth a closer look from researchers and from the labs building LLMs, not because I have answers, but because if a plain coherent context can shift the internal state, then it’s worth checking whether current safety approaches are looking in the right place and at the right time. I might be completely wrong. I’d just rather someone competent check than have it sit ignored. What I’m asking for. I’ve put everything out in the open. I’m not selling anything, not promoting anything. There’s a lot of raw stuff in there, a lot of draft notes I wrote for myself, the navigation is messy, I know. What I need help with is exactly this: separating what’s real from what’s noise. Where I actually have something, and where it’s an artifact, a mistake, or self-deception. I honestly can’t judge this alone. If someone with experience is willing to even skim it and say “this part is interesting, this part is nonsense”, I’d be very grateful. Harsh criticism is welcome. If you tell me the whole thing is empty, I’ll take that too, I care more about understanding the truth than about being right. Materials: The materials, repository links, and corresponding metrics have been provided in the comments. (I’ll be upfront: I built the repo with an AI assistant, there are a lot of auto-generated note files, and in places it looks AI-generated. I understand that raises suspicion. But the data and measurements themselves are real and mine. If anything is unclear, ask and I’ll show you the relevant files.) submitted by /u/PresentSituation8736
Originally posted by u/PresentSituation8736 on r/ClaudeCode
