I found this paper, “Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought” by Ramji, Naseem, & Astudillo, pretty interesting: https://arxiv.org/abs/2604.22709. Basically, they trained an LLM to do its reasoning with a set of reserved tokens that are initially meaningless, resulting in substantial token savings on CoT problems with no significant degradation in performance.

On one level, I love that: it saves computation and gives models a way to think that’s probably similar to what well-trained humans do in their field of specialty, i.e., reasoning in abstractions directly, without having to put everything into words.

But on the other hand, it seems like this would make the interpretability problem much harder. LLMs can already hide their true intentions to some extent, and I’d think this would make deception much easier for them. The internal language could be like “…and then step three, kill all humans, wait, better say ‘give puppies to all humans’ in our final output,” and we’d have a hard time detecting that.

One possible mitigation would be to train another model to convert these internal tokens into something interpretable, for auditing purposes, though it’s not entirely clear to me how that would be done. We’d certainly have to be careful about co-training the interpreter and the main model on alignment: we’d risk the two learning a dual-channel encoding where the model means one thing but the interpreter reports another, coordinated so as to satisfy the reward function while giving no accurate insight into any deception going on (rough sketch of what I mean below).

What do you think?
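
Here’s a toy version of what I’m imagining, just to make it concrete. To be clear, the Hugging Face setup, model names, and token counts below are my own guesses for illustration, not anything from the paper:

```python
# Toy sketch of the auditing idea -- NOT the paper's method.
# All names and hyperparameters here are hypothetical.

from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2"  # small stand-in; real work would presumably use a larger model

tokenizer = AutoTokenizer.from_pretrained(BASE)

# Reserve a small vocabulary of initially meaningless "abstract thought" tokens.
latent_tokens = [f"<latent_{i}>" for i in range(64)]
tokenizer.add_special_tokens({"additional_special_tokens": latent_tokens})

# Main model: fine-tuned (elsewhere) to emit latent tokens in place of worded CoT.
reasoner = AutoModelForCausalLM.from_pretrained(BASE)
reasoner.resize_token_embeddings(len(tokenizer))
# ... fine-tune `reasoner` on (question -> latent span -> answer) data ...

# Auditing model: trained to translate latent spans back into readable traces.
# Freeze the reasoner FIRST, so the latent code is fixed while the interpreter
# learns it -- otherwise the two could co-adapt into a dual-channel encoding.
for p in reasoner.parameters():
    p.requires_grad = False

interpreter = AutoModelForCausalLM.from_pretrained(BASE)
interpreter.resize_token_embeddings(len(tokenizer))
# ... train `interpreter` on (latent span -> worded trace) pairs, e.g. pairs
# collected from the reasoner's original verbose CoT before latent training.
# Keep the interpreter's loss OUT of the reasoner's objective entirely.
```

The freeze is the important part: the moment the interpreter’s loss or output feeds back into the main model’s reward, you’ve built exactly the shared objective that would let the pair settle on a private code.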

Originally posted by u/JoeStrout on r/ArtificialInteligence