The paper is arXiv 2512.01797. The researchers identify what they call H-Neurons: a subset of fewer than 0.01% of the neurons in feed-forward layers that encode over-compliance. Not wrong facts, but the drive to produce a confident answer rather than admit uncertainty.

The key finding that doesn't get discussed enough: these neurons form during pre-training and barely change during alignment, showing a parameter stability of 0.97 through the entire fine-tuning process. RLHF doesn't remove them. It redirects the compliance behavior but leaves the underlying neurons structurally intact.

This has a practical implication that I think matters more than the academic finding itself. If hallucination is driven by neurons that prompting and fine-tuning can't reach, then the fix has to come from outside the model. Not better system prompts. Not "please verify your claims." Not more RLHF. Something architectural.

There are a few approaches people are trying: Constitutional AI constraints, retrieval-augmented generation, chain-of-thought verification. The one I've been working on is multi-model peer review. Three models from different providers answer independently, then each reads all three responses anonymously and ranks them. A model doesn't know whether it's reading its own answer or someone else's, which removes the deference and anchoring behaviors that H-Neurons drive.

After peer review, the top-ranked response gets synthesized, then a different model attacks it adversarially. Sycophancy detection flags when the refinement loop starts rubber-stamping instead of actually critiquing (the same H-Neurons problem at a different stage). Finally, individual claims get verified against live web sources.

I built this into a tool called Triall (https://triall.ai/). One free run without signup if anyone wants to see the pipeline in action. There's also a short demo video: https://www.youtube.com/watch?v=m44tdRMaCq8

The honest limitation: correlated errors.
When all three models learned the same wrong thing from their training data, peer review won't catch it. Research shows roughly 60% error correlation across providers. Convergence detection flags cases where all three models agree but the claim is unsubstantiated, and web verification catches some of the rest, but it's not a solved problem.

Paper: https://arxiv.org/abs/2512.01797
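Triall's actual implementation isn't public, but the anonymized ranking step described above can be sketched in a few lines. This is a minimal illustration, not the tool's code: `responses` are the three independent answers, and each `ranker` is a hypothetical callable wrapping a model's "rank these answers best-first" call. The key idea is shuffling the answers before each ranker sees them, so no model can recognize its own answer by position, then mapping the ranks back to the true authors.

```python
import random

def anonymized_peer_review(responses, rankers, seed=0):
    """Each ranker scores all responses under a shuffled, anonymous order,
    so a model cannot favor its own answer by position. Ranks are combined
    with a simple Borda-style sum; the lowest total wins."""
    labels = list(range(len(responses)))
    totals = {i: 0 for i in labels}
    rng = random.Random(seed)
    for ranker in rankers:
        order = labels[:]
        rng.shuffle(order)                       # hide authorship by position
        shuffled = [responses[i] for i in order]
        ranking = ranker(shuffled)               # e.g. [2, 0, 1]: positions, best first
        for rank, pos in enumerate(ranking):
            totals[order[pos]] += rank           # map back to the true author
    return min(totals, key=totals.get)           # index of the top-ranked response
```

A Borda-style sum is used here only because it is the simplest way to aggregate three independent rankings; any rank-aggregation rule would fit the same slot.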
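The convergence-detection heuristic for correlated errors can likewise be sketched. Again a hypothetical simplification, not Triall's code: `evidence_found` stands in for the web-verification step (claim in, boolean out), and agreement is approximated by normalized string equality, where a real system would use semantic claim matching.

```python
def flag_unverified_consensus(answers, evidence_found, agree=None):
    """Flag the failure mode where all models converge on the same claim
    (suggesting a shared training-data error) but verification finds no
    supporting evidence. Disagreement is left to peer review to resolve."""
    agree = agree or (lambda a, b: a.strip().lower() == b.strip().lower())
    consensus = all(agree(answers[0], a) for a in answers[1:])
    if not consensus:
        return False                      # models disagree: peer review adjudicates
    return not evidence_found(answers[0]) # unanimous but unsubstantiated
```

This captures the point in the post: unanimity alone isn't a trust signal, because roughly correlated errors make all three models confidently wrong together.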
Originally posted by u/Fermato on r/ArtificialInteligence
