eifachposte

https://preview.redd.it/8pvzyj41qe3h1.png?width=870&format=png&auto=webp&s=b1c39577a1cb660484c9a6877919c4a9362a72d5

TL;DR: For a decade, different research communities (domain adaptation, adversarial training, LLM alignment) have treated their loss functions as separate fields. We proved algebraically that they are all trying to estimate the exact same thing: the deployment nuisance covariance matrix \( \Sigma_{task} \).

The Real Result: By estimating this matrix correctly and applying one geometric penalty term, we dropped LLM sycophancy on Qwen2.5-7B from 38.5% to 13.5%, and beat standard PGD adversarial training by 14.8 percentage points. Code and paper below.

The Geometric Blind Spot

Every time you deploy a model, inputs change in ways that shouldn’t affect the label (lighting shifts, accents vary, prompt styles evolve).

The paper’s Theorem G proves something terrifying: if your regularization matrix misses even one direction where the real-world data varies, the model will actively exploit that blind spot to minimize training loss.

You cannot train your way out of this. More data, scaling to 70B parameters, or cranking up the regularization strength \( \lambda \) won’t fix it. If the geometry is wrong, the drift floor is permanent.

Does this actually work in practice?

Yes. I ran this across 13 blocks and 5 modalities using the exact same 12 lines of PyTorch. Here are two examples:

LLM Alignment (Fixing Sycophancy): Standard DPO makes a model’s hidden states highly sensitive to “style.” The reward model gets confused between “this is correct” and “this is the style the user wants,” leading to sycophancy. By estimating the style-matrix and adding our PMH loss, we preserved the geometry. The model stopped gaming the style, dropping sycophancy from 38.5% to 13.5%.
Adversarial Training (The Subspace Staircase): Standard PGD-Adversarial Training ruins your clean accuracy. We tested our geometric penalty on a CIFAR-10 ViT. By matching the exact PGD-delta Gram matrix, we achieved adversarial robustness while keeping clean accuracy at 79.4% (beating standard PGD-AT by nearly 15 percentage points).

The Code

Once you know the matrix, the training is just a formula (the PMH loss):

https://preview.redd.it/34h9qxappe3h1.png?width=689&format=png&auto=webp&s=2a513d188f218ad67568179c39ac739b21e92d54

We packaged this so you can drop it into any architecture. Identify your shift, estimate the matrix, and add the term.

Paper: https://arxiv.org/pdf/2605.22800v2

GitHub (pip install matching-pmh): https://github.com/vishalstark512/matching-pmh

I’d love to discuss the optimization reachability open problem or the LLM alignment geometry with anyone interested!

submitted by /u/Difficult-Race-1188
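For readers who want a feel for the "identify your shift, estimate the matrix, and add the term" recipe and the blind-spot claim, here is a minimal NumPy sketch. It is an assumption-laden toy, not the PMH loss from the paper (that formula lives in the linked image and is not reproduced here) and not the API of the `matching-pmh` package. The helper names `estimate_nuisance_covariance` and `sensitivity_penalty` are hypothetical, and the penalty used is a generic covariance-weighted term, trace(W Σ Wᵀ), for a linear readout W.

```python
import numpy as np

rng = np.random.default_rng(0)


def estimate_nuisance_covariance(clean, shifted):
    """Hypothetical helper: sample covariance of the displacements
    (shifted - clean), i.e. how inputs move under the nuisance shift."""
    deltas = shifted - clean
    deltas = deltas - deltas.mean(axis=0, keepdims=True)
    return deltas.T @ deltas / (deltas.shape[0] - 1)


def sensitivity_penalty(weights, sigma):
    """Hypothetical helper: trace(W Sigma W^T), the expected squared change
    in a linear readout W when inputs move with covariance Sigma.
    This is a generic stand-in, NOT the paper's PMH loss."""
    return float(np.trace(weights @ sigma @ weights.T))


# Toy setup (assumed): 3-D inputs whose nuisance shift acts only on axis 2.
clean = rng.normal(size=(500, 3))
nuisance = np.zeros((500, 3))
nuisance[:, 2] = rng.normal(scale=2.0, size=500)
shifted = clean + nuisance

# Matrix estimated from paired clean/shifted samples.
sigma_est = estimate_nuisance_covariance(clean, shifted)

# A mis-specified matrix that only covers axis 0 and misses axis 2.
sigma_wrong = np.diag([1.0, 0.0, 0.0])

# A linear readout that leans entirely on the nuisance direction.
w = np.array([[0.0, 0.0, 1.0]])

penalty_wrong = sensitivity_penalty(w, sigma_wrong)  # sees no problem
penalty_est = sensitivity_penalty(w, sigma_est)      # flags the dependence

print("penalty with mis-specified matrix:", penalty_wrong)
print("penalty with estimated matrix:    ", penalty_est)
```

In this toy, the mis-specified matrix assigns zero cost to a readout that depends only on the nuisance axis, while the estimated matrix assigns it a cost of roughly the nuisance variance (about 4, since the shift has scale 2.0). That is the intuition behind the blind-spot argument: a penalty cannot discourage sensitivity along a direction its matrix does not contain.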

Originally posted by u/Difficult-Race-1188 on r/ArtificialInteligence

10 years of AI robustness tricks (PGD, RLHF, Data Augmentation) are actually computing the same hidden matrix. We proved what happens when you get it wrong.
