Paper: Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair (arXiv: 2604.21395)

Paper: https://arxiv.org/abs/2604.21395

Code: https://github.com/vishalstark512/PMH

I want to tell you about a result that genuinely surprised me when it came out of the experiments, and I think it will surprise you too. PGD adversarial training, the gold standard for robustness, makes clean-input geometry worse than no regularization at all. Not marginally worse. Measurably, consistently, mechanistically worse. And we can explain exactly why.

But let me start from the beginning.

**The Setup: What Does ERM Actually Force Your Model to Learn?**

Every production model trained today uses empirical risk minimization. You minimize expected loss on labeled data. Simple. Here's what we proved: any ERM minimizer must retain non-zero Jacobian sensitivity in every direction that predicts training labels, including directions that are pure nuisance at test time. This isn't a training failure. It isn't fixable with more data, bigger models, or longer training. It's a theorem about what the supervised objective *is*.

The formal statement lower-bounds the sensitivity of any encoder φ* minimizing supervised loss along a nuisance feature n that has correlation ρ with the labels. The right-hand side of the bound is strictly positive and independent of model capacity and dataset size. It depends only on the data distribution. The bound holds for MSE, cross-entropy, and any other proper scoring rule.

Plain language: if texture predicts your training labels, your model cannot stop being sensitive to texture. Suppressing it would cost task loss. This is forced.
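For concreteness, the bound has roughly the following shape. This is my schematic paraphrase from the description above, not the paper's exact statement; the constant c(ρ) and its precise dependence on ρ are spelled out in the paper:

```latex
% Schematic shape of the Theorem 1 lower bound (paraphrase, not the exact statement).
% \varphi^*: any ERM-optimal encoder
% n:         a nuisance direction correlated with the labels
% \rho:      the nuisance-label correlation in the training distribution
% c(\rho):   a strictly positive quantity determined by the data distribution alone
\mathbb{E}_{x}\!\left[\, \bigl\| J_{\varphi^*}(x)\, n \bigr\|^{2} \,\right] \;\geq\; c(\rho) \;>\; 0
```

Because c(ρ) depends only on the training distribution, neither more data nor more parameters can drive the left-hand side to zero; that is the sense in which the blind spot is forced.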
**One Theorem, Four Things You Already Knew Were Problems**

This is what I find most interesting about the result. Four empirical findings that were previously treated as separate phenomena with separate explanations turn out to be corollaries of this single structural fact:

- Non-robust features (Ilyas et al. 2019) — ERM must encode any label-correlated direction, including imperceptible ones. Adversarial examples exist in exactly those directions. They transfer across models because the blind spot is determined by the data distribution, not the individual model.
- Texture bias (Geirhos et al. 2019) — When local texture statistics are easier label predictors than global shape, ERM cannot discard them. Texture bias is a geometric consequence of ERM under correlated nuisance, not an architectural inductive bias.
- Corruption fragility (Hendrycks & Dietterich 2019) — Common corruptions perturb exactly the nuisance-sensitive directions that cannot be suppressed under ERM. Degradation under unseen shifts is unavoidable, and its expected magnitude scales with ρ².
- Robustness–accuracy tradeoff (Tsipras et al. 2019) — Suppressing nuisance-correlated directions removes information ERM uses for in-distribution accuracy. The tradeoff isn't architectural. It's the cost of closing a blind spot the supervised objective opened, and its magnitude is predictable from ρ.

These four research programs, years of papers, are all measuring different faces of the same geometric object.

**The PGD Result: This Is The Part That Surprised Me**

Here are the numbers that made me double-check the code three times. PGD achieves the lowest Jacobian Frobenius norm, a 12× reduction from ERM. By every metric the robustness literature has used, PGD is "smoothing" the representations. But its clean-input geometry is worse than ERM's (TDI 1.336 vs 1.093).

The mechanism, which our Corollary 4 predicts: PGD compresses the Jacobian in the adversarial direction, like squeezing a balloon. The sensitivity doesn't disappear; it redistributes into other directions. The Jacobian becomes nearly rank-1 (anisotropy index ≈ 2.1 for PGD vs 32.4 for ERM). When you probe isotropically, which is what TDI does and what you're implicitly doing at test time, those concentrated directions dominate and the geometry is worse.

The field has been reading low Jacobian Frobenius norm as evidence that adversarial training smooths representations. This is wrong. It measures magnitude redistribution, not geometric repair.

**Why CKA, Intrinsic Dimension, and Jacobian Frobenius Norm All Miss This**

This is the diagnostic result. On the exact same comparison (ERM vs PGD vs PMH), every metric the geometric-analysis-of-deep-learning literature uses is blind to Jacobian anisotropy. A model with sensitivity concentrated in one direction (rank-1 Jacobian) looks great on Frobenius norm, small magnitude, but is geometrically broken under isotropic probing. TDI measures expected squared path-length distortion under isotropic perturbation. This is the quantity Theorem 1 bounds. Nothing else measures it.

**Scale Makes It Worse, Not Better**

We measured the blind spot ratio across three BERT-family model sizes. A ratio below 1.0 means the encoder is more sensitive to surface-form variation (nuisance) than to semantic variation (signal). The ratio decreases monotonically with model size. Larger models encode nuisance more precisely, not less, because greater capacity enables more faithful encoding of every label-correlated feature.

This is a direct theoretical prediction, not a post-hoc observation: Theorem 1 says the blind spot magnitude scales with the nuisance-label correlation in the training distribution, and larger models approximate the Bayes predictor more closely, which means they encode the nuisance better. If you've been counting on scale to fix robustness, this result is uncomfortable.

**Fine-Tuning Amplifies the Blind Spot**

We measured paraphrase drift on BERT across three conditions. Task-specific ERM fine-tuning increases the blind spot by 54% relative to the pretrained model. The mechanism is straightforward: task labels introduce new spurious correlations (sentence length predicting sentiment, format predicting preference), and Theorem 1 says the model must encode them.

The implication for RLHF is direct and uncomfortable. Preference labels carry spurious correlations — verbosity, formatting, surface markers of confidence. If the theorem applies (and there's no reason it wouldn't), RLHF is mathematically guaranteed to encode these alongside genuine preference signal. Sycophancy and length bias aren't bugs in a specific implementation. They're theorems about what RLHF does to representations.

**The Fix: One Additional Training Term**

Once you understand the mechanism, the fix is clear. You need to penalize the Jacobian uniformly across all input directions, not in one adversarial direction (PGD) and not in one arbitrary direction (standard augmentation).

Proposition 5 proves: among all zero-mean perturbation distributions, Gaussian noise is the unique distribution that penalizes the Jacobian Frobenius norm uniformly across all input directions. Any other distribution, including adversarial, hits some directions more than others. The proof is one line from the trace formula: E_δ[‖J_φ δ‖²] = Tr(J_φᵀ J_φ Σ_δ) = σ²‖J_φ‖²_F iff Σ_δ = σ²I.

PMH adds one term to the loss:

L_PMH = ‖φ(x) − φ(x + δ)‖², δ ∼ N(0, σ²I)

By a first-order Taylor expansion, the expectation of this term over δ is ≈ σ²‖J_φ‖²_F, directly suppressing the Frobenius norm uniformly. The Gaussian choice isn't heuristic. It's the unique solution.
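Here is a minimal PyTorch-style sketch of what adding the PMH term to a training step looks like. This is my reconstruction from the loss above, not code from the repo; `encoder`, `classifier`, `sigma`, and `lambda_pmh` are placeholder names and hyperparameters you would choose for your own setup.

```python
import torch
import torch.nn.functional as F

def pmh_step_loss(encoder, classifier, x, y, sigma=0.1, lambda_pmh=1.0):
    """Task loss plus the PMH term (sketch, not the official implementation).

    PMH term: ||phi(x) - phi(x + delta)||^2 with delta ~ N(0, sigma^2 I),
    whose expectation over delta is approximately sigma^2 * ||J_phi(x)||_F^2
    by a first-order Taylor expansion.
    """
    z_clean = encoder(x)                        # phi(x)
    task_loss = F.cross_entropy(classifier(z_clean), y)

    # Isotropic Gaussian input perturbation: per Proposition 5, the unique
    # zero-mean choice that penalizes every input direction uniformly.
    delta = sigma * torch.randn_like(x)
    z_noisy = encoder(x + delta)                # phi(x + delta)

    pmh = ((z_clean - z_noisy) ** 2).flatten(1).sum(-1).mean()
    return task_loss + lambda_pmh * pmh
```

The only extra cost is the second encoder pass on x + δ, which lines up with the ~1.3× wall-clock overhead reported in the results below.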
Results across seven tasks, three modalities, and foundation-model scale:

- Vision (CIFAR-10 ViT): −17.3% TDI
- Language (BERT SST-2): −28.7% TDI, −76.9% paraphrase drift
- Foundation scale (ImageNet ViT-B/16): −23.9% TDI
- CIFAR-10-C (official Hendrycks benchmark, 19 corruption types): +14.82pp mean accuracy, wins on 18/19 corruption types
- PGD robustness without adversarial training: 48.94% vs VAT's 32.38% at ε=4/255
- Compute overhead: ~1.3× wall-clock, no architectural changes

The intra-class representation distance increases 64% on ImageNet alongside the TDI reduction — a by-product of suppressing nuisance sensitivity, which forces the encoder to encode class-relevant features more discriminatively.

**The Diagnostic: TDI**

TDI (Trajectory Deviation Index) measures expected squared path-length distortion under isotropic perturbation, the exact quantity Theorem 1 bounds:

TDI(φ, σ) = (1/L) Σ_ℓ E_{x,δ}[‖φ^(1:ℓ)(x+δ) − φ^(1:ℓ)(x)‖²] / E_x[‖φ^(1:ℓ)(x)‖²]

A perfectly isometric encoder scores 0. TDI requires only a forward pass — no access to model weights or architecture. It measures a property the theorem says any model trained on a given distribution must have, not a property of any specific model.

The reason it catches the PGD failure that everything else misses: TDI penalizes Jacobian anisotropy. A rank-1 Jacobian can have a small Frobenius norm and a high TDI simultaneously, because the isotropic probe hits the concentrated direction. Frobenius norm can't see this. TDI is the only measure that can.
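Here is a rough sketch of how one might estimate TDI with forward passes only, following my reading of the formula above rather than the official implementation. `layer_outputs_fn`, the probe count, and σ are placeholders; for token inputs you would inject the Gaussian probe at the embedding layer rather than on raw token ids.

```python
import torch

@torch.no_grad()
def estimate_tdi(layer_outputs_fn, xs, sigma=0.05, n_probes=8):
    """Monte-Carlo estimate of TDI using forward passes only (sketch).

    layer_outputs_fn(x) -> list of per-layer representations [phi^(1:1)(x), ..., phi^(1:L)(x)]
    xs                  -> a batch of clean (continuous) inputs
    """
    clean = layer_outputs_fn(xs)                      # phi^(1:l)(x) for each layer l
    num = [0.0] * len(clean)
    for _ in range(n_probes):
        delta = sigma * torch.randn_like(xs)          # isotropic Gaussian probe
        noisy = layer_outputs_fn(xs + delta)          # phi^(1:l)(x + delta)
        for l, (c, n) in enumerate(zip(clean, noisy)):
            # Accumulate E_{x,delta}[ ||phi(x+delta) - phi(x)||^2 ] per layer
            num[l] += ((n - c) ** 2).flatten(1).sum(-1).mean().item() / n_probes
    # E_x[ ||phi^(1:l)(x)||^2 ] per layer
    den = [(c ** 2).flatten(1).sum(-1).mean().item() for c in clean]
    # Average of per-layer distortion ratios, as in the TDI formula above
    return sum(n_l / d_l for n_l, d_l in zip(num, den)) / len(clean)
```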
**What This Means Practically**

Every production model has this blind spot. Every real-world dataset has features spuriously correlated with labels, so Theorem 1 applies. The shape of the blind spot is determined by your data distribution and is measurable from the data before training, via the spurious correlations in P(y|x). It's not visible to accuracy metrics, CKA, intrinsic dimension, or Jacobian Frobenius norm. It is measurable with TDI in one forward pass.

Adversarial training, as standardly implemented, worsens clean-input geometry while improving one specific adversarial metric. If you care about robustness to distribution shift rather than to specific adversarial attacks, PGD is making your model worse.

PMH repairs the blind spot at every rung of the modern training hierarchy — from scratch, from pretrained backbones, through fine-tuning. One term, one forward-pass overhead, no architectural changes.

If you're fine-tuning on task labels or preference labels, you're actively worsening the blind spot unless you regularize it. This applies to instruction tuning and RLHF.

**Limitations (Being Honest)**

The bound is an existence result, not a tight predictor. The gap between the theoretical lower bound and the observed drift is 10²–10³×. This is expected for existence theorems, but it means you can't use the bound quantitatively to predict a specific model's blind spot magnitude.

PMH requires you to know which input directions are nuisance. On the QM9 molecular regression task, we initially applied noise to atomic positions (which are signal for quantum properties), and the method failed. Redirecting the noise to node features fixed it. The theorem tells you the blind spot exists; you need domain knowledge to find it.

The scale result is three data points (66M, 110M, 340M parameters). The pattern is consistent and theoretically predicted, but it needs replication at larger scales.

This is a preprint, not peer-reviewed. The code is public and the results are reproducible.

**TL;DR**

- ERM provably cannot discard any label-correlated direction. This forces geometric roughness proportional to ρ (the nuisance-label correlation), regardless of capacity or data size.
- Four major empirical findings (non-robust features, texture bias, corruption fragility, the robustness-accuracy tradeoff) are corollaries of the same theorem.
- PGD adversarial training reduces the Jacobian Frobenius norm 12× while worsening clean-input geometry (TDI). The field has been measuring the wrong thing.
- Larger models encode nuisance more precisely. The blind spot ratio worsens from 66M to 340M parameters.
- Task fine-tuning amplifies the blind spot by 54%. RLHF has the same structural property.
- Gaussian noise is the unique perturbation distribution that suppresses the Jacobian uniformly (one-line proof). PMH adds one loss term using this, reduces TDI 17–29% across three modalities, wins 18/19 CIFAR-10-C corruption types, and achieves 48.94% PGD robustness without adversarial training.
- TDI is the only metric that catches the PGD failure. CKA, intrinsic dimension, and the Jacobian Frobenius norm all miss it.

Paper: https://arxiv.org/abs/2604.21395

Code: https://github.com/vishalstark512/PMH

Happy to answer questions about the theory, the experiments, or the TDI diagnostic.
Originally posted by u/Difficult-Race-1188 on r/ArtificialInteligence
