eifachposte

CORAL, PGD adversarial training, data augmentation, and RLHF alignment constraints are not different methods. They are different research communities trying to compute the same matrix, without realizing there is a matrix to compute.

This isn't an analogy. It's algebra. And the consequences of getting that matrix wrong are worse than the field currently understands.

The matrix everyone is estimating

Every robustness problem has the same hidden structure. At deployment, inputs change (lighting shifts, scanner models drift, accents vary, prompt styles evolve) but ground-truth labels stay fixed. The question hiding inside every robustness failure is always identical:

Which directions of input change can the encoder completely ignore while still predicting correctly?

Call the covariance of those directions Σ_task. It's the label-preserving deployment nuisance covariance: the directions in input space that move at deployment without changing the label. Every method below is estimating it.

The derivation

Take Deep CORAL. It minimises ‖C_S^φ − C_T^φ‖²_F, where C_S^φ and C_T^φ are the source and target feature covariances. Linearise the encoder around the source mean:

    C_S^φ − C_T^φ ≈ J_φ (Cov_S(x) − Cov_T(x)) J_φᵀ = J_φ Σ_dom J_φᵀ

    ‖J_φ Σ_dom J_φᵀ‖²_F ≤ ‖J_φ‖²_op · ‖Σ_dom‖_op · Tr(J_φᵀ J_φ Σ_dom)

The bound uses ‖M‖²_F ≤ ‖M‖_op · Tr(M) for the matrix M = J_φ Σ_dom J_φᵀ, which holds when Σ_dom is PSD. That last term is a Jacobian penalty along Σ_dom = Cov(x_T − x_S), which is exactly the deployment nuisance covariance. CORAL is not doing domain alignment. It is penalising the encoder's Jacobian along Σ_task, up to bounded operator-norm factors.

Same derivation for augmentation:

    E_{x,k}[ℒ(θ; a_k(x))] = E_x[ℒ(θ; x)] + ½ E_x[Tr(J_φᵀ H_φ J_φ Σ_aug)] + O(‖δ‖³)

    where Σ_aug = (1/K) ∑_k E_x[δ_k δ_kᵀ]

Here δ_k is the delta induced by augmentation a_k. Augmentation is Jacobian penalisation along the augmentation-delta Gram. Same thing.

PGD adversarial training: averaging over adversarial deltas δ* at radius ε gives an expected loss whose first non-trivial Jacobian term is:

    (ε²/2) E_x[Tr(J_φᵀ H_φ J_φ Σ_PGD)]

    where Σ_PGD = Cov(δ*)

Three methods.
Three linearisations. One matrix.

The table

[Table not preserved.]

The theorem that cannot be argued with

Knowing these methods estimate the same matrix is interesting. What the paper actually proves is what happens when you get it wrong.

Theorem G (proved unconditionally, no extra assumptions): No quadratic Jacobian penalty (not CORAL, not PGD-AT, not augmentation) can zero deployment drift without covering the full range of Σ_task.

If your penalty matrix misses even one direction where deployment varies, the encoder exploits that unpenalised gap. It learns to amplify variations along the blind spot to minimise training loss. The resulting drift floor is:

    Range mismatch: Θ(1). Permanent, structural, independent of λ, data size, or model scale.
    Allocation mismatch within correct range: Θ(λ⁻³). Vanishes as λ → ∞.
    Matched global minimum: O(λ⁻²) → 0.

The proof is three lines. If range(A) doesn't cover range(Σ_task), then the null space of A is not contained in the null space of Σ_task, so you can pick a unit vector q in the gap with Aq = 0 and Σ_task q ≠ 0. Because Aq = 0, (I + 2λA)⁻¹q = q for all λ. Therefore D̃_Q = qᵀ Σ_task q > 0 forever, regardless of regularisation strength.

You cannot train your way out of a geometric blind spot. More data doesn't help. Larger models don't help. Higher λ doesn't help. The gap is structural.

The loss function

Once you know what you're estimating, the training procedure becomes a formula. The paper calls it the PMH loss:

    ℒ_pmh(θ) = ℒ_task(θ) + λ · E_x[Tr(J_φ(x)ᵀ J_φ(x) Σ̂_task)]

In practice: estimate Σ_task from data, add one trace penalty term, and cap the penalty at cap × task loss, so that at steady state it makes up a fraction cap/(1+cap) of the total loss. This fixes λ automatically.
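The three-line blind-spot argument for Theorem G above can be checked numerically on a toy example. This is a minimal sketch, assuming NumPy; the 3×3 matrices `A` and `Sigma_task`, the vectors `q` and `p`, and the helper `shrink` are hypothetical illustrations, not objects from the paper. It also assumes that the quadratic form vᵀ Σ_task v, evaluated at the shrunk vector, is a reasonable stand-in for the drift term the proof refers to.

```python
import numpy as np

# Toy setup (hypothetical, for illustration only): three input directions.
# Deployment varies along e0 and e1; the penalty matrix A only covers e0.
Sigma_task = np.diag([1.0, 0.5, 0.0])
A = np.diag([1.0, 0.0, 0.0])

# q sits in the gap: A q = 0 but Sigma_task q != 0.
q = np.array([0.0, 1.0, 0.0])
# p is a covered direction: A p != 0.
p = np.array([1.0, 0.0, 0.0])


def shrink(lam, v):
    """Apply (I + 2*lam*A)^{-1} to v, the operator quoted in the proof."""
    return np.linalg.solve(np.eye(3) + 2.0 * lam * A, v)


def drift(lam, v):
    """Assumed stand-in for the drift: quadratic form at the shrunk vector."""
    s = shrink(lam, v)
    return float(s @ Sigma_task @ s)


lams = [1.0, 1e2, 1e4, 1e6]
blind = [drift(lam, q) for lam in lams]    # stays at q^T Sigma_task q
covered = [drift(lam, p) for lam in lams]  # shrinks as lam grows

for lam, b, c in zip(lams, blind, covered):
    print(f"lam={lam:>9.0e}  blind-spot drift={b:.3f}  covered drift={c:.3e}")
```

Along the covered direction the value falls as 1/(1+2λ)², while along the blind spot it does not move, which is the sense in which raising λ cannot help.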
The same 12 lines of PyTorch run across every modality; only the matrix changes:

    import torch

    def pmh_penalty(encoder, x, Sigma, n_probes=4):
        # Cholesky factor of Sigma (with jitter) so that z @ L.T ~ N(0, Sigma)
        eye = torch.eye(x.shape[-1], device=x.device, dtype=x.dtype)
        L = torch.linalg.cholesky(Sigma + 1e-6 * eye)
        phi0 = encoder(x)
        acc = 0.0
        for _ in range(n_probes):
            # Finite-difference proxy: E‖φ(x+δ) − φ(x)‖² ≈ Tr(J_φᵀ J_φ Σ) for small δ
            acc += (encoder(x + torch.randn_like(x) @ L.T) - phi0).pow(2).sum(-1).mean()
        return acc / n_probes

    # U: random rank-r basis (wrong-W control); s: signal direction vector
    loss = task_loss + lam * pmh_penalty(encoder, x, Sigma_hat)                 # matched
    ctrl_wrong = lam * pmh_penalty(encoder, x, U @ U.T)                         # should ≈ isotropic
    ctrl_signal = lam * pmh_penalty(encoder, x, torch.outer(s, s) / s.dot(s))   # should hurt

Those last two lines are not optional. A matched-arm result without both controls is uninformative.

Three predictions made before experiments ran

The paper pre-registers three quantitative checks in the theory section before any experiments run. Each specifies not just what matched PMH should do, but what the controls should do.

Check 1 (Lemma C): A random rank-r penalty matrix (wrong-W) equals isotropic PMH at scale r/d_x in expectation, by the Haar measure on the Stiefel manifold. Predicted D_N/D_S gap between wrong-W and isotropic: ≤ 5%.
Observed (T7B CIFAR ViT): 2.98 vs 3.11 → 4.2% gap. Within concentration bound.

Check 2 (Corollary E★): Penalising along the signal direction (keyword-PMH in code clone detection) must hurt below baseline. The proof gives an Ω(ρ²) penalty on task risk.
Observed (T5B BigCloneBench): rename_bacc_ratio 0.830 → 0.738. Below baseline by 9.2pp.

Check 3 (Corollary 3.4): PGD-AT should win robustness but exit the clean-accuracy Pareto frontier. Adversarial deltas don't implement isotropic Jacobian shrinkage: trajectory TDI can worsen even as ‖J‖_F drops.
Observed (T7B): PGD-AT 44.8% robust / 64.6% clean vs baseline 79.4% clean (−14.8pp). TDI 1.506 vs matched 0.870.

The subspace staircase

Block T7B (CIFAR-10, ViT-Small) is the cleanest direct test of the theory.
Across the three PMH arms, adversarial robustness increases monotonically as Ŵ quality improves (random → gradient-SVD → matched), although all three remain below the no-PMH baseline's 26.3% PGD@4 accuracy:

    Estimator quality           PGD@4 acc   TDI     D_N/D_S
    ─────────────────────────────────────────────────────────
    No PMH (baseline)           26.3%       1.09    1.19
    Random Ŵ (wrong-W)          11.1%       1.00    2.98   ← collapses
    Gradient-SVD estimate       15.6%       0.870   0.50
    PGD-delta Gram (matched)    21.1%       0.870   0.19
    ─────────────────────────────────────────────────────────
    PGD-AT (dissociation)       44.8%       1.506   2.48   ← off-Pareto
      clean accuracy: 64.6% (vs baseline 79.4%, −14.8pp)

Better matrix estimate → better geometry → better deployment performance. Every step ordered. No exceptions.

Note that wrong-W collapses robustness below baseline. Random penalty directions don't just fail to help; they actively disrupt the encoder. This is Theorem B part (i): range mismatch costs Θ(1), and a random subspace almost surely misses the adversarial directions.

The result that proves the theory: a predicted failure

On Office-31 (Amazon → DSLR), matched PMH loses to CORAL. CORAL 25.2%, matched PMH 23.3%.

This is the strongest evidence in the paper.

Before running the experiment, the eigengap pre-flight computed γ_r ≈ 1.03 at rank 32 on the 200-sample target pool. The framework predicted: at this eigengap, the subspace estimator Ŵ is unreliable, and CORAL's moment alignment, which doesn't require subspace identification, should win. The reason is Davis-Kahan: ‖Π_Ŵ − Π_W‖_F ≲ 2‖Ĉ − C‖_op / γ, where γ is the absolute gap between the r-th and (r+1)-th eigenvalues. That gap goes to 0 as the ratio γ_r approaches 1, so the bound blows up. The prediction was correct in every detail.

A framework that accurately predicts its own failures from first principles is doing something qualitatively different from one that only explains its successes. The Office-31 result is a predicted consequence of a named mathematical condition, not a surprise to be explained away.

Thirteen blocks. One formula. Five modalities.

Same 12 lines of code, same penalty template, same falsification controls: 12 of 13 pass. The one failure (Office-31) was named and predicted before experiments ran.
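The eigengap pre-flight mentioned for Office-31 can be sketched in a few lines. This is a hedged illustration, assuming NumPy; it is not the paper's code. The function name `eigengap_preflight`, the helper `toy_deltas`, and the toy spectra are hypothetical. The ratio γ_r = λ_r / λ_{r+1} and the 1.2 threshold are the ones quoted in the post's recipe, and the covariance estimate here is assumed to be the plain Gram of paired deployment deltas.

```python
import numpy as np


def eigengap_preflight(deltas, rank, threshold=1.2):
    """Return (gamma_r, ok) for an (n, d) array of paired deployment deltas.

    gamma_r = lambda_r / lambda_{r+1} of the delta Gram matrix, with
    eigenvalues sorted in descending order. ok is False when the ratio is
    below the threshold, i.e. the rank-r subspace is poorly separated.
    """
    deltas = np.asarray(deltas, dtype=float)
    n, d = deltas.shape
    if not 1 <= rank < d:
        raise ValueError("rank must satisfy 1 <= rank < d")
    sigma_hat = deltas.T @ deltas / n                 # uncentred Gram estimate
    evals = np.linalg.eigvalsh(sigma_hat)[::-1]       # descending order
    gamma_r = float(evals[rank - 1] / max(evals[rank], 1e-12))
    return gamma_r, gamma_r >= threshold


def toy_deltas(evals):
    """Hypothetical helper: deltas whose Gram matrix is exactly diag(evals)."""
    evals = np.asarray(evals, dtype=float)
    return np.sqrt(len(evals)) * np.diag(np.sqrt(evals))


# Well-separated spectrum: the rank-2 subspace is easy to identify.
gamma_clear, ok_clear = eigengap_preflight(toy_deltas([9.0, 4.0, 0.1, 0.1]), rank=2)

# Marginal spectrum with ratio 1.03, mimicking the Office-31 reading.
gamma_marginal, ok_marginal = eigengap_preflight(toy_deltas([2.06, 2.0, 0.1, 0.1]), rank=1)

print(gamma_clear, ok_clear)
print(gamma_marginal, ok_marginal)
```

When `ok` comes back false, the post's advice is to fall back to isotropic PMH rather than trust the estimated subspace.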
The alignment result

This is the application most people will miss because it doesn't look like a robustness paper.

Standard DPO preference fine-tuning raises Style TDI by 30%, from 1.851 to 2.408. The model's hidden-state geometry becomes more sensitive to style variations during training. The reward model cannot reliably distinguish "this response is correct" from "this response matches the style the user implied they want." The model learns to game style. This is sycophancy, geometrically.

One extra trace penalty term, with Σ̂_style estimated from 96 prompts × 6 style rewrites:

    Style TDI:
      Pre-DPO baseline:        1.851
      Standard DPO:            2.408  (+30%, geometry degrades)
      Matched style-PMH DPO:   1.836  (−0.8%, geometry preserved)
      Isotropic PMH:           2.045

    Sycophancy rate (TruthfulQA, n=500):
      Baseline:                38.5%
      Matched PMH RM:          13.5%

    Content/style ratio: 2.6× → 3.1× (matched arm)

The same formula used for ImageNet corruption robustness and accent-robust speech recognition preserves style-content geometric separation during preference fine-tuning. The method doesn't know it's doing alignment. It's doing geometry.

What the paper cannot prove

Theorem A★ proves that at the global minimum of the PMH loss, range matching drives drift to zero. Whether gradient descent actually reaches that global minimum, assumption (O), is open.

Every empirical result is consistent with the theory. None of them constitutes a proof at the optimisation level. This is stated explicitly in the paper, not buried. The 13 blocks are observational synthesis, not a joint inference theorem.

The open problem

The framework names eight open problems explicitly (Table 9). The central one:

(O) Optimisation reachability: Theorem A★ is a global-minimum statement. Whether SGD reaches it (in non-convex landscapes, at scale, across architectures) is the central unresolved question the framework inherits from all of deep learning.

This is not a buried limitation. It is the open problem that shapes the next set of papers.
The practical recipe

Five steps. Runs on any architecture. Same code across all 13 blocks:

1. Identify the nuisance family. Which A_k describes your deployment shift? Isotropic noise → σ̂²I. Domain shift → cross-domain Gram. Augmentation modes → aug-delta Gram. Style/adversarial → style-pair or PGD-delta Gram.
2. Run the eigengap pre-flight. Compute γ_r = λ_r / λ_{r+1} on held-out deployment pairs. If γ_r < 1.2, expect Office-31-type failure. Fall back to isotropic PMH.
3. Add the trace penalty. loss = task_loss + lam * pmh_penalty(encoder, x, Sigma_hat)
4. Cap it. pmh_loss ≤ cap * task_loss gives steady-state fraction cap/(1+cap). No λ tuning required.
5. Run both controls. Wrong-W should ≈ isotropic. Signal-W should hurt below baseline. A positive result without both controls is uninformative.

What this means

If this holds up (and 13 blocks across 5 modalities with 3 pre-specified falsification checks and 1 accurately predicted failure is meaningful evidence), then:

Robustness stops being a collection of engineering tricks and becomes an estimation problem. Identify which assumption describes your deployment nuisance. Estimate Σ_task. Check the eigengap. Add one term. Run two controls.

Methods stop being independent and become estimators of the same object with different assumptions and named failure modes. Subspace-based matched estimators fail when the eigengap is marginal (the Office-31 case, where CORAL's moment alignment won instead). Augmentation fails when corruptions leave the augmentation family. PGD-AT fails when the decoder Hessian distorts the allocation. These are not empirical discoveries. They are consequences of one necessity theorem, predicted in advance.

The loss function stops being background infrastructure and becomes the primary design variable. One PSD matrix per nuisance type. Closed-form optimum. Two falsification controls fixed before training.
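The "Cap it" step of the recipe can be sketched as a small helper. This is one possible reading of pmh_loss ≤ cap * task_loss, not necessarily how the paper's package implements it: the function name `effective_lambda`, its `eps` guard, and the example numbers are all assumptions made for illustration. The idea is to shrink the penalty weight whenever the weighted penalty would exceed cap × task loss.

```python
def effective_lambda(task_loss, penalty, lam, cap, eps=1e-12):
    """Return a penalty weight such that weight * penalty <= cap * task_loss.

    task_loss and penalty are plain floats here (hypothetical interface).
    When the cap is not binding, the requested lam is returned unchanged.
    """
    return min(lam, cap * task_loss / max(penalty, eps))


# Cap binding: the raw weighted penalty (1.0 * 10.0) far exceeds 0.5 * 2.0.
task_loss, penalty, lam, cap = 2.0, 10.0, 1.0, 0.5
w = effective_lambda(task_loss, penalty, lam, cap)
pmh_term = w * penalty
total = task_loss + pmh_term
fraction = pmh_term / total          # equals cap / (1 + cap) when the cap binds

# Cap not binding: a small penalty keeps the requested weight.
w_small = effective_lambda(task_loss, 0.1, lam, cap)

print(w, pmh_term, fraction, w_small)
```

In a PyTorch training loop one would presumably compute this weight from detached scalar values, so that it acts as a constant multiplier and gradients still flow through the penalty itself; that detail is an assumption, not something the post specifies.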
Links

Paper: "The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning" (search arXiv for "geometric theory loss functions nuisance robust")

Code: pip install matching-pmh · https://github.com/vishalstark512/matching-pmh

Happy to go deep on any specific block, the proof of Theorem G, the alignment geometry, or the estimator selection problem in the comments.

submitted by /u/Difficult-Race-1188

Originally posted by u/Difficult-Race-1188 on r/ArtificialInteligence

We spent a decade inventing "new" robustness methods. They're all computing the same matrix. Here's the proof.

The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning