Original Reddit post

TL;DR: Removing the right layers (instead of shrinking all layers) makes transformer models ~8–12% smaller with only ~6–8% quality loss, and this now works across architectures (GPT-2 + TinyLlama) with near-zero variance.

I've been experimenting with depth-first pruning — removing entire layers based on sensitivity rather than shrinking model width. Started on GPT-2… Just validated it on TinyLlama 1.1B with full 3-seed replication.

🧠 Results (TinyLlama 1.1B)

Depth-First Pruning (3 seeds)

| Config                | Layers | Reduction | Test PPL    | Ratio |
|-----------------------|--------|-----------|-------------|-------|
| Baseline (22L)        | 22     | 0%        | 9.19        | 1.000 |
| 20L (remove L4 + L11) | 20     | 8.0%      | 9.72 ± 0.01 | 1.057 |
| 19L (staged pruning)  | 19     | 12.0%     | 9.94 ± 0.01 | 1.081 |

⚡ What's interesting

- Extremely stable → ±0.01 PPL across seeds
- Transfers across GPT-2 and Llama-family models
- Keeps quality within ~6–8% while reducing size
- Produces real inference speedups, not just parameter savings

🧠 Key insight

Not all transformer layers matter equally. Removing the least important layers:

- preserves useful structure
- avoids degrading all layers
- beats uniform width pruning

🔥 Takeaway

👉 Structure > uniform scaling

Instead of: "make every layer smaller"

Do: 👉 "remove the layers that matter least"

⚠️ Notes

- Not a new architecture
- Not claiming SOTA
- Just a clean, reproducible efficiency method

🧠 Bigger picture

This is part of a broader direction I'm exploring:

- Seed → architecture discovery (finds efficient models)
- Magnus → memory-first reasoning system

Goal: 👉 smaller, structured systems instead of bigger models

Curious what people think, especially if you've tried similar pruning approaches — what were your results?
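The sensitivity-based selection described above can be sketched in a few lines. This is a minimal, self-contained illustration, not the author's actual code: the `fake_ppl` function and the per-layer `importance` values are made up stand-ins for a real perplexity evaluation on held-out data.

```python
def layer_sensitivity(layers, evaluate_ppl):
    """PPL increase from ablating each layer individually."""
    base = evaluate_ppl(layers)
    return {
        i: evaluate_ppl(layers[:i] + layers[i + 1:]) - base
        for i in range(len(layers))
    }

def prune_least_important(layers, evaluate_ppl, n_remove):
    """Drop the n_remove layers whose individual ablation hurts PPL least."""
    scores = layer_sensitivity(layers, evaluate_ppl)
    drop = sorted(sorted(scores, key=scores.get)[:n_remove])
    kept = [layer for i, layer in enumerate(layers) if i not in drop]
    return kept, drop

# --- Toy demo (fabricated numbers, for illustration only) ---
# Made-up "importance" of each layer in a 6-layer toy model.
importance = [0.9, 0.8, 0.05, 0.7, 0.02, 0.6]
layers = list(range(6))  # stand-ins for actual layer modules

def fake_ppl(kept_layers):
    # Pretend PPL grows with the total importance of whatever was removed.
    removed = set(range(6)) - set(kept_layers)
    return 9.19 + sum(importance[i] for i in removed)

kept, dropped = prune_least_important(layers, fake_ppl, n_remove=2)
# dropped → [2, 4], the two least sensitive layers; kept → [0, 1, 3, 5]
```

The "staged" 19L variant presumably re-measures sensitivity after each removal instead of ranking once, which accounts for interactions between pruned layers; the one-shot ranking above is the simpler baseline.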

Originally posted by u/califalcon on r/ArtificialInteligence