If this post gets enough traction, I’ll go back and run the full V4-Pro (1.6T params), rerun all of these experiments on it, plus run the top-upvoted experiments people request in the comments. Drop your test ideas below.
DeepSeek V4 dropped a few days ago with a novel architecture: manifold-constrained hyper-connections (mHC) replacing standard residual connections, plus a 256-expert MoE and sparse attention. The marketing claims mHC provides “stability” and “preserves expressivity.” Nobody has publicly analyzed what it does at inference yet, so I rented 8x H100s and dug in.

This is a measurement post, not a benchmark post. I captured hidden states, expert routing, and SVD structure across 7 prompts (5 short, 2 long) and looked for what’s actually happening inside.

TL;DR: V4-Flash exhibits an extreme attention sink with deterministic dimensional structure. mHC’s hyper-connection copies become functionally redundant by layer 3. The “novelty” appears to be a magnitude-channeling mechanism that funnels growth into specific BOS dimensions, leaving content tokens to behave like a normal transformer.
Setup
- 8x H100 SXM (8x80GB), tensor parallel
- DeepSeek V4-Flash (284B total, 13B active, 43 layers, 256 experts, 6 active per token, hc_mult=4)
- FP8 conversion, ~310GB on disk
- 7 prompts: 5 short (factual, code, quantum, story, math), 2 long (a Roman Empire wiki paragraph at 331 tokens, attention transformer code at 641 tokens)
I hooked Block forward outputs (shape [batch, seq, hc_mult, dim]) and Gate forward returns (routing weights and expert indices). Tilelang’s fused kernels prevented attention-pattern access: sparse_attn doesn’t materialize attention scores.
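For context, here’s a minimal sketch of what that capture setup looks like. It assumes an HF-style module tree (model.model.layers, a gate submodule on MoE blocks) and a (weights, indices) Gate return — those names are guesses to adapt from model.py, not the actual script:

```python
import torch

captured_hidden = {}   # layer_idx -> [batch, seq, hc_mult, dim] activations
captured_routing = {}  # layer_idx -> (routing weights, top-k expert indices)

def make_block_hook(layer_idx):
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        # Reduce to stats before writing to disk; the raw tensors are large.
        captured_hidden[layer_idx] = hs.detach().float().cpu()
    return hook

def make_gate_hook(layer_idx):
    def hook(module, inputs, output):
        weights, indices = output  # routing weights + top-6 expert ids (assumed tuple order)
        captured_routing[layer_idx] = (weights.detach().cpu(), indices.detach().cpu())
    return hook

handles = []
for i, layer in enumerate(model.model.layers):
    handles.append(layer.register_forward_hook(make_block_hook(i)))
    if hasattr(layer.mlp, "gate"):  # score-routed MoE layers only
        handles.append(layer.mlp.gate.register_forward_hook(make_gate_hook(i)))
```

Reducing activations to norms and routing stats before saving is what keeps the dumps to a few MB instead of hundreds of GB.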
Finding 1: Extreme attention sink with three dimensional registers
BOS token magnitudes grow 1,800x from layer 0 to layer 42 (28 → 69,632). Non-BOS tokens grow ~70x — totally normal. The growth is BOS-only. BOS-to-non-BOS magnitude ratio across the network:
- Layer 5: 79x
- Layer 20: 12x (sink shrinks)
- Layer 26: 66x (sink reactivates)
- Layer 30: 328x
- Layer 40: 896x peak
- Layer 42: 250x (final layer pulls back for output prep)

For comparison: standard attention-sink papers report ratios in the 10-100x range. V4-Flash hits ~900x.

The interesting part is where the sink lives. The BOS magnitude is dominated by specific dimensions in succession:
- Layers 4-10: dim 3279 dominates
- Layers 11-23: dim 2120 dominates
- Layers 31-42: dim 3077 dominates

Three distinct “sink registers” with brief transitions between them. Non-BOS tokens carry ~6,000x less magnitude in these dimensions than BOS does. The model has learned to use specific dimensions as scratch space for the sink, leaving the rest clean for actual content.
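The ratio and register measurements are straightforward to reproduce. A sketch, assuming hidden maps layer index to the captured [seq, hc_mult, dim] tensor (names are illustrative, not the actual capture-script API):

```python
import torch

def sink_stats(h):
    h = h.float().mean(dim=1)                   # average over hc copies -> [seq, dim]
    norms = h.norm(dim=-1)                      # per-token L2 magnitude
    bos_ratio = norms[0] / norms[1:].mean()     # BOS vs mean non-BOS magnitude
    dominant_dim = h[0].abs().argmax().item()   # which dim carries the BOS mass
    return bos_ratio.item(), dominant_dim

for layer, h in sorted(hidden.items()):
    ratio, dim = sink_stats(h)
    print(f"layer {layer:2d}  BOS/non-BOS {ratio:7.1f}x  dominant dim {dim}")
```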
Finding 2: Hyper-connection copies are functionally redundant
V4-Flash maintains 4 parallel “copies” of every token via hyper-connections (hc_mult=4). The mHC mechanism mixes them via Sinkhorn iterations at every block. Within-layer CKA between hc copies:
- Layer 0: 0.954 (some divergence)
- Layer 3: 0.9999+ (essentially identical)
- Layer 42: 0.9999+ (still identical)

The 4 copies become near-identical by layer 3 and stay that way for the rest of the network. Whatever benefit mHC provides during training, the 4-way redundancy isn’t producing genuinely different views at inference. Token-level information flow (concatenating the hc copies and treating each token as one big vector) shows concat CKA = 1.000 between every adjacent layer pair, identical to standard residual-stream behavior in models like Qwen 14B.
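The CKA numbers above are standard linear CKA (Kornblith et al. 2019). A minimal sketch, assuming h is one layer’s [seq, hc_mult, dim] capture:

```python
import torch

# Linear CKA between two activation matrices X, Y of shape [n_tokens, dim];
# 1.0 means the representations are identical up to a linear transform.
def linear_cka(X, Y):
    X = X - X.mean(dim=0, keepdim=True)   # center each feature
    Y = Y - Y.mean(dim=0, keepdim=True)
    num = (Y.T @ X).norm() ** 2           # ||Y^T X||_F^2
    den = (X.T @ X).norm() * (Y.T @ Y).norm()
    return (num / den).item()

# Compare hc copy 0 against copies 1..3 at one layer
cka_vals = [linear_cka(h[:, 0].float(), h[:, k].float()) for k in range(1, h.shape[1])]
```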
Finding 3: Effective rank stays low; sink dominates SVD
Effective rank with all positions: ~1-2 throughout the network. One direction dominates everything because the BOS sink is so large. Effective rank excluding BOS: 6-17, normal transformer behavior. So the model has normal representational capacity for content; the “rank-1 collapse” is purely the sink. This explains why naive CKA analysis (which treats all positions equally) showed apparent “disruption layers” at 25-30 and 39-40. Those weren’t structural reorganizations — they were sink-dimension transitions where the dominant direction rotated to a new axis.
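“Effective rank” has a few definitions; a common one is the exponential of the entropy of the normalized singular-value spectrum (Roy & Vetterli 2007). A sketch along those lines, assuming X is an [n_tokens, dim] matrix of hidden states at one layer:

```python
import torch

def effective_rank(X, exclude_bos=False):
    if exclude_bos:
        X = X[1:]                         # drop the BOS row before the SVD
    s = torch.linalg.svdvals(X.float())
    p = s / s.sum()                       # spectrum as a probability distribution
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy).item()
```

With the sink included, one huge singular value flattens this toward 1; dropping the BOS row restores the 6-17 range.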
Finding 4: Expert routing — no dead experts, dedicated BOS allocation
All 256 experts get used across the data, with zero dead experts. Std/mean of expert usage is 0.314 (relatively uniform). That’s much better than typical public MoE models, which often have 5-30% dead experts.

BOS routing is deterministic: across all 7 prompts, BOS at layer N routes to the exact same 6 experts every time. But, and this is the surprise, adjacent layers have near-zero expert overlap for BOS (mean Jaccard = 0.014). 156 different experts handle BOS across the 40 score-routed layers. The sink isn’t processed by a small set of dedicated “sink experts”; it’s distributed across 61% of the expert pool, with each layer getting fresh experts.

Position-dependent specialization in the long_code prompt:
- BOS: 138 unique experts, 13.8% top-10 concentration
- Content tokens (early/middle/late): 256 unique experts each, ~9% concentration

BOS gets concentrated routing. Content tokens use the full pool uniformly.
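The Jaccard numbers are plain set overlap between top-6 expert sets. A sketch, assuming routing is a per-layer list of per-position top-6 expert-id lists (position 0 = BOS; names are illustrative):

```python
# Jaccard overlap between two expert sets: |A ∩ B| / |A ∪ B|
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Adjacent-layer overlap for the BOS position
bos_overlap = [jaccard(routing[l][0], routing[l + 1][0]) for l in range(len(routing) - 1)]
print(f"mean adjacent-layer BOS Jaccard: {sum(bos_overlap) / len(bos_overlap):.3f}")
```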
Finding 5: Secondary sinks emerge at structurally-meaningful tokens
In the 641-token code prompt, high-magnitude positions beyond BOS appeared at:

- pos 26: import (keyword)
- pos 36: Attention (class name)
- pos 524: Block (class name)
- pos 593: Multi (class name prefix)
- pos 638: ) (closing paren)
- multiple parameter names and type annotations

These aren’t random tokens: class names, keywords, type annotations, structural code identifiers. The model treats them as secondary registers, smaller than BOS but elevated above ordinary content tokens. Worth noting these results are from one long prompt, so the pattern needs more data to confirm it generalizes.
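A sketch of how you might flag these secondary sinks, assuming h is a late layer’s [seq, dim] hidden state (hc copies averaged); the 5x-median cutoff is an arbitrary illustrative threshold, not the one used here:

```python
import torch

def find_secondary_sinks(h, input_ids, tokenizer, z=5.0):
    norms = h.norm(dim=-1)                              # per-token L2 magnitude
    content = norms[1:]                                 # ignore BOS itself
    cutoff = content.median() * z                       # "far above typical content"
    hits = (content > cutoff).nonzero().flatten() + 1   # shift indices back past BOS
    return [(int(p), tokenizer.decode([int(input_ids[p])])) for p in hits]
```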
Finding 6: Thinking mode vs chat mode is mostly cosmetic
I ran 4 prompts in both thinking_mode="chat" and thinking_mode="thinking". The two modes differ by exactly one token (the mode marker).
- BOS magnitudes: bit-identical between modes (causal attention isolates BOS from later tokens)
- Expert routing: 90-94% Jaccard overlap on non-BOS positions
- Last token (where the marker token actually lives): thinking mode produces 10-22% lower magnitudes by late layers

This suggests thinking mode is mostly an output-formatting difference, not a separate “reasoning circuit” at the prefill level. The model isn’t doing fundamentally different computation in thinking mode; it’s just being told to produce different output.
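The mode-overlap numbers reuse the same Jaccard helper from the Finding 4 sketch; routing_chat and routing_think are assumed per-layer lists of top-6 expert-id lists per position, captured from one run in each mode:

```python
# Per-position routing overlap between chat and thinking mode, per layer
for l, (chat, think) in enumerate(zip(routing_chat, routing_think)):
    overlaps = [jaccard(chat[p], think[p]) for p in range(1, len(chat))]  # skip BOS (bit-identical)
    print(f"layer {l:2d}: mean non-BOS Jaccard {sum(overlaps) / len(overlaps):.2f}")
```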
What this adds up to
V4-Flash at inference looks like a standard transformer with:

- a more aggressive attention sink than typical
- three dedicated dimensional sink registers used in succession
- distributed expert allocation for sink processing
- 4 hyper-connection copies that collapse to redundancy by layer 3
- token-level information flow indistinguishable from a standard residual stream
- all 256 experts utilized efficiently

The mHC mechanism doesn’t appear to produce dramatically different inference-time computation compared to standard residual connections. The “manifold constraint” empirically shows up as magnitude-channeling: runaway growth gets funneled into specific BOS dimensions, freeing content dimensions to behave normally. Whether that’s the intended novelty or a side effect, I can’t tell. mHC’s training dynamics might do something more interesting that doesn’t manifest at inference. From inference data alone, the architectural novelty is more subtle than the marketing suggests.
Caveats
- N=7 prompts, mostly short. Per-prompt variability is small but not zero.
- Inference only. Training-time behavior could be where mHC actually matters.
- V4-Flash, not V4-Pro. The Pro model (1.6T params) might behave differently at scale.
- No attention-pattern access: the sparse_attn fused kernel hides the scores. We measured the consequences (magnitude, routing), not the patterns producing them.
- No probing — no trained classifiers on hidden states. Structural analysis only.
What it cost
About $85 of cloud GPU time across two pod sessions. First pod was a failed attempt at V4-Pro that ran out of disk during conversion. Second pod ran the actual V4-Flash analysis in ~3 hours.
For anyone wanting to reproduce: V4-Flash needs roughly 1TB of volume disk on RunPod (137GB original + 310GB FP8 converted + working space). 8x H100 SXM works. Tilelang 0.1.8 has a _NestedLoopCheckVisitor bug; upgrade to the latest release. Expert-routing hooks go on the Gate module (in model.py), and block-level hooks go on the layers themselves.
Happy to share the capture/analysis scripts if anyone wants to build on this. The data files (hidden state stats, routing JSONs, SVD outputs) are about 3MB total — minimal compared to the 310GB of weights they were extracted from.
