Original Reddit post

# V5 update: we found the math bugs, fixed them, and a 28M model now beats V4's 178M

If you have not read my previous post, this one may be a bit unclear. Before commenting, please read my previous post with the code, implementation, and findings: Original Post Link Here.

The short version from the old post: I built a 178M-param language model where every token is a complex number (magnitude + phase), there are no attention layers or FFN blocks, and language processing happens through wave-like interference between specialized "phase banks." The backbone is an oscillatory SSM with Cayley-transform rotations (no trig in the hot path), and context modifies meaning via phase rotation. It trained on TinyStories and showed real learning, but as this post explains, the math had serious problems.

That post got useful attention, but after a deeper review I found something important: V4 was mathematically inconsistent, yet it was still learning well. It used complex-valued representations, but several core nonlinearities were real-valued in a way that destroyed phase information. So V4 paid the cost of complex numbers without preserving the thing that was supposed to make them useful.

V5 is the cleanup. It is much smaller, the math is more honest, and the results are already materially better. It is live in the open-source repo now.

Open source: https://github.com/gowrav-vishwakarma/qllm2

## What was broken in V4

The main issue was simple: V4 created complex states, then applied real-valued activations and gates to them, which threw away or corrupted phase information.

Examples from the old design:

```python
# GELU on only the real part
F.gelu(h[..., 0]).unsqueeze(-1) * h

# Real sigmoid gate on complex-derived features
torch.sigmoid(self.gate_proj(gate_input))
```
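A quick way to see the damage: any operation that respects phase should commute with a global phase rotation of its input, and the V4-style gate does not. Here is a minimal, hypothetical repro (not the actual V4 module) that assumes, as the `h[..., 0]` indexing suggests, that the trailing dimension holds a (real, imag) pair:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
h = torch.randn(4, 8, 2)  # hypothetical complex state: last dim is (real, imag)

def v4_act(h):
    # V4-style nonlinearity: GELU of the real part scales the whole state.
    return F.gelu(h[..., 0]).unsqueeze(-1) * h

def rotate(h, phi):
    # Multiply the complex state by e^{i*phi} (a pure phase rotation).
    c = torch.complex(h[..., 0], h[..., 1]) * torch.exp(torch.tensor(phi * 1j))
    return torch.stack([c.real, c.imag], dim=-1)

phi = 1.0
a = v4_act(rotate(h, phi))   # rotate first, then activate
b = rotate(v4_act(h), phi)   # activate first, then rotate
print(torch.allclose(a, b))  # False: the activation is not phase-equivariant
```

If the activation only touched magnitudes, the two orderings would agree; the mismatch is exactly the phase information being corrupted.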

If phase is supposed to carry relational structure, this is a fatal mistake. The network keeps converting complex structure into a mostly real computation.

So the revised diagnosis is: V4 did not fail because complex numbers are bad for language. It failed because it used complex numbers badly.

## What V5 changes

V5 is a ground-up redesign around one rule: if a representation is complex, the network should preserve that algebraic structure all the way through.

Architecture at a high level:

```
Tokens -> ComplexEmbed -> [Bank + ComplexSSM + optional PhaseAttention] x N -> LM head
```

The important conceptual shift is that V5 is not "wave metaphor first, math later." The main changes are:

- complex linear maps
- phase-preserving activations
- complex-aware gating
- controlled interference between banks
- a cleaner SSM/attention hybrid

## Where this sits relative to transformers and Mamba

I do not think V5 should be described as "just another transformer" or "just standard Mamba with complex numbers." It is closer to an SSM-centered hybrid:

- the main sequence backbone is a ComplexSSM, not full attention
- attention is used only sparsely
- the representation path is complex-valued end to end
- banks are fused through learned phase rotations and interference

At the same time, I also do not want to pretend it is a pure end-to-end "wave machine." Some control logic is still conventional and real-valued. For example:

- the bank router currently uses real magnitude features + GELU + softmax
- the SSM selectivity path uses a real projection to compute dt

So the most honest description is: V5 is wave-dominant in its signal path, but hybrid in its control path.

So no, adding a few real-valued controller pieces does not make V5 a standard transformer. The core computation is still materially different. I also see this version as a controlled engineering compromise, not the final form of the idea.
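To make the "phase-preserving activations" point concrete, here is one standard option from the complex-valued network literature, modReLU, which applies the nonlinearity to the magnitude and leaves the phase untouched. This is an illustrative sketch, not necessarily the exact activation V5 uses:

```python
import torch
import torch.nn as nn

class ModReLU(nn.Module):
    """modReLU: ReLU on |z| with a learned bias; the phase of z is preserved."""
    def __init__(self, features: int):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(features))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        mag = torch.abs(z)                 # |z|
        phase = z / (mag + 1e-8)           # unit-modulus direction e^{i*arg(z)}
        return torch.relu(mag + self.bias) * phase

z = torch.randn(2, 5, dtype=torch.cfloat)
act = ModReLU(5)
out = act(z)
# Wherever the output is nonzero, its phase equals the input's phase.
mask = out.abs() > 0
print(torch.allclose(out[mask].angle(), z[mask].angle(), atol=1e-4))  # True
```

Because the nonlinearity acts only on magnitudes, it commutes with global phase rotations, which is exactly the property the V4 gates violated.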
The mathematics I actually want is more phase-native than what current hardware and kernel stacks make convenient today. Right now, some controller paths stay real-valued because modern GPUs are exceptionally good at dense real GEMMs, softmax, and standard fused primitives, and I want to push the core hypothesis under realistic training constraints instead of waiting for a perfect systems stack.

But I do not think this is where the architecture should stop. The more ambitious direction is to make routing, selectivity, and interference themselves more natively algebraic: fewer "convert to real, do the control step, convert back" bridges, more direct complex-valued control laws, better phase-aware kernels, and eventually custom fused kernels for the operations that are currently the bottleneck. That is the path I am already thinking about, and some of the next work is explicitly a systems problem, not just a modeling problem.

So in that sense V5 is both a real model and a stepping stone: mathematically closer to the system I actually want, but still shaped by what current hardware can do efficiently. If better kernels (which I am also actively working on) and better tooling make the more phase-native version practical, I expect to pivot again rather than freeze the design here.

## Initialization mattered way more than I expected

While testing V5, I ran a benchmark over 20 initialization strategies for complex-valued layers. This turned out to matter a lot.

Best strategies (1k samples, 5 epochs, 3 seeds):

Orthogonal init was about 2x better than random in this benchmark. Then I ran a longer A/B test:

Orthogonal vs random (5k samples, 10 epochs, 3 seeds):

So orthogonal was still 31% better at epoch 10, not just an early-training trick.

I also removed 8 clearly broken strategies after testing. Spirals and several quasi-random geometric constructions were consistently much worse than random, and some produced NaNs.

## Training results

### 1. Random-init V5, 100k TinyStories samples

- Model: small-matched
- Params: 28.7M
- Setup: 10 epochs, random init, A6000

This was already much smaller than V4 and far more stable.
### 2. Orthogonal-init V5, same 100k-sample run

Same model, same data size, same 10 epochs, but with orthogonal init (seed=42). Comparison against the earlier random-init run:

That is the first result that made me think: okay, this is no longer just "interesting idea, weak numbers."

Important caveat:

- the random-init 100k run was on an A6000
- the orthogonal 100k run was on an RTX 4090

So the throughput numbers are not apples-to-apples across those runs. The quality comparison is still valid because the model, data, and training schedule are the same, but speed comparisons should not be overinterpreted.

Sample generation from the orthogonal 100k run:

Prompt: The quick brown

This sample is obviously still small-model / TinyStories quality, but it is much cleaner than the earlier V4 generations.

### Full-dataset run: epoch 3 complete

After the 100k-sample runs, I switched to the full TinyStories train split.

Current run:

- model: same 28.7M small-matched V5
- init: orthogonal (seed=42)
- data: full TinyStories train split
- samples tokenized: 2,119,489
- tokens: 473,992,006
- batches/epoch: 103,744 (~7.2h/epoch on RTX 4090)

Full training log (up to epoch 3): v5_train_small-matched.log

Training curves (loss, PPL, LR schedule, throughput, wall time): https://preview.redd.it/4egaq4elqgng1.png?width=1440&format=png&auto=webp&s=c7cf7a07ac1410db98faab66ce20748e9ee2955f

Finished so far (epoch 4 now in progress):

What matters most here:

- on the full dataset, epoch 1 already beats the 100k-sample run's epoch-10 result (6.27 vs 8.00)
- by epoch 3, val PPL is 5.59, 30% better than the best 100k result
- the curve is still dropping steadily with no sign of plateauing
- the train/val gap at epoch 3 is only ~0.38, so overfitting is not the limiting factor

Qualitatively, the generations are improving each epoch.

Prompt: The quick brown

Epoch 1:

Epoch 2:

Epoch 3:

Still 7 epochs to go. I will post the final numbers when it completes.
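Since the orthogonal init is doing so much work in these runs, here is a hedged sketch of what an "orthogonal" init can mean for a complex-valued layer; the exact strategy in the qllm2 repo may differ, and `unitary_init_` is a hypothetical helper. The standard construction draws a Haar-random unitary by QR-decomposing a complex Gaussian matrix:

```python
import torch

def unitary_init_(weight: torch.Tensor) -> torch.Tensor:
    """Fill a square complex weight matrix with a Haar-random unitary.

    Hypothetical helper (not from the qllm2 repo): QR-decompose a complex
    Gaussian matrix, keep Q, and phase-correct by diag(R) so the draw is
    uniform over the unitary group.
    """
    n = weight.shape[0]
    a = torch.randn(n, n, dtype=weight.dtype)
    q, r = torch.linalg.qr(a)
    d = r.diagonal()
    q = q * (d / d.abs()).conj()  # absorb the arbitrary column phases
    with torch.no_grad():
        weight.copy_(q)
    return weight

w = torch.empty(8, 8, dtype=torch.cfloat)
unitary_init_(w)
# Unitary check: W^H W = I, so the map preserves norms at initialization.
print(torch.allclose(w.conj().T @ w, torch.eye(8, dtype=torch.cfloat), atol=1e-5))
```

A unitary start means the layer neither amplifies nor attenuates any direction at step zero, which plausibly explains why it stabilizes early training in a phase-sensitive model.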
(Or connect with me: https://www.linkedin.com/in/gowravvishwakarma/)

This is the first run where I feel comfortable saying V5 has moved from "interesting architecture experiment" to "actually promising."

## What I think I learned

Three takeaways so far:

1. The math details matter more than the concept pitch. "Complex numbers for language" is not enough. If your nonlinearities and routing destroy phase, the idea collapses.
2. Initialization is not a minor detail in complex-valued models. In this setup it changed results dramatically.
3. Smaller but mathematically cleaner beat bigger and sloppier. V5 at 28.7M is already doing better than the much larger V4 design I posted before.

## Honest limitations

This is still early and I do not want to oversell it:

- I have not yet run a strict apples-to-apples transformer baseline at the same parameter scale and same training budget
- no long-context benchmark yet
- no downstream benchmark yet
- still pure PyTorch, no custom kernels
- scaling behavior beyond this size is still unknown

So I am not claiming "complex numbers beat transformers."

I also want to be clear that my goal is not just to beat current LLMs on next-token prediction or build a slightly better chatbot. Language modeling is the training interface I am using right now because it is measurable and gives fast feedback, but the deeper objective is to explore whether more structured phase-aware / algebraic representations can capture subtler relational structure, nuance, and latent organization in data than today's standard architectures. In that sense, V5 is a stepping stone, not the endpoint. If this line of work also improves generation, that is valuable, but generation itself is not the full reason I am pursuing it.

What I am claiming is narrower: a mathematically consistent complex-valued LM seems substantially better than my earlier inconsistent version, and the current training results are strong enough to justify taking the idea seriously.
## What happens next

- finish the full-dataset run
- run an apples-to-apples baseline
- continue ablations on bank design and routing
- scale up the model
- write a cleaner V5 paper draft

If people are interested, I can post the final full-dataset numbers when the run completes. I would especially value feedback on:

- whether the diagnosis of V4 makes sense
- whether the V5 changes are the right fixes
- what the fairest baseline would be for comparison
- whether this is worth pushing into a paper / benchmark-heavy evaluation phase

Also: I am planning to write this up properly and submit a V5 paper to arXiv once the results stabilize. If anyone here is in a position to help with arXiv endorsement and is open to it, I would really appreciate it if you DM me.

One more thing: V5 is not the final form of this idea. The longer-term direction I am working toward is substantially different, possibly V11 or V12 before it gets there. Now that text representations already live in a complex phase/latent space, the natural next step is to explore diffusion over that space before moving toward something more genuinely quantum-inspired rather than the current algebraic framework. So if V5 looks like "just" an SSM with complex numbers, that is because the architecture is still early in a much larger arc.

If you have read this far and think this work should stay open source, please star the repo and watch for updates. Share this post if you know people who might care. If you know other subreddits or communities where this would resonate, sharing it there would help connect with more like-minded people.

I am also looking to connect with people who can invest in these ideas, not only with funding (which matters) but with actual work on the project too. If that describes you or someone you know, reach out.

Originally posted by u/ExtremeKangaroo5437 on r/ArtificialInteligence