Hi everyone!! I really wanted to share the research I've been working on. I've posted about this on LocalLLaMA and some other subs.

I wanted to build a NN that can simulate games, or at least start doing that. Most video generators are too large to run on consumer hardware in realtime, so I designed a model that does this from scratch. No fine-tuning BS or anything. The only pretrained piece is the VAE; the core denoiser network is trained fully from scratch to support this goal, from image data to games data.

The video above is running on an RTX 5090.

The NN is a small Transformer-like model and works in a causal way, just like LLMs. That lets us KV cache all past information and run a simple autoregressive decode forward pass for every new frame we want.

In the video shared, the model is a 0.5B variant with some SIGNIFICANT ISSUES: poor motion, some weird flashes, and some context issues. It's taking the keyboard actions I give it in realtime and using them in the forward pass (no classifier-free guidance, though).

I'm training the next iteration, a 0.8B model, now (it's not going well). Btw, I haven't done quantisation yet; that can save a LOT more time. bf16 is slow.

I have a lot more cool images I'm trying. I've only just started.

PS: I deleted this in the morning because my Reddit account had some serious issues, so I'm posting it again now.
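For anyone curious what "KV cache the past, then one forward pass per new frame" looks like in code, here is a minimal sketch. It is not the model from the post: it assumes a single attention head, one token per frame, identity projections instead of learned weights, and it skips the VAE and the denoising steps entirely. The names `FrameKVCache`, `attend`, and `full_recompute`, and the choice to add the action embedding to the frame latent, are all hypothetical and only there for illustration.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class FrameKVCache:
    """Toy cache holding one key and one value per past frame."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, frame_latent, action_embedding):
        # Assumption: the keyboard action conditions the frame by simple
        # addition. The real model may inject actions differently.
        token = [f + a for f, a in zip(frame_latent, action_embedding)]
        # Assumption: identity projections stand in for learned K/V/Q weights.
        self.keys.append(token)
        self.values.append(token)
        # Causal by construction: the cache only ever contains past frames
        # plus the current one, so no explicit mask is needed.
        return attend(token, self.keys, self.values)


def full_recompute(frames, actions):
    """Reference path: rebuild every token and re-attend at every step."""
    tokens = [[f + a for f, a in zip(frame, action)]
              for frame, action in zip(frames, actions)]
    return [attend(tokens[t], tokens[:t + 1], tokens[:t + 1])
            for t in range(len(tokens))]


if __name__ == "__main__":
    frames = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
    actions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # e.g. one-hot key presses
    cache = FrameKVCache()
    for frame, action in zip(frames, actions):
        print(cache.step(frame, action))
```

The cached path gives the same outputs as recomputing attention over the whole history, but each new frame only has to compute one new key and value and append them to the cache. That is why the per-frame cost of the projections stays flat while the attention itself still grows with the number of cached frames.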
Originally posted by u/lucidml_lover on r/ArtificialInteligence
