Original Reddit post

Deep learning in Rust: this crate is for building and experimenting with ViT-style image, video, sequence, and self-supervised transformer models. It provides typed configs, reusable model structs, runnable examples, and shape tests for research prototypes and Rust deep learning projects.

A Vision Transformer treats an image like a sequence. Normal images have this shape:

[batch, channels, height, width]

The model changes the image into this shape:

[batch, tokens, dim]

The flow is:

1. Split the image into patches.
2. Flatten each patch into one long vector.
3. Project each patch vector into dim.
4. Add position embeddings.
5. Run transformer layers.
6. Pool the tokens.
7. Predict class logits.

If you wanna learn more see here: https://github.com/iBz-04/vitch
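
To make the shape change concrete, here is a rough std-only sketch of the flow above. It is not the vitch API: every name here (`Images`, `patchify`, `linear`, `embed`, `mean_pool`, `forward`) is made up for illustration, the weights are constant placeholders, mean pooling is just one possible pooling choice, and the transformer layers are left out entirely.

```rust
// Hypothetical sketch of the ViT data flow; none of these names come from vitch.

/// Image batch stored flat in [batch, channels, height, width] order.
struct Images {
    data: Vec<f32>,
    batch: usize,
    channels: usize,
    height: usize,
    width: usize,
}

/// Steps 1-2: split into non-overlapping patch x patch squares and flatten
/// each one into a vector of length channels * patch * patch.
/// Result is indexed as [batch][tokens][patch_dim].
fn patchify(img: &Images, patch: usize) -> Vec<Vec<Vec<f32>>> {
    assert!(img.height % patch == 0 && img.width % patch == 0);
    let (rows, cols) = (img.height / patch, img.width / patch);
    let mut out = Vec::with_capacity(img.batch);
    for b in 0..img.batch {
        let mut tokens = Vec::with_capacity(rows * cols);
        for pr in 0..rows {
            for pc in 0..cols {
                let mut v = Vec::with_capacity(img.channels * patch * patch);
                for c in 0..img.channels {
                    for y in 0..patch {
                        for x in 0..patch {
                            let (h, w) = (pr * patch + y, pc * patch + x);
                            let idx = ((b * img.channels + c) * img.height + h) * img.width + w;
                            v.push(img.data[idx]);
                        }
                    }
                }
                tokens.push(v);
            }
        }
        out.push(tokens);
    }
    out
}

/// Bias-free linear layer; weight is laid out as [out_dim][in_dim].
fn linear(x: &[f32], weight: &[Vec<f32>]) -> Vec<f32> {
    weight
        .iter()
        .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>())
        .collect()
}

/// Steps 3-4: project every patch into dim and add its position embedding.
fn embed(patches: &[Vec<f32>], proj: &[Vec<f32>], pos: &[Vec<f32>]) -> Vec<Vec<f32>> {
    patches
        .iter()
        .zip(pos)
        .map(|(p, e)| linear(p, proj).iter().zip(e).map(|(a, b)| a + b).collect())
        .collect()
}

/// Step 6: average all tokens into a single vector of length dim.
fn mean_pool(tokens: &[Vec<f32>]) -> Vec<f32> {
    let mut out = vec![0.0f32; tokens[0].len()];
    for t in tokens {
        for (o, v) in out.iter_mut().zip(t) {
            *o += v;
        }
    }
    let n = tokens.len() as f32;
    out.iter_mut().for_each(|o| *o /= n);
    out
}

/// Full flow for a batch. Step 5 (transformer layers) is omitted in this sketch.
fn forward(
    img: &Images,
    patch: usize,
    proj: &[Vec<f32>],
    pos: &[Vec<f32>],
    head: &[Vec<f32>],
) -> Vec<Vec<f32>> {
    patchify(img, patch)
        .iter()
        .map(|patches| {
            let tokens = embed(patches, proj, pos);
            // A real model would run its transformer layers on `tokens` here.
            linear(&mean_pool(&tokens), head) // step 7: class logits
        })
        .collect()
}

fn main() {
    let (batch, channels, height, width, patch) = (2, 3, 4, 4, 2);
    let (dim, classes) = (8, 5);
    let img = Images { data: vec![1.0; batch * channels * height * width], batch, channels, height, width };
    let tokens = (height / patch) * (width / patch);
    let patch_dim = channels * patch * patch;

    let patches = patchify(&img, patch);
    println!("patches: [{}, {}, {}]", patches.len(), patches[0].len(), patches[0][0].len());

    // Constant placeholder weights; a trained model would learn these.
    let proj = vec![vec![0.01; patch_dim]; dim];
    let pos = vec![vec![0.0; dim]; tokens];
    let head = vec![vec![0.1; dim]; classes];
    let logits = forward(&img, patch, &proj, &pos, &head);
    println!("logits: [{}, {}]", logits.len(), logits[0].len());
}
```

With a 2x3x4x4 input and patch size 2, this prints `patches: [2, 4, 12]` and `logits: [2, 5]`, so you can see [batch, channels, height, width] turn into [batch, tokens, patch_dim] and finally one row of class logits per image.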

Originally posted by u/Ibz04 on r/ArtificialInteligence