Original Reddit post

Hey guys, I was playing around with Nanotron recently and got super frustrated by how many heavy, hardware-specific dependencies it imports at the module level ( flash-attn , triton, functorch , etc.). If you try to run it on older or budget GPUs like a T4 or V100, it just crashes on import. So I wrote Picotron ( https://github.com/Syntropy-AI-Labs/picotron ) to solve this. It’s a clean-room rewrite that gets rid of all mandatory GPU-specific dependencies. It runs on pretty much any GPU that supports PyTorch (defaults to FP16 on older cards under compute capability 8.0, and BF16 on newer ones). It falls back to standard PyTorch SDPA by default, but still hooks into FlashAttention-2 at runtime if it detects you have it installed. I used an AI assistant to write a lot of the boilerplate/code modules, but I’ve got it working locally and just trained a tiny 2M model onFineWeb-Edu. Also added configs for: • GQA / MLA (Multi-head Latent Attention) • QK-Norm & logit soft-capping (Gemma 2 style) • Parallel FFN/Attn runs • ZeRO-1 wrapping on DDP Roadmap is pretty short right now: MoE prep (routing capacity factors and load balancing loss) Making dataset prep easier than streaming manually Check it out if you’ve been fighting with CUDA dependency hell: https://github.com/Syntropy-AI-Labs/picotron submitted by /u/Capital_Savings_9942

Originally posted by u/Capital_Savings_9942 on r/ArtificialInteligence