I put together a small educational repo that implements distributed training parallelism from scratch in PyTorch: https://github.com/shreyansh26/pytorch-distributed-training-from-scratch

Instead of relying on high-level abstractions, the code writes the forward/backward logic and the collectives explicitly, so you can see each algorithm directly. The model is intentionally just repeated 2-matmul MLP blocks trained on a synthetic task, so the communication patterns are the main thing being studied.

I built this mainly for people who want to map the math of distributed training onto runnable code without digging through a large framework. It is based on Part 5 (Training) of the JAX ML Scaling book.
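To give a flavor of the kind of pattern the repo spells out, here is a minimal sketch (not the repo's actual code) of a tensor-parallel 2-matmul MLP block written directly against torch.distributed. The class name, dimensions, and the assumption that the process group is already initialized (e.g. via torchrun) are mine; the backward-pass collective for the input gradient is omitted for brevity.

```python
# Minimal sketch of a tensor-parallel 2-matmul MLP block.
# Assumes torch.distributed is already initialized (e.g. via torchrun).
import torch
import torch.distributed as dist
import torch.nn as nn


class TensorParallelMLP(nn.Module):
    """Each rank holds a shard of both weight matrices.

    W1 is split along its output (column) dimension, W2 along its
    input (row) dimension, so the forward pass needs only a single
    all-reduce of the partial outputs.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert d_hidden % world_size == 0, "hidden dim must shard evenly"
        shard = d_hidden // world_size
        # Column-parallel: each rank computes its own slice of the hidden units.
        self.w1 = nn.Linear(d_model, shard, bias=False)
        # Row-parallel: each rank consumes only its own slice of the hidden units.
        self.w2 = nn.Linear(shard, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.w1(x))      # local matmul, no communication
        partial = self.w2(h)            # local matmul yields a partial sum
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # combine shards explicitly
        # Note: a complete implementation also all-reduces the gradient of x
        # in the backward pass; that is left out of this sketch.
        return partial
```

Writing the collective call out like this, instead of hiding it behind a parallel layer class, is the point of the repo: the all-reduce is the algorithm, not an implementation detail.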
Originally posted by u/shreyansh26 on r/ArtificialInteligence
