Original Reddit post

I’m currently building smolcluster, a project focused on demystifying how distributed learning actually works under the hood, for both training and inference. It distills complex material into digestible content for anyone who wants to learn about these algorithms: FSDP, DP (data parallelism), MP (model parallelism), and PP (pipeline parallelism).

A major part of this work has been implementing these systems from scratch in Python using raw sockets, without relying on high-level frameworks, so the communication, synchronization, and scaling behavior are explicit and understandable.

A key highlight of the project is its versatility: it can run on many kinds of computing devices, including laptops, Macs (such as Mac minis), NVIDIA GPUs in laptops or workstations, and even tablets and phones. I see these as computing resources that are currently underutilized. My goal is to use them to teach people how to apply heterogeneous computing and explore distributed learning from the comfort of their homes, with the devices they already own. Ultimately, this is about making distributed learning more accessible: giving people the tools and intuition to explore these systems from their own setups, without needing access to large-scale infrastructure.

This is one of my sessions running a training run from a previous summarization project using GRPO on 3x Mac minis (2024, 16 GB each), using a synchronous parameter-server architecture, with one node doing the training and the others acting as vllm-metal workers!

PS: It’s very early and under heavy development. I’d love to hear your views and ideas, and let me know if you have any questions!
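For readers curious what "synchronous parameter server over raw sockets" looks like in practice, here is a minimal sketch in plain Python. This is not smolcluster's actual code; the helper names (`send_msg`, `parameter_server`, `worker`) and the length-prefixed pickle framing are assumptions for illustration. It shows the core synchronous loop: the server blocks until every worker has reported a gradient, averages them, applies one SGD step, and broadcasts the updated parameters back.

```python
import pickle
import socket
import struct
import threading

def send_msg(sock, obj):
    # Frame each message as a 4-byte big-endian length prefix + pickled payload.
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def _recv_exact(sock, n):
    # recv() may return fewer bytes than requested, so loop until we have n.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed")
        buf += chunk
    return buf

def recv_msg(sock):
    (length,) = struct.unpack(">I", _recv_exact(sock, 4))
    return pickle.loads(_recv_exact(sock, length))

def parameter_server(listener, n_workers, params, lr=0.1):
    # Synchronous step: accept every worker, wait for ALL gradients
    # (this is the barrier), average, update, broadcast.
    conns = [listener.accept()[0] for _ in range(n_workers)]
    grads = [recv_msg(c) for c in conns]
    avg = [sum(g) / n_workers for g in zip(*grads)]
    new_params = [p - lr * g for p, g in zip(params, avg)]
    for c in conns:
        send_msg(c, new_params)
        c.close()
    return new_params

def worker(port, grad):
    # A worker sends its local gradient and blocks for the new parameters.
    with socket.create_connection(("127.0.0.1", port)) as s:
        send_msg(s, grad)
        return recv_msg(s)

# Demo: 3 workers on localhost, starting params [1.0, 2.0].
listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # port 0 = let the OS pick a free port
listener.listen()
port = listener.getsockname()[1]

results = {}
def run(i, g):
    results[i] = worker(port, g)

threads = [threading.Thread(target=run, args=(i, g))
           for i, g in enumerate([[0.5, 2.0], [1.0, 0.5], [1.5, 0.5]])]
for t in threads:
    t.start()
updated = parameter_server(listener, 3, [1.0, 2.0], lr=0.1)
for t in threads:
    t.join()
listener.close()
print(updated)  # gradients average to [1.0, 1.0], so params step to ~[0.9, 1.9]
```

In a real multi-machine setup the workers would connect to the server's LAN address instead of loopback, and the synchronous barrier is exactly why the slowest device paces each step, which is the trade-off this architecture makes for simple, deterministic updates.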

Originally posted by u/East-Muffin-6472 on r/ArtificialInteligence