eifachposte

A researcher named leyten published a project called Shard this week, and the results are genuinely exciting. They split GLM-5.2 (744B parameters) across 6 RTX Pro 6000 GPUs in Nevada, Texas, Washington, Minnesota, Missouri, and Utah, connected over regular WAN with 22-75 ms latency between nodes, and achieved ~30 tokens/second. For context, the previous best attempt at this (Petals, 2022) got 1-2 tok/s on much smaller models. That is roughly a 15-30x improvement and a meaningful moment for decentralized AI.

How they did it

Three techniques combined:

Speculative decoding over WAN: a small draft model proposes K tokens, and the distributed large model verifies them all in one network round-trip. WAN latency is the scarce resource, so you amortize it over several tokens.

Ring pipelining with direct return: the final node sends results directly back to the coordinator instead of relaying them through every stage.

CUDA-graphed draft model: pre-compiling the draft model as a CUDA graph gave a 3.8-5.3x speedup.

Baseline to final:

Plain WAN decode: 1.87 tok/s
Async pipelining: 16.6 tok/s
CUDA-graphed draft: ~30 tok/s

Shard is the infrastructure powering c0mpute.ai, a network where anyone can contribute their GPU and earn USDC for running inference jobs. The network has its own token, $ZERO, which accrues value as the network grows. This result shows the foundation is real and the engineering is serious.

Every run has a published receipt with GPU UUIDs, IP addresses, latency measurements, and output hashes. Code is open source.

Repo: github.com/leyten/shard
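To make the "amortize the WAN latency" argument concrete, here is a rough back-of-envelope model in Python. It is only a sketch and is not taken from the Shard repo: the names `expected_tokens_per_round`, `Ring`, and `tokens_per_second` are made up for illustration, the hop latencies and acceptance rate are hypothetical, acceptance is assumed independent per drafted token, and the relayed return path is assumed to retrace the forward hops with symmetric latency.

```python
from dataclasses import dataclass


def expected_tokens_per_round(k: int, accept_rate: float) -> float:
    """Expected tokens emitted per verification round.

    Assumes each of the k drafted tokens is accepted independently with
    probability accept_rate, and that the large model always contributes
    one token of its own (the usual speculative-decoding accounting).
    """
    if accept_rate >= 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


@dataclass
class Ring:
    """Hypothetical pipeline: coordinator -> stage 1 -> ... -> last stage."""
    forward_hops_ms: list[float]   # one-way latency of each forward hop
    direct_return_ms: float        # one-way latency, last stage -> coordinator

    def round_trip_ms(self, direct_return: bool) -> float:
        forward = sum(self.forward_hops_ms)
        if direct_return:
            # The last stage answers the coordinator in a single hop.
            return forward + self.direct_return_ms
        # Assumption: a relayed result retraces every forward hop.
        return 2.0 * forward


def tokens_per_second(tokens_per_round: float, round_ms: float) -> float:
    """Throughput when each round costs round_ms of wall-clock time."""
    return tokens_per_round / (round_ms / 1000.0)


if __name__ == "__main__":
    # All numbers below are illustrative, not measurements from the post.
    ring = Ring(forward_hops_ms=[40.0, 30.0, 50.0, 25.0, 60.0, 35.0],
                direct_return_ms=45.0)
    compute_ms = 30.0  # assumed per-round compute (draft + verify)

    for direct in (False, True):
        round_ms = ring.round_trip_ms(direct) + compute_ms
        plain = tokens_per_second(1.0, round_ms)
        spec = tokens_per_second(expected_tokens_per_round(8, 0.8), round_ms)
        print(f"direct_return={direct}: plain {plain:.2f} tok/s, "
              f"speculative {spec:.2f} tok/s")
```

The takeaway from the model is that the round-trip cost is paid once per round whether the round yields one token or several, so a longer accepted draft and a shorter return path both raise throughput. It does not model overlapping rounds (the async pipelining step in the post), so it will not reproduce the reported numbers.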

Originally posted by u/amu4biz on r/ArtificialInteligence

Someone just ran a 744B parameter model at 30 tok/s across 6 consumer GPUs in 6 different US states over the open internet
