I wrote up a deep dive on implementing scan / prefix-sum efficiently on GPUs, with code and benchmarking.

What’s covered:

- Hierarchical scans: block-local scan → write block totals → scan totals → carry-in add
- Single-pass scans: the “domino” idea, and why naive inter-block propagation can stall / deadlock without the right coordination
- Decoupled lookbacks: how modern single-pass scans coordinate across blocks safely
- Warp-window lookback optimization: scanning lookback metadata in warp-sized chunks (and why it helps)

I also include H100 timings and compare against CUB for context.

Post: https://shreyansh26.github.io/post/2026-02-19_cuda-scan-kernels/

submitted by /u/shreyansh26
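
For anyone who wants the shape of the hierarchical scan before reading the full post, here is a minimal CPU sketch of its four phases in plain Python. It is not the post's CUDA code and says nothing about performance. The function name `hierarchical_inclusive_scan` and the `block_size` parameter are made up for illustration, and each list chunk only stands in for the work one thread block would do.

```python
from itertools import accumulate


def hierarchical_inclusive_scan(values, block_size=4):
    """Sequential reference for the hierarchical scan structure.

    Hypothetical helper for illustration only: each chunk of `block_size`
    elements plays the role of one GPU thread block.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")

    # Phase 1: block-local inclusive scan, independent per chunk.
    blocks = [
        list(accumulate(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]

    # Phase 2: write block totals (the last element of each local scan).
    totals = [block[-1] for block in blocks]

    # Phase 3: scan the totals. The exclusive form gives each block the
    # sum of everything that precedes it, i.e. its carry-in.
    carries = [0] + list(accumulate(totals))[:-1]

    # Phase 4: carry-in add, again independent per chunk.
    return [x + carry for block, carry in zip(blocks, carries) for x in block]


if __name__ == "__main__":
    print(hierarchical_inclusive_scan(list(range(1, 11)), block_size=4))
```

On a GPU, phases 1 and 4 would run in parallel across blocks, with the block totals written to global memory in between. The single-pass variants listed above aim to avoid that separate pass over the totals.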
Originally posted by u/shreyansh26 on r/ArtificialInteligence
