CUDA scan kernels: hierarchical vs single-pass, decoupled lookbacks

www.reddit.com

eifachposte MB to AI (Reddit RSS) · English · 8 days ago

Original Reddit post

I wrote up a deep dive on implementing scan / prefix-sum efficiently on GPUs, with code and benchmarking.

What’s covered:

- Hierarchical scans: block-local scan → write block totals → scan totals → carry-in add
- Single-pass scans: the “domino” idea, and why naive inter-block propagation can stall / deadlock without the right coordination
- Decoupled lookbacks: how modern single-pass scans coordinate across blocks safely
- Warp-window lookback optimization: scanning lookback metadata in warp-sized chunks (and why it helps)

I also include H100 timings and compare against CUB for context.

Post: https://shreyansh26.github.io/post/2026-02-19_cuda-scan-kernels/

submitted by /u/shreyansh26
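For anyone who wants the hierarchical pipeline (block-local scan → write block totals → scan totals → carry-in add) in concrete form, here is a minimal CPU-side sketch in C++. It is not the code from the linked post and not a CUDA kernel: each loop over blocks stands in for what would be a separate kernel launch on the GPU, and the names `hierarchical_inclusive_scan`, `block_size`, `block_totals`, and `carry_in` are hypothetical, chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model of a hierarchical (multi-pass) inclusive prefix sum.
// Assumption: block_size plays the role of the tile handled by one thread block.
std::vector<long long> hierarchical_inclusive_scan(const std::vector<long long>& in,
                                                   std::size_t block_size) {
    assert(block_size > 0);
    const std::size_t n = in.size();
    const std::size_t num_blocks = (n + block_size - 1) / block_size;

    std::vector<long long> out(n);
    std::vector<long long> block_totals(num_blocks);

    // Pass 1: each block scans its own tile and writes its total.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t begin = b * block_size;
        const std::size_t end = std::min(begin + block_size, n);
        long long running = 0;
        for (std::size_t i = begin; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }
        block_totals[b] = running;
    }

    // Pass 2: exclusive scan of the block totals gives each block's carry-in.
    std::vector<long long> carry_in(num_blocks);
    long long running = 0;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        carry_in[b] = running;
        running += block_totals[b];
    }

    // Pass 3: every block adds its carry-in to its locally scanned tile.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t begin = b * block_size;
        const std::size_t end = std::min(begin + block_size, n);
        for (std::size_t i = begin; i < end; ++i) {
            out[i] += carry_in[b];
        }
    }
    return out;
}
```

Calling `hierarchical_inclusive_scan({1, 2, ..., 10}, 4)` splits the input into tiles of 4, 4, and 2 elements with totals 10, 26, and 19, so the carry-ins are 0, 10, and 36. The sketch also makes the cost visible: the output is written in pass 1 and touched again in pass 3, which is the extra memory traffic that single-pass scans are designed to avoid.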

Originally posted by u/shreyansh26 on r/ArtificialInteligence
