Presentation
Autonomous Execution for Multi-GPU Systems: Compiler Support
Description
Recent trends in HPC systems increasingly emphasize accelerators, particularly GPUs, as autonomous execution units, shifting control of entire program execution to GPUs. Productive programming support for this GPU-first model, however, has not kept pace. In this work, we aim to bridge this gap with a compiler and provide a productive method for writing efficient GPU-first code. We design and develop a code generator that efficiently fuses and schedules persistent kernels, provides high-level abstractions over device resources, and enables GPU-initiated communication within Python code using NVSHMEM to realize autonomous multi-GPU execution. We compare our implementation to other accelerated Python compilers, including CuPy, DaCe, and cuNumeric, on 22 NPBench kernels. We additionally perform a scaling study of distributed 2D/3D Jacobi and observe speedups of 6.1× and 30.8× over DaCe and cuNumeric, respectively, on 8 GPUs for the 3D case, with a scaling efficiency of 98%.
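To make the execution model concrete, the following is a minimal CUDA-level sketch of the pattern described above: a persistent 1D Jacobi kernel in which each GPU (an NVSHMEM PE) iterates entirely on the device and exchanges halo values with its neighbors through GPU-initiated NVSHMEM puts, with no host involvement between iterations. This is an illustrative assumption, not the presented compiler's generated code: the kernel name, the layout of n_local interior points plus two halo cells, and the one-thread-block-per-GPU launch are choices made here for brevity. The symmetric buffers u and u_new would be allocated with nvshmem_malloc(), their halo cells initialized, and the kernel launched with nvshmemx_collective_launch() after nvshmem_init().

#include <nvshmem.h>
#include <nvshmemx.h>

// Persistent 1D Jacobi kernel: one thread block per GPU stays resident for
// all iterations, so nvshmemx_barrier_all_block() provides both intra-block
// and inter-PE synchronization. u and u_new each hold n_local interior
// points plus halo cells at indices 0 and n_local + 1.
__global__ void jacobi_persistent(float *u, float *u_new,
                                  int n_local, int iters) {
    const int pe   = nvshmem_my_pe();
    const int npes = nvshmem_n_pes();

    for (int it = 0; it < iters; ++it) {
        // Update interior points from the previous iterate.
        for (int i = threadIdx.x + 1; i <= n_local; i += blockDim.x)
            u_new[i] = 0.5f * (u[i - 1] + u[i + 1]);
        __syncthreads();

        // GPU-initiated halo exchange: one thread pushes this PE's boundary
        // values into the neighbors' halo slots on the symmetric heap.
        if (threadIdx.x == 0) {
            if (pe > 0)
                nvshmem_float_p(&u_new[n_local + 1], u_new[1], pe - 1);
            if (pe < npes - 1)
                nvshmem_float_p(&u_new[0], u_new[n_local], pe + 1);
        }

        // Barrier across all PEs at block scope: completes the puts and
        // keeps every GPU in lockstep before the buffers are swapped.
        nvshmemx_barrier_all_block();

        // Swap the roles of the two buffers for the next iteration.
        float *tmp = u; u = u_new; u_new = tmp;
    }
}

Keeping the kernel persistent removes per-iteration kernel launches and host-side synchronization from the critical path; combined with device-initiated communication, it lets the CPU step out of the loop entirely, which is the essence of the autonomous multi-GPU execution the compiler targets.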