Presentation
CUDASTF: Bridging the Gap Between CUDA and Task Parallelism
Description
Organizing computation as asynchronous tasks with data-driven dependencies is a simple and efficient model for single- and multi-GPU programs. Sequential Task Flow (STF) is one such model: it derives task graphs from data dependencies.
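To illustrate how a task graph can be derived from data dependencies, here is a minimal host-only sketch of STF-style dependency inference. It is not CUDASTF's API: the names `Mode`, `Access`, `TaskFlow`, and `build_example` are hypothetical, and the rule used (readers wait for the last writer; writers wait for all readers since the last write, or for the last writer if there are none) is one common way to realize the model, assumed here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy model of Sequential Task Flow (STF) dependency inference.
// All names are hypothetical; this is not the CUDASTF interface.
enum class Mode { Read, Write, ReadWrite };

struct Access {
    std::size_t data;  // index of the logical data item being accessed
    Mode mode;         // how the task accesses it
};

class TaskFlow {
public:
    explicit TaskFlow(std::size_t num_data) : state_(num_data) {}

    // Submit a task in program order and return its id.
    // Dependencies are inferred from the declared accesses alone.
    std::size_t submit(std::string name, const std::vector<Access>& accesses) {
        const std::size_t id = names_.size();
        names_.push_back(std::move(name));
        deps_.emplace_back();
        // Record an edge "id depends on t", never on the task itself.
        auto add_dep = [&](std::size_t t) { if (t != id) deps_[id].insert(t); };

        for (const Access& a : accesses) {
            assert(a.data < state_.size());
            DataState& s = state_[a.data];
            if (a.mode == Mode::Read) {
                // Read-after-write: wait for the most recent writer.
                if (s.has_writer) add_dep(s.last_writer);
                s.readers.push_back(id);
            } else {
                // Write or read-write: wait for every reader since the last
                // write; those readers already depend on that writer.
                // With no readers, wait for the previous writer directly.
                if (s.readers.empty()) {
                    if (s.has_writer) add_dep(s.last_writer);
                } else {
                    for (std::size_t r : s.readers) add_dep(r);
                }
                s.readers.clear();
                s.last_writer = id;
                s.has_writer = true;
            }
        }
        return id;
    }

    const std::set<std::size_t>& deps(std::size_t id) const { return deps_[id]; }
    const std::string& name(std::size_t id) const { return names_[id]; }
    std::size_t size() const { return names_.size(); }

private:
    struct DataState {
        bool has_writer = false;
        std::size_t last_writer = 0;
        std::vector<std::size_t> readers;  // readers since the last write
    };
    std::vector<DataState> state_;
    std::vector<std::string> names_;
    std::vector<std::set<std::size_t>> deps_;
};

// Five tasks over three logical data items: X = 0, Y = 1, Z = 2.
TaskFlow build_example() {
    TaskFlow flow(3);
    flow.submit("init X", {{0, Mode::Write}});                       // task 0
    flow.submit("init Y", {{1, Mode::Write}});                       // task 1
    flow.submit("Y += X", {{0, Mode::Read}, {1, Mode::ReadWrite}});  // task 2
    flow.submit("Z = X",  {{0, Mode::Read}, {2, Mode::Write}});      // task 3
    flow.submit("X *= 2", {{0, Mode::ReadWrite}});                   // task 4
    return flow;
}
```

In this sketch, tasks 2 and 3 both read X and do not depend on each other, so a runtime would be free to run them concurrently. Task 4 modifies X and therefore waits for both. The program is written as a plain sequence of submissions, and the graph falls out of the declared access modes.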
We propose CUDASTF, a C++ library that implements STF over CUDA APIs, fostering easy creation of scalable and composable algorithms. Users may easily elect to use CUDA graphs instead of streams if needed. Structured kernels spanning multiple devices can exercise fine-grained control of affinity.
Implementation-wise, CUDASTF makes a compelling argument for an event-based approach to asynchronous parallel libraries. We obtain up to a 1.8x improvement over the cusolverMg library on Cholesky decomposition. On a small weather simulation task, we demonstrate near-optimal scalability of our multi-GPU kernels; on a single GPU, CUDA graphs also improve performance by up to 30%. Finally, we were able to author the first implementation of the CKKS Fully Homomorphic Encryption scheme over multiple devices.