Presentation
Accelerating an overhead-sensitive atmospheric model on GPUs using asynchronous execution and kernel fusion
Description
Methods to mitigate kernel launch overhead, one of the drawbacks of GPUs, were implemented in an overhead-sensitive atmospheric model using OpenACC and CUDA and then evaluated. OpenACC enables kernels to run asynchronously in one or multiple GPU queues. CUDA additionally allows different loops to be combined in one kernel by branching on block indices. While the default synchronous execution on an A100 GPU lagged behind the A64FX CPU in strong scaling, single-queue asynchronous execution reduced the total model runtime by 37%, and kernel fusion of the core application component further accelerated the entire model by approximately 10%. For overhead-sensitive applications, single-queue asynchronous execution is recommended because it is easy to implement and maintain. If a small number of kernels are executed particularly frequently, it may be worth the effort to eliminate synchronizations and introduce CUDA Graphs, or to bundle kernels using CUDA.
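As a rough illustration of the single-queue asynchronous execution described above, the following is a minimal sketch in C with OpenACC directives. It is not taken from the model: the function name `step_async`, the arrays, and the three toy loops are hypothetical stand-ins for a sequence of small kernels. The sketch assumes that all kernels are placed on the same queue (here queue 1), so they run in submission order and the host only waits once at the end.

```c
#include <stddef.h>

/* Hypothetical stand-in for a chain of small, dependent model loops.
 * All three kernels go to async queue 1, so the host can enqueue them
 * back to back without waiting for each one to finish. Kernels in the
 * same queue execute in order, so the data dependencies are preserved. */
void step_async(double *restrict a, double *restrict b,
                double *restrict c, int n)
{
    #pragma acc data copy(a[0:n], b[0:n], c[0:n])
    {
        /* Kernel 1: update a in place. */
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            a[i] = a[i] + 1.0;

        /* Kernel 2: depends on the updated a. */
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            b[i] = 2.0 * a[i];

        /* Kernel 3: depends on both a and b. */
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];

        /* Single synchronization point before results are copied back. */
        #pragma acc wait(1)
    }
}
```

Built without an OpenACC-capable compiler, the directives are ignored and the loops run serially on the host with the same results, which is one reason this approach is comparatively easy to maintain. Calling `step_async(a, b, c, n)` on host arrays of length `n` is all that is needed in this sketch.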