BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T143140Z
LOCATION:B304
DTSTART;TZID=America/New_York:20241117T164500
DTEND;TZID=America/New_York:20241117T170000
UID:submissions.supercomputing.org_SC24_sess745_ws_scalah101@linklings.com
SUMMARY:Accelerating an overhead-sensitive atmospheric model on GPUs using
  asynchronous execution and kernel fusion
DESCRIPTION:Kazuya Yamazaki (The University of Tokyo, Japan)\n\nMethods to
  mitigate the kernel launch overhead, one of the drawbacks of GPUs, were
  implemented in an overhead-sensitive atmospheric model using OpenACC and
  CUDA and then evaluated. OpenACC enables kernels to run asynchronously
  in either one or multiple GPU queues. Moreover, CUDA allows different
  loops to be collocated in one kernel by branching based on block
  indices. While the default synchronous execution on an A100 GPU lagged
  behind the A64FX CPU in strong scaling, the single-queue asynchronous
  execution reduced the total model runtime by 37%, and the kernel fusion
  of the core application component further accelerated the entire model
  by approximately 10%. In overhead-sensitive applications, the
  single-queue asynchronous execution is recommended because it can be
  easily implemented and maintained. If a small number of kernels are
  executed particularly frequently, it would be worth the effort to
  eliminate synchronizations and introduce CUDA Graphs, or to bundle
  kernels using CUDA.\n\nTag: Algorithms, Heterogeneous Computing\n\n
 Registration Category: Workshop Reg Pass\n\nSession Chairs: Vassil
  Alexandrov (Hartree Centre, STFC); Jack Dongarra (University of
  Tennessee, Knoxville; Oak Ridge National Laboratory (ORNL)); Erik
  Draeger (Lawrence Livermore National Laboratory (LLNL), Center for
  Applied Scientific Computing); Christian Engelmann (Oak Ridge National
  Laboratory (ORNL)); and Dieter A. Kranzlmueller (Ludwig-Maximilians-
 Universität München, Leibniz Supercomputing Centre (LRZ))\n\n
END:VEVENT
END:VCALENDAR
