BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250626T234541Z
LOCATION:B309
DTSTART;TZID=America/New_York:20241119T153000
DTEND;TZID=America/New_York:20241119T160000
UID:submissions.supercomputing.org_SC24_sess401_pap270@linklings.com
SUMMARY:MCFuser: High-performance and Rapid-fusion of Memory-bound Compute
 -intensive Operators
DESCRIPTION:Zheng Zhang (Wuhan University School of Computer Science); Don
 glin Yang (NVIDIA Corporation); Xiaobo Zhou (University of Macau, Departme
 nt of Computer and Information Sciences); and Dazhao Cheng (Wuhan Universi
 ty School of Computer Science)\n\nOperator fusion enhances data locality a
 nd reduces GPU memory bandwidth pressure, but struggles with multiple comp
 ute-intensive operators due to saturated computation throughput. The varia
 bility in tensor sizes can make these operators memory-bound, necessitatin
 g efficient fused kernel generation, challenged by limited scheduling spac
 es, redundant accesses, and long tuning times.\nWe present MCFuser, a fram
 ework that efficiently generates high-performance fused kernels for memory
 -bound compute-intensive (MBCI) operator chains. MCFuser uses high-level t
 iling expressions for expansive search space delineation and Directed Acyc
 lic Graph (DAG) analysis to cut redundant memory access, optimizing kernel
  performance. It prunes the search space with specific guidelines and comb
 ines an analytical performance model with heuristic search, significantly 
 speeding up tuning. In tests with NVIDIA A100 and RTX3080 GPUs, MCFuser ou
 tperformed leading compilers like Ansor, delivering up to 5.9x kernel spee
 dup and reducing tuning time by over 70-fold, proving its effectiveness an
 d efficiency in enhancing kernel performance.\n\nTag: Accelerators, Compil
 ers, Heterogeneous Computing, Performance Evaluation and/or Optimization T
 ools\n\nRegistration Category: Tech Program Reg Pass\n\nSession Chair: Sas
 cha Hunold (Technical University of Vienna)\n\n
END:VEVENT
END:VCALENDAR
