Presentation

MCFuser: High-performance and Rapid-fusion of Memory-bound Compute-intensive Operators
Description
Operator fusion enhances data locality and reduces GPU memory bandwidth pressure, but struggles with multiple compute-intensive operators due to saturated computation throughput. Variability in tensor sizes can make these operators memory-bound, necessitating efficient fused kernel generation, a task hampered by limited scheduling spaces, redundant accesses, and long tuning times.
We present MCFuser, a framework that efficiently generates high-performance fused kernels for memory-bound compute-intensive (MBCI) operator chains. MCFuser uses high-level tiling expressions for expansive search space delineation and Directed Acyclic Graph (DAG) analysis to cut redundant memory access, optimizing kernel performance. It prunes the search space with specific guidelines and combines an analytical performance model with heuristic search, significantly speeding up tuning. In tests with NVIDIA A100 and RTX3080 GPUs, MCFuser outperformed leading compilers like Ansor, delivering up to 5.9x kernel speedup and reducing tuning time by over 70-fold, proving its effectiveness and efficiency in enhancing kernel performance.
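The claim that tensor sizes can turn a compute-intensive operator memory-bound can be illustrated with a roofline-style estimate. The following is a minimal sketch, not MCFuser's performance model: the function names are hypothetical, the hardware peaks are approximate published A100 figures used only as default assumptions, and each matrix is assumed to cross global memory exactly once.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Assumes each matrix is read or written in global memory exactly once
    and that elements are 2 bytes wide (e.g. FP16).
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_memory_bound(m, n, k, peak_flops=312e12, peak_bw=1.555e12):
    """True if the GEMM falls below the roofline ridge point.

    Defaults are assumed, approximate A100 peaks (FP16 tensor-core
    throughput in FLOP/s and HBM bandwidth in bytes/s).
    """
    ridge = peak_flops / peak_bw  # FLOPs per byte needed to saturate compute
    return gemm_arithmetic_intensity(m, n, k) < ridge


def chain_bytes(m, k, n, l, fused, bytes_per_elem=2):
    """Global-memory traffic for the chain E = (A @ B) @ D.

    A is m*k, B is k*n, D is n*l, E is m*l. Without fusion the m*n
    intermediate is written once and read once; with fusion it is
    assumed to stay on chip.
    """
    elems = m * k + k * n + n * l + m * l
    if not fused:
        elems += 2 * m * n  # intermediate round trip through global memory
    return bytes_per_elem * elems


if __name__ == "__main__":
    print(is_memory_bound(4096, 4096, 4096))  # large square GEMM
    print(is_memory_bound(512, 64, 64))       # small, skinny GEMM
    saved = chain_bytes(512, 64, 512, 64, fused=False) - chain_bytes(512, 64, 512, 64, fused=True)
    print(saved)  # bytes of intermediate traffic removed by fusion
```

Under these assumptions a large square GEMM sits well above the ridge point, while a small or skinny one falls below it, which is the regime where fusing a chain and keeping the intermediate on chip reduces memory traffic.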
Event Type
Paper
Time
Tuesday, 19 November 2024, 3:30pm - 4pm EST
Location
B309
Tags
Accelerators
Compilers
Heterogeneous Computing
Performance Evaluation and/or Optimization Tools
Registration Categories
TP