Data Layout Optimizations for Tensor Applications
The performance of tensor applications is often bottlenecked by data movement across the memory subsystem. This dissertation contributes domain-specific programming frameworks (compilers and runtime systems) that optimize data movement in these applications. We develop novel execution-reordering and data-reorganization techniques that achieve performance portability along with improved programmability.
We present BrickDL, a compiler framework that performs "merged execution" of fused deep learning operators as a graph-level optimization. We employ fine-grained data blocking with "bricks", small fixed-size blocks of contiguously packed data that enhance on-chip data locality on GPUs. BrickDL demonstrates up to 18% higher performance and 16% less DRAM data movement than existing deep learning frameworks for prominent models on NVIDIA and AMD GPUs.
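To make the layout concrete, the following C++ sketch packs a 3-D grid into 4x4x4 bricks and maps a global index to its brick and the offset within that brick. The brick size, the type names, and the indexing scheme are illustrative assumptions for this sketch, not the BrickDL implementation.

```cpp
// Minimal sketch of a bricked data layout, assuming a 4x4x4 brick and domain
// extents that are multiples of the brick dimension. The names and the brick
// size are illustrative, not the BrickDL API.
#include <cstddef>
#include <vector>

constexpr int BDIM  = 4;                   // elements per brick axis
constexpr int BSIZE = BDIM * BDIM * BDIM;  // elements per brick, stored contiguously

struct BrickedGrid {
    int nbx, nby, nbz;          // number of bricks along each axis
    std::vector<float> data;    // all bricks packed back to back

    BrickedGrid(int nx, int ny, int nz)
        : nbx(nx / BDIM), nby(ny / BDIM), nbz(nz / BDIM),
          data(static_cast<std::size_t>(nbx) * nby * nbz * BSIZE) {}

    // Map a global (x, y, z) index into the packed layout: locate the brick,
    // then the offset inside it. Elements of a brick stay adjacent in memory,
    // which is what improves on-chip data locality.
    float& at(int x, int y, int z) {
        std::size_t brick  = (static_cast<std::size_t>(z / BDIM) * nby + y / BDIM) * nbx + x / BDIM;
        std::size_t within = (static_cast<std::size_t>(z % BDIM) * BDIM + y % BDIM) * BDIM + x % BDIM;
        return data[brick * BSIZE + within];
    }
};
```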
The sequence of layers in a neural network is analogous to the nested hierarchy of grids in the Geometric Multigrid (GMG) iterative solver. The series of stencil calculations in the GMG V-cycle makes its performance memory bound. We therefore extend the optimizations in BrickDL to BrickGMG, a framework that restructures computations and exploits inter-operator reuse in the V-cycle. BrickGMG provides performance portability across NVIDIA, AMD, and Intel GPUs, achieving a 55% speedup over HPGMG and 73% of Roofline performance on average.
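To make the series of stencil sweeps concrete, the sketch below shows a textbook 1-D V-cycle: every stage (smooth, residual, restrict, prolongate) is a separate grid-wide stencil pass, which is why the solver is memory bound unless consecutive stages are fused and data is reused on chip. The kernels and their structure are illustrative assumptions for this sketch, not the BrickGMG or HPGMG code.

```cpp
// Schematic 1-D Poisson V-cycle (grid sizes assumed to be 2^k + 1).
// Each function below is a separate stencil sweep over the whole grid.
#include <vector>
using Grid = std::vector<double>;

// One weighted-Jacobi sweep for -u'' = f on a unit-spaced grid.
void smooth(Grid& u, const Grid& f) {
    Grid old = u;
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        u[i] = 0.5 * old[i] + 0.5 * (0.5 * (old[i-1] + old[i+1] + f[i]));
}

Grid residual(const Grid& u, const Grid& f) {
    Grid r(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (2.0 * u[i] - u[i-1] - u[i+1]);
    return r;
}

Grid restrict_fine_to_coarse(const Grid& r) {   // full weighting
    Grid rc(r.size() / 2 + 1, 0.0);
    for (std::size_t i = 1; i + 1 < rc.size(); ++i)
        rc[i] = 0.25 * (r[2*i - 1] + 2.0 * r[2*i] + r[2*i + 1]);
    return rc;
}

void prolong_and_correct(Grid& u, const Grid& e) {  // linear interpolation
    for (std::size_t i = 1; i < e.size(); ++i) {
        u[2*i - 1] += 0.5 * (e[i-1] + e[i]);
        if (i + 1 < e.size()) u[2*i] += e[i];
    }
}

// Without fusion, every stage streams the grid through DRAM again.
void vcycle(Grid& u, const Grid& f, int level) {
    smooth(u, f);                                   // pre-smooth
    if (level > 0) {
        Grid r  = residual(u, f);
        Grid rc = restrict_fine_to_coarse(r);
        Grid ec(rc.size(), 0.0);
        vcycle(ec, rc, level - 1);                  // recurse on coarser grid
        prolong_and_correct(u, ec);
    }
    smooth(u, f);                                   // post-smooth
}
```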
We develop MLTT, a compiler optimization pipeline in LLVM MLIR for arbitrary tensor transpositions, the data-layout transformations that are the primary performance bottleneck in tensor contractions. MLTT is portable across various CPU vector instruction sets. We integrate MLTT with COMET, an MLIR-based compiler, and demonstrate speedups of over 40% for memory-bound tensor contractions.
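As an illustration of the loop restructuring such a pipeline can generate, the following sketch tiles a transposition so that both source and destination tiles stay cache resident and the inner loops can be vectorized. The tile size and the reduction of a higher-rank permutation to a 2-D core are assumptions of this sketch, not MLTT's actual lowering.

```cpp
// Illustrative tiled transposition kernel; TILE is a tunable assumption.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 32;  // tile edge, tuned per cache and vector width

// B[j][i] = A[i][j] for an n x m row-major matrix A (the 2-D core that remains
// after collapsing the outer indices of a higher-rank permutation).
// B must be preallocated with m * n elements.
void transpose_tiled(const std::vector<float>& A, std::vector<float>& B,
                     std::size_t n, std::size_t m) {
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t jj = 0; jj < m; jj += TILE)
            // Within a tile, accesses to A are unit stride along j and the
            // writes to B touch a small, cache-resident block of rows.
            for (std::size_t i = ii; i < std::min(ii + TILE, n); ++i)
                for (std::size_t j = jj; j < std::min(jj + TILE, m); ++j)
                    B[j * n + i] = A[i * m + j];
}
```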