BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T143006Z
LOCATION:B302-B305
DTSTART;TZID=America/New_York:20241119T120000
DTEND;TZID=America/New_York:20241119T170000
UID:submissions.supercomputing.org_SC24_sess487_drs110@linklings.com
SUMMARY:Data Layout Optimizations for Tensor Applications
DESCRIPTION:Mahesh Lakshminarasimhan (University of Utah)\n\nThe performan
 ce of tensor applications is often bottlenecked by data movement across th
 e memory subsystem. This dissertation contributes domain-specific programm
 ing frameworks (compilers and runtime systems) that optimize data movement
  in tensor applications. We develop novel execution reordering and data re
 organization techniques, achieving performance portability along with impr
 oved programmability.\n\nWe present BrickDL, a compiler framework that per
 forms "merged execution" of fused deep learning operators as graph-level o
 ptimization. We employ fine-grained data blocking with "bricks" — a data l
 ayout of small, fixed-size blocks of contiguously packed data that enhance
  on-chip data locality on GPUs. BrickDL demonstrates up to 18% improved pe
 rformance and 16% reduced DRAM data movement compared to existing deep lea
 rning frameworks for prominent models on NVIDIA and AMD GPUs.\n\nThe seque
 nce of layers in neural networks is analogous to the nested hierarchy of g
 rids in the Geometric Multigrid (GMG) iterative solver. The series of sten
 cil calculations in the GMG V-cycle results in its memory-bound performanc
 e. We hence extend the optimizations in BrickDL to BrickGMG, a framework f
 or restructuring computations and exploiting inter-operator reuse in the V
 -cycle. BrickGMG provides performance portability across NVIDIA, AMD, an
 d Intel GPUs, achieving a 55% speedup over HPGMG and 73% of Roofline perf
 ormance o
 n average.\n\nWe develop MLTT, a compiler optimization pipeline in LLVM M
 LIR for arbitrary tensor transpositions, which are the primary performance
  bottleneck in tensor contractions for transforming data layouts. MLTT is 
 portable across various CPU vector instruction sets. We integrate MLTT wit
 h COMET, an MLIR-based compiler, and present speedups of >40% for memory-b
 ound tensor contractions.\n\nRegistration Category: Tech Program Reg Pass,
  Exhibits Reg Pass\n\nSession Chairs: Ayesha Afzal (Friedrich-Alexander-Un
 iversität Erlangen-Nürnberg, Erlangen National High Performance Computing 
 Center); Sally Ellingson (University of Kentucky); and Alan Sussman (Unive
 rsity of Maryland)\n\n
END:VEVENT
END:VCALENDAR
