BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T143140Z
LOCATION:B312
DTSTART;TZID=America/New_York:20241118T090500
DTEND;TZID=America/New_York:20241118T100000
UID:submissions.supercomputing.org_SC24_sess811_misc270@linklings.com
SUMMARY:From Tensor Processing Primitive towards Tensor Compilers using up
 stream MLIR
DESCRIPTION:Alexander Heinecke (Intel Corporation)\n\nDuring the past deca
 de, Deep Learning (DL) algorithms, programming systems and hardware have c
 onverged with the High Performance Computing (HPC) counterparts. Neverthel
 ess, the programming methodology of DL and HPC systems is stagnant, relyin
 g on highly-optimized, yet platform-specific and inflexible vendor-optimiz
 ed libraries. Such libraries provide close-to-peak performance on specific
  platforms, kernels and shapes thereof to which vendors have dedicated opt
 imization efforts, while they underperform in the remaining use-cases, yie
 ldi
 ng non-portable codes with performance glass-jaws. This talk will shed lig
 ht on abstraction efforts, mainly targeting CPUs and widening to GPUs the
  closer the approaches get to DSLs/Compilers. We will introduce the Tensor
  Processing Primitives (TPP) as a virtual and software-defined ISA abstrac
 tion in the form of ukernels. Subsequently, we will cover programmi
 ng abstractions on top of TPP, which are built in two steps: 1) Expressi
 ng the com
 putational core using Tensor Processing Primitives (TPPs): a compact, vers
 atile set of 2D-tensor operators, 2) Expressing the logical loops around T
 PPs in a high-level, declarative fashion whereas the exact instantiation (
 ordering, tiling, parallelization) is determined via simple knobs. We demo
 nstrate the efficacy of our approach using standalone kernels and end-to-e
 nd workloads that outperform state-of-the-art implementations on diverse C
 PU platforms. We will close the talk by demonstrating how TPP can be the a
 rchitectural target of a tensor compiler, which in turn can generate code
  with hand-coded performance.\n\nTag: I/O, Storage, Archive\n\nRegistratio
 n Category: Workshop Reg Pass\n\nSession Chairs: Glenn Brook (Cornelis Net
 works, University of Tennessee); Clayton Hughes (Sandia National Laborator
 ies); Nalini Kumar (Intel Corporation); Hatem Ltaief (King Abdullah Univer
 sity of Science and Technology (KAUST)); David Martin (Lawrence Berkeley N
 ational Laboratory (LBNL), Energy Sciences Network (ESnet)); and Amit Ruhe
 la (Texas Advanced Computing Center (TACC), University of Texas)\n\n
END:VEVENT
END:VCALENDAR
