BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250626T234542Z
LOCATION:B213
DTSTART;TZID=America/New_York:20241118T083000
DTEND;TZID=America/New_York:20241118T120000
UID:submissions.supercomputing.org_SC24_sess434_tut108@linklings.com
SUMMARY:Core-Level Performance Engineering
DESCRIPTION:Jan Laukemann (University of Erlangen-Nuremberg, Germany; Erla
 ngen National High Performance Computing Center) and Georg Hager (Erlangen
  National High Performance Computing Center)\n\nWhile many developers put 
 a lot of effort into optimizing large-scale parallelism, they often neglec
 t the importance of efficient serial code. Even worse, slow serial code
  tends to scale very well, hiding the fact that resources are wasted becau
 se no definite hardware performance limit (“bottleneck”) is exhausted. Thi
 s tutorial conveys the required knowledge to develop a thorough understand
 ing of the interactions between software and hardware on the level of a si
 ngle CPU core and the lowest memory hierarchy level (the L1 cache). We int
 roduce general out-of-order core architectures and their typical performan
 ce bottlenecks using modern x86-64 (Intel Sapphire Rapids) and ARM (Fujits
 u A64FX) processors as examples. We then go into detail about x86 assembly
  code, specifically including vectorization (SIMD), pipeline utilization, 
 critical paths, and loop-carried dependencies. We also demonstrate perform
 ance analysis and performance engineering using the Open-Source Architectu
 re Code Analyzer (OSACA) in combination with a dedicated instance of the w
 ell-known Compiler Explorer. Various hands-on exercises will allow attende
 es to run their own experiments and measurements and identify in-core per
 formance bottlenecks. Furthermore, we show real-life use cases from comput
 ational science (sparse solvers) to emphasize how profitable in-core perfo
 rmance engineering can be.\n\nTag: Architecture, Broader Engagement, Perfo
 rmance Evaluation and/or Optimization Tools, Portability\n\nRegistration C
 ategory: Tutorial Reg Pass\n\n
END:VEVENT
END:VCALENDAR
