BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T143139Z
LOCATION:B306
DTSTART;TZID=America/New_York:20241121T163000
DTEND;TZID=America/New_York:20241121T164500
UID:submissions.supercomputing.org_SC24_sess530_drs118@linklings.com
SUMMARY:Algorithmic and Optimization Techniques for Graph Applications in 
 Heterogeneous Systems at Scale
DESCRIPTION:Reece Neff (North Carolina State University, Pacific Northwest
  National Laboratory (PNNL))\n\nAs heterogeneity becomes commonplace in HP
 C systems, algorithmic and optimization techniques are needed to address t
 he challenges that come with it, especially for irregular applications. Th
 is includes workload balancing, scheduling, latency tolerance, and memory 
 utilization and contention, among others. This showcase covers three works
  addressing key questions in running complex irregular graph applications 
 on heterogeneous systems: programmability, performance portability, memory
  efficiency, load balancing, and scalability.\n\nThe first work explores t
 he efficacy of utilizing commercial high-level synthesis tools to accelera
 te two different graph sampling methods on FPGAs. We achieve up to a 40x s
 peedup compared to the baseline OpenCL kernel, and identify key areas for 
 toolchain improvements, such as memory subsystems and latency tolerance.\n
 \nThe second work focuses on improving breadth-first probabilistic travers
 als (BPTs), as they dominate runtime in some applications. By identifying 
 and exploiting redundancies in edge accesses, we achieve an average of 75x
  and 135x speedups when deployed on two different frameworks. We also demo
 nstrate strong scaling up to 4,096 nodes on OLCF Frontier enabled by CPU-G
 PU heterogeneous workload balancing.\n\nThe third work is currently in pro
 gress, exploring the use of lossy compression to enable training on graph 
 neural networks. We have promising preliminary results, showing a compress
 ion ratio of 6x to 20x with minimal accuracy loss on both GCN and GAT
 . We identify future directions and use cases for this method with an emph
 asis on systems integration such as larger batch sizes in mini-batch train
 ing, compressing feature vector caches, and adaptive compression methods f
 or heterogeneous and dynamic GNNs.\n\nRegistration Category: Tech Program 
 Reg Pass\n\nSession Chair: Ian Lumsden (University of Tennessee, Knoxville
 )\n\n
END:VEVENT
END:VCALENDAR
