BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T143138Z
LOCATION:B314
DTSTART;TZID=America/New_York:20241117T110000
DTEND;TZID=America/New_York:20241117T113000
UID:submissions.supercomputing.org_SC24_sess741_ws_mlg106@linklings.com
SUMMARY:Acceleration of Graph Neural Networks with Heterogenous Accelerato
 rs Architecture
DESCRIPTION:Kaiwen Cao (University of Illinois Urbana-Champaign, Hewlett P
 ackard Labs); Archit Gajjar (Hewlett Packard Labs); Liad Gerstman (Technio
 n - Israel Institute of Technology); Kun Wu (University of Illinois Urbana
 -Champaign); Sai Rahul Chalamalasetti (d-Matrix); Aditya Dhakal, Giacomo P
 edretti, and Pavana Prakash (Hewlett Packard Labs); Wen-mei Hwu (Universit
 y of Illinois Urbana-Champaign, NVIDIA Corporation); Deming Chen (Universi
 ty of Illinois Urbana-Champaign); and Dejan Milojicic (Hewlett Packard Lab
 s)\n\nGraph Neural Networks (GNNs) have been used to solve complex problem
 s in drug discovery, social media analysis, and other domains. Meanwhile,
  GPUs are becoming the dominant accelerators for improving deep neural
  network performance. Ho
 wever, due to the characteristics of graph data, it is challenging to acce
 lerate GNN-type workloads with GPUs alone. GraphSAGE is one representative
  GNN workload that uses sampling to improve GNN learning efficiency.
  Profiling GraphSAGE with the PyG library reveals that the sampling stage
  on the CPU is the bottleneck. Hence, we propose a heterogeneous system
  architecture in which the sampling algorithm is accelerated on
  customizable accelerators (FPGA) and the sampled data is fed into GPU
  training through a PCIe Peer-to-Peer (P2P) communication flow. With FPGA
  acceleration, for the sampl
 ing stage alone, we achieve a speed-up of 2.38X to 8.55X compared with
  sampling on the CPU. For end-to-end latency, we achieve a speed-up of
  1.24X to 1.99X compared with the traditional flow.\n\nTag: Artificial
  Intelligence
 /Machine Learning, Graph Algorithms, Scalable Data Mining\n\nRegistration 
 Category: Workshop Reg Pass\n\nSession Chairs: Seung-Hwan Lim (Oak Ridge N
 ational Laboratory (ORNL)); José Moreira (IBM); Catherine Schuman (Univers
 ity of Tennessee, Knoxville); and Richard Vuduc (Georgia Institute of Tech
 nology)\n\n
END:VEVENT
END:VCALENDAR
