BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250626T234543Z
LOCATION:B306
DTSTART;TZID=America/New_York:20241117T113000
DTEND;TZID=America/New_York:20241117T115000
UID:submissions.supercomputing.org_SC24_sess734_ws_pawatm104@linklings.com
SUMMARY:Accelerating Multi-GPU Embedding Retrieval with PGAS style Communi
 cation for Deep Learning Recommendation Systems
DESCRIPTION:Yuxin Chen (University of California, Davis); Aydin Buluc (Law
 rence Berkeley National Laboratory (LBNL)); Katherine Yelick (University o
 f California, Berkeley); and John Owens (University of California, Davis)\
 n\nIn this paper, we propose using Partitioned Global Address Space (PGAS)
  GPU one-sided asynchronous small messages to replace the widely used coll
 ective communication calls for sparse input multi-GPU embedding retrieval 
 in deep learning recommendation systems. This GPU PGAS communication appro
 ach achieves (1) better communication and computation overlap, (2) smoothe
 r network usage, and (3) reduced overhead (due to the data unpack and rear
 rangement steps associated with collective communication calls). We implem
 ent a CUDA embedding retrieval backend for PyTorch that supports the propo
 sed PGAS communication scheme and evaluate it on deep learning recommendat
 ion inference passes. Our backend outperforms the baseline using NCCL coll
 ective calls, achieving 1.97x speedup for the weak scaling test and 2.63x 
 speedup for the strong scaling test in a 4-GPU NVLink-connected system.\n\
 nTag: Heterogeneous Computing, Parallel Programming Methods, Models, Langu
 ages and Environments, PAW-Full, Task Parallelism\n\nRegistration Category
 : Workshop Reg Pass\n\nSession Chairs: Engin Kayraklioglu (Hewlett Packard
  Enterprise (HPE)); Daniele Lezzi (Barcelona Supercomputing Center (BSC));
  Karla Vanessa Morris Wright (Sandia National Laboratories); Irene Moulits
 as (Cranfield University); Elliott Slaughter (SLAC National Accelerator La
 boratory); and Kenjiro Taura (The University of Tokyo, Japan)\n\n
END:VEVENT
END:VCALENDAR
