Accelerating Multi-GPU Embedding Retrieval with PGAS style Communication for Deep Learning Recommendation Systems
Description
In this paper, we propose using Partitioned Global Address Space (PGAS) GPU one-sided asynchronous small messages to replace the widely used collective communication calls for sparse-input multi-GPU embedding retrieval in deep learning recommendation systems. This GPU PGAS communication approach achieves (1) better communication and computation overlap, (2) smoother network usage, and (3) reduced overhead (by avoiding the data unpack and rearrangement steps associated with collective communication calls). We implement a CUDA embedding retrieval backend for PyTorch that supports the proposed PGAS communication scheme and evaluate it on deep learning recommendation inference passes. Our backend outperforms the baseline using NCCL collective calls, achieving a 1.97x speedup in the weak scaling test and a 2.63x speedup in the strong scaling test on a 4-GPU NVLink-connected system.
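No code accompanies this abstract; below is a minimal sketch of the one-sided retrieval pattern it describes, written against NVSHMEM (a PGAS library for NVIDIA GPUs; the abstract does not name the PGAS runtime the authors actually use). The kernel name, the row-sharded table layout, and sizes such as DIM and ROWS_PER_PE are illustrative assumptions, not details from the paper.

#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>

constexpr int  DIM         = 128;      // embedding dimension (assumed)
constexpr long ROWS_PER_PE = 1 << 20;  // rows owned by each GPU (assumed)
constexpr int  N_LOOKUPS   = 4096;     // sparse indices per batch (assumed)

// Each PE (GPU) owns a contiguous row shard of the embedding table in
// symmetric memory. For every sparse index, a thread computes the owning
// PE and issues a small one-sided get of that row straight into the local
// output buffer; thousands of such gets stay in flight concurrently
// instead of one bulk all-to-all exchange.
__global__ void pgas_embedding_gather(const long *indices, int n,
                                      const float *table, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    long row   = indices[i];
    int  owner = (int)(row / ROWS_PER_PE);   // PE holding this row
    long local = row % ROWS_PER_PE;          // offset within that shard
    // One-sided, non-blocking get: no matching receive or unpack step
    // runs on the owning PE.
    nvshmem_float_get_nbi(out + (long)i * DIM, table + local * DIM,
                          DIM, owner);
    nvshmem_quiet();  // make sure this thread's row has arrived
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    cudaSetDevice(mype);  // simple one-PE-per-GPU mapping (assumed)

    // Symmetric allocation: every PE's shard is remotely addressable.
    float *table = (float *)nvshmem_malloc(
        (size_t)ROWS_PER_PE * DIM * sizeof(float));

    long *indices; float *out;
    cudaMalloc(&indices, N_LOOKUPS * sizeof(long));
    cudaMalloc(&out, (size_t)N_LOOKUPS * DIM * sizeof(float));

    // Placeholder data: zeroed table and indices (every lookup hits row 0
    // on PE 0). A real backend would load trained weights and batch indices.
    cudaMemset(table, 0, (size_t)ROWS_PER_PE * DIM * sizeof(float));
    cudaMemset(indices, 0, N_LOOKUPS * sizeof(long));
    cudaDeviceSynchronize();
    nvshmem_barrier_all();  // all shards ready before any remote get

    int threads = 256, blocks = (N_LOOKUPS + threads - 1) / threads;
    pgas_embedding_gather<<<blocks, threads>>>(indices, N_LOOKUPS, table, out);
    cudaDeviceSynchronize();
    printf("PE %d gathered %d embedding rows\n", mype, N_LOOKUPS);

    nvshmem_free(table);
    cudaFree(indices);
    cudaFree(out);
    nvshmem_finalize();
    return 0;
}

Launched with, e.g., nvshmrun -np 4 ./gather on a 4-GPU node, each GPU pulls exactly the rows it needs as they are touched. A collective-call baseline would instead bucket indices by owner, exchange them with an all-to-all, gather locally, and all-to-all the rows back, incurring the unpack and rearrangement overhead the abstract cites.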
Event Type
Workshop
Time
Sunday, 17 November 2024, 11:30am - 11:50am EST
Location
B306
Tags
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
PAW-Full
Task Parallelism
Registration Categories
W