Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching
Description
Deep Learning Recommendation Models (DLRMs) are challenged by the high memory demands of their embedding tables and by significant communication overhead in distributed settings. Compression methods such as Tensor-Train (TT) decomposition shrink these tables effectively but add computational load. Furthermore, existing distributed training frameworks suffer from excessive data exchange.
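As background on how TT decomposition trades memory for compute, here is a minimal sketch of a TT-format embedding lookup; the dimensions, ranks, and the tt_embedding_row helper are illustrative assumptions, not taken from the paper:

```python
# Minimal sketch of a TT-decomposed embedding lookup (illustrative only,
# not EcoRec's implementation). A vocab of size I = i1*i2*i3 with embedding
# dim J = j1*j2*j3 is stored as three small TT cores instead of one I x J table.
import torch

i_dims, j_dims, ranks = (200, 220, 250), (4, 4, 8), (1, 16, 16, 1)

# Core k has shape (r_k, i_k, j_k, r_{k+1}). These cores hold ~270K
# parameters in place of a full (200*220*250) x 128 table (~1.4B values).
cores = [
    torch.randn(ranks[k], i_dims[k], j_dims[k], ranks[k + 1]) * 0.1
    for k in range(3)
]

def tt_embedding_row(idx: int) -> torch.Tensor:
    """Reconstruct one embedding row as a chain of small matrix products."""
    # Decompose the flat row index into per-core indices (mixed radix).
    sub = []
    for i_k in reversed(i_dims):
        sub.append(idx % i_k)
        idx //= i_k
    sub.reverse()
    # Contract the selected core slices: (j1 x r1)(r1 x j2*r2)... step by step.
    acc = cores[0][:, sub[0], :, :].reshape(j_dims[0], ranks[1])  # r0 == 1
    for k in (1, 2):
        slice_k = cores[k][:, sub[k], :, :]            # (r_k, j_k, r_{k+1})
        acc = torch.einsum('ar,rbs->abs', acc, slice_k).reshape(-1, ranks[k + 1])
    return acc.reshape(-1)                             # length j1*j2*j3 = 128

row = tt_embedding_row(123456)
print(row.shape)  # torch.Size([128])
```

The compression is what makes TT attractive for embedding tables, while the chain of contractions per lookup is the extra computational load the abstract refers to.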
We introduce EcoRec, a library that accelerates DLRM training by integrating TT decomposition with distributed training. EcoRec contributes a computation pattern that streamlines TT operations together with an optimized multiplication approach, substantially cutting computation time. It further provides a micro-batching scheme based on sorted indices that reduces memory usage without adding computation, and a pipeline for embedding layers that promotes balanced data distribution and communication efficiency. Built on PyTorch and CUDA and evaluated on a 32-GPU cluster, EcoRec significantly outperforms EL-Rec, delivering up to 3.1× faster training while reducing memory usage by 38.5%.
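The sorted-index micro-batching idea suggests a pattern like the sketch below. The lookup_microbatched function, the micro-batch splitting, and the deduplication via torch.unique are hypothetical details for illustration, not EcoRec's actual algorithm; the sketch reuses the tt_embedding_row helper from above:

```python
# Hedged sketch of index-sorted micro-batching for TT embedding lookups.
# Sorting the batch indices makes duplicates adjacent, so each micro-batch
# can deduplicate cheaply and run each distinct row's TT contraction once,
# while small micro-batches bound the intermediate activations in memory.
import torch

def lookup_microbatched(indices: torch.Tensor, micro_size: int) -> torch.Tensor:
    order = torch.argsort(indices)                  # sort once per batch
    out = torch.empty(indices.numel(), 128)
    for start in range(0, indices.numel(), micro_size):
        chunk = order[start:start + micro_size]
        uniq, inverse = torch.unique(indices[chunk], return_inverse=True)
        rows = torch.stack([tt_embedding_row(int(i)) for i in uniq])
        out[chunk] = rows[inverse]                  # scatter back to batch order
    return out

idx = torch.randint(0, 200 * 220 * 250, (4096,))
emb = lookup_microbatched(idx, micro_size=512)
print(emb.shape)  # torch.Size([4096, 128])
```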
Event Type: Paper
Time: Wednesday, 20 November 2024, 1:30pm - 2pm EST
Location: B308
Tags
Accelerators
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Distributed Computing
Graph Algorithms
Heterogeneous Computing
Tensors
Registration Categories: TP