Presentation
Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching
Description

Deep Learning Recommendation Models (DLRMs) are difficult to train because their embedding tables require large amounts of memory and distributed training incurs significant communication overhead. Traditional methods such as Tensor-Train (TT) decomposition compress these tables effectively but add computational load, and existing distributed training frameworks fall short because of their excessive data exchange requirements.
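To make the compression idea concrete, the following is a minimal PyTorch sketch of a TT-compressed embedding table, not EcoRec's actual implementation: the row index space and the embedding width are both factorized, so only three small cores are stored instead of the full table, and each requested row is reconstructed at lookup time. The class name, shapes, and ranks are illustrative assumptions.

```python
import torch

# Minimal sketch of a 3-core Tensor-Train (TT) embedding table (illustrative,
# not EcoRec's API). The row index space N = n1 * n2 * n3 and the embedding
# width D = d1 * d2 * d3 are both factorized, so only three small cores are
# stored instead of the full (N, D) table.
class TTEmbedding(torch.nn.Module):
    def __init__(self, n=(200, 200, 250), d=(4, 4, 8), ranks=(1, 16, 16, 1)):
        super().__init__()
        self.n = n
        self.out_dim = d[0] * d[1] * d[2]
        # Core k has shape (r_k, n_k, d_k, r_{k+1}).
        self.cores = torch.nn.ParameterList(
            [torch.nn.Parameter(0.01 * torch.randn(ranks[k], n[k], d[k], ranks[k + 1]))
             for k in range(3)]
        )

    def forward(self, rows):                         # rows: (B,) int64 row indices
        n1, n2, n3 = self.n
        i1, i2, i3 = rows // (n2 * n3), (rows // n3) % n2, rows % n3
        g1 = self.cores[0][:, i1]                    # (1, B, d1, r2)
        g2 = self.cores[1][:, i2]                    # (r2, B, d2, r3)
        g3 = self.cores[2][:, i3]                    # (r3, B, d3, 1)
        # Contract along the TT ranks to rebuild each requested row.
        out = torch.einsum('abcd,dbef->abcef', g1, g2)
        out = torch.einsum('abcef,fbgh->abcegh', out, g3)
        return out.reshape(rows.shape[0], self.out_dim)
```

The extra einsum contractions at every lookup are exactly the added computational load that TT compression trades for its memory savings.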
We introduce EcoRec, a library that accelerates DLRM training by integrating TT decomposition with distributed training. EcoRec streamlines TT operations through a new computation pattern and an optimized multiplication scheme, sharply cutting computation time. It adds a micro-batching method based on sorted indices that reduces memory use without extra computation, and a pipeline for embedding layers that balances data distribution and improves communication efficiency. Built on PyTorch and CUDA and evaluated on a 32-GPU cluster, EcoRec substantially outperforms EL-Rec, delivering up to 3.1× faster training while reducing memory needs by 38.5%.
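The sorted-index micro-batching idea can be sketched as follows; this is a hypothetical illustration assuming a TT embedding module like the one above, not EcoRec's actual interface. Sorting groups duplicate indices into the same micro-batch, so each micro-batch only reconstructs its unique rows, and the per-step activation footprint is bounded by the micro-batch size rather than the full batch.

```python
import torch

def microbatched_lookup(tt_embedding, rows, micro_batch_size=1024):
    # Sort the batch's indices so duplicates fall into the same micro-batch,
    # then reconstruct each micro-batch only for its unique rows.
    order = torch.argsort(rows)
    sorted_rows = rows[order]
    chunks = []
    for start in range(0, rows.shape[0], micro_batch_size):
        chunk = sorted_rows[start:start + micro_batch_size]
        uniq, inverse = torch.unique(chunk, return_inverse=True)
        chunks.append(tt_embedding(uniq)[inverse])   # one TT reconstruction per unique row
    sorted_out = torch.cat(chunks)
    # Undo the sort so the output lines up with the original batch order.
    inv_order = torch.empty_like(order)
    inv_order[order] = torch.arange(order.numel(), device=order.device)
    return sorted_out[inv_order]
```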