Presentation

TurboMoE: Enhancing MoE Training with Optimized Gating and Efficient Parallelization
Description

The Mixture of Experts (MoE) model has emerged as a scalable solution for large-scale machine learning tasks, thanks to its dynamic expert selection. However, the gating mechanism that controls this selection, together with the all-to-all collectives it requires, can create significant computation and communication bottlenecks. In this talk, we present TurboMoE, a novel approach to accelerating MoE model training. TurboMoE employs kernel-fusion and data-layout transformations to streamline the gating process, along with a new parallelization layout that minimizes communication overhead. We also present a re-engineered MoE architecture, employed in Snowflake's Arctic, that overlaps communication with parallel computation, leading to a more efficient training process.
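
To make the routing step concrete, the sketch below shows a generic top-k MoE gating function in PyTorch and marks where the all-to-all exchange occurs in expert-parallel training. It is a minimal illustration under assumed names (topk_gate, its shapes, and top_k=2), not TurboMoE's actual kernels or layout.

```python
# Minimal top-k MoE gating sketch (assumed names, not TurboMoE's implementation).
import torch
import torch.nn.functional as F

def topk_gate(tokens: torch.Tensor, gate_weight: torch.Tensor, top_k: int = 2):
    """Route each token to its top-k experts.

    tokens:      [num_tokens, hidden_dim]
    gate_weight: [hidden_dim, num_experts]
    Returns per-token expert indices and normalized routing weights.
    """
    logits = tokens @ gate_weight                      # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    weights, expert_idx = torch.topk(probs, top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    # In expert-parallel training, tokens are next grouped by expert and
    # exchanged across devices with an all-to-all collective -- the step whose
    # cost the talk's fused gating kernels and layout changes aim to reduce.
    return expert_idx, weights

# Example: 8 tokens, hidden size 16, 4 experts
tokens = torch.randn(8, 16)
gate_weight = torch.randn(16, 4)
idx, w = topk_gate(tokens, gate_weight)
print(idx.shape, w.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```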