Presentation
RecFlex: Enabling Feature Heterogeneity-Aware Optimization for Deep Recommendation Models with Flexible Schedules
Session: GPU Optimizations for ML
Description: Industrial recommendation models typically involve numerous feature fields. The embedding computation workloads are heterogeneous across these fields and therefore require different optimal code schedules. Existing solutions apply basic fusion optimization to embedding operations but treat all feature fields with identical schedules, which leads to suboptimal performance. In this paper, we introduce RecFlex, which generates fused kernels with distinct schedules for different feature fields. RecFlex employs an interference-aware schedule tuner to tune schedules and a heterogeneous schedule fusion compiler to generate fused kernels, addressing two major challenges. To determine the optimal schedules of different feature fields within the fused kernel, RecFlex proposes a two-stage interference-simulated tuning strategy. To handle dynamic workloads that challenge tuning and fusion, RecFlex combines compile-time schedule tuning with runtime kernel thread mapping. RecFlex surpasses state-of-the-art libraries and compilers, achieving average speedups of 2.64×, 20.77×, and 11.31× over TorchRec, HugeCTR, and RECom, respectively. RecFlex is publicly available at https://github.com/PanZaifeng/RecFlex.
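The idea of pairing per-field schedules with a runtime thread mapping can be illustrated with a minimal Python sketch. This is not RecFlex's implementation: the field names, the `samples_per_block` schedule parameter, and the helpers `build_block_map` and `locate` are all hypothetical, chosen only to show how a single fused launch might assign each thread block to a feature field once the batch sizes are known at runtime.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical schedules, assumed to have been chosen at compile time:
# each feature field gets its own number of samples per thread block.
SCHEDULES = {
    "user_id": {"samples_per_block": 8},
    "item_tags": {"samples_per_block": 2},
    "query_terms": {"samples_per_block": 4},
}


def build_block_map(batch_sizes, schedules):
    """Compute, at runtime, which range of block ids belongs to each field.

    Returns the field order and an exclusive prefix sum of per-field
    block counts, so offsets[i] is the first block id of fields[i].
    """
    fields = list(batch_sizes)
    # Ceiling division: blocks needed to cover every sample of the field.
    counts = [
        -(-batch_sizes[f] // schedules[f]["samples_per_block"]) for f in fields
    ]
    offsets = [0] + list(accumulate(counts))
    return fields, offsets


def locate(block_id, fields, offsets):
    """Map a global block id to (field name, block index within that field)."""
    idx = bisect_right(offsets, block_id) - 1
    return fields[idx], block_id - offsets[idx]


# Example workload for one batch (sizes are illustrative only).
batch_sizes = {"user_id": 16, "item_tags": 5, "query_terms": 4}
fields, offsets = build_block_map(batch_sizes, SCHEDULES)
total_blocks = offsets[-1]
assignment = [locate(b, fields, offsets) for b in range(total_blocks)]
```

In a real fused GPU kernel, the lookup performed by `locate` would run on the device and each block would then branch into the code generated for its field's schedule; here the mapping is only simulated on the host.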