Presentation
Keynote: Network and Communication Infrastructure Powering Meta's GenAI and Recommendation Systems
Session: Communication, I/O, and Storage at Scale on Next-Generation Platforms – Scalable Infrastructures
Description: In 2020, Meta changed the way we did AI training. We moved to a synchronous training approach to power our recommendation systems. This pivot required us to build high-speed, low-latency RDMA networks to interconnect GPUs. Over the years, Meta has built some of the largest AI clusters in the world to support training increasingly complex models that power rich user experiences. We initially built with Ethernet as our interconnect and later also onboarded InfiniBand into production. Model complexity and scale recently increased by an order of magnitude with the evolution of GenAI, highlighted by our Llama series of foundation models. In this talk, we will take you through the evolution of Meta's AI network and communication library software over the past five years. We will discuss the problems we ran into as we scaled this infrastructure and how we customized our training systems software stack to work through them. We will highlight the changes we made to the scheduling, collective communication, sharding, and network transport layers to keep our clusters performant from a communication perspective.