
Presentation

Distributed Deep Learning on GPU-Based Clusters
Description

Deep learning (DL) is rapidly becoming pervasive in almost all areas of computer science and is even being used to assist computational science simulations and data analysis. A key behavior of deep neural networks (DNNs) is that they scale reliably, i.e., they continue to improve in performance as the number of model parameters and the amount of data grow. As the demand for larger, more sophisticated, and more accurate DL models increases, the need for large-scale parallel model training, fine-tuning, and inference has become increasingly pressing. Consequently, several parallel algorithms and frameworks have been developed in the past few years to parallelize model training and fine-tuning on GPU-based platforms. This tutorial will introduce the basics of the state of the art in distributed DL. We will use large language models (LLMs) as a running example and provide hands-on training in three essential tasks for working with DNNs: (i) training a DNN from scratch, (ii) continued training/fine-tuning of a DNN from a checkpoint, and (iii) inference on a trained DNN. We will cover algorithms and frameworks that employ data parallelism (PyTorch DDP and DeepSpeed) and model parallelism (AxoNN).
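As a small taste of the data-parallel approach covered in the tutorial, the sketch below wraps a toy model in PyTorch's DistributedDataParallel and runs one training step. It is a minimal, hypothetical illustration, not tutorial material: it uses a single CPU process (world size 1) with the gloo backend to stand in for a real multi-GPU job, and the Linear model and random tensors are placeholders for an actual LLM and dataset.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process stand-in for a multi-rank launch (normally set by torchrun).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 1)   # placeholder model; a real job would use an LLM
ddp_model = DDP(model)          # DDP all-reduces gradients across ranks

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

# In data parallelism, each rank would see its own shard of the batch.
x = torch.randn(4, 8)
y = torch.randn(4, 1)

loss = torch.nn.functional.mse_loss(ddp_model(x), y)
loss.backward()                 # gradient synchronization happens here
optimizer.step()

dist.destroy_process_group()
```

In a real run, the same script would be launched with `torchrun --nproc_per_node=<num_gpus>`, and each rank would place its model replica on a different GPU while DDP keeps the replicas in sync.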