Presentation

AI and Scientific Research Computing with Kubernetes
Description
Kubernetes, maintained by the Cloud Native Computing Foundation, has emerged as the leading container orchestration solution and works on resources ranging from on-prem clusters to commercial clouds. The Kubernetes ecosystem has grown to support batch-type workflows and has developed rich semantics that allow execution of complex scientific computing workflows typically not feasible on batch systems. This growth has been possible thanks to tools like k8s-sig/kueue, the kubeflow/mpi-operator, k8s/scheduler-plugins, and k8s/device-plugin, among many projects that the community has created to enable complex workloads that leverage Kubernetes' rich API system.

The tutorial aims to educate AI and computational science researchers on Kubernetes as a resource management system, comparing it with traditional batch systems. It covers IO/storage options and the use of GPU and MPI operators in Kubernetes to scale workloads over high-performance networks like InfiniBand. Attendees will receive an overview of Kubernetes architecture and job submission procedures; learn about storage options; run hands-on AI inference, training, and scientific research software examples using Kubernetes on CPU and GPU resources; and explore MPI examples for scaling out. Theoretical knowledge will be reinforced with hands-on sessions using Nautilus, the PNRP production Kubernetes cluster.
Event Type
Tutorial
Time
Sunday, 17 November 2024
8:30am - 5pm EST
Location
B206
Tags
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Portability
Runtime Systems
Registration Categories
TUT