Presentation

Training Large-Scale Vision Transformer Foundation Models for Science and Engineering Applications
Description

The Vision Transformer (ViT) is a powerful AI architecture for computer vision, used by most imaging foundation models because it is effective at discerning complex visual patterns across many tasks. However, training large-scale ViT foundation models requires considerable computing resources and therefore carries a significant energy footprint. For example, OpenAI’s Sora video generation model was trained on more than 10,000 NVIDIA H100 GPUs, and the training took more than a month on a supercomputer. The energy consumed in training Sora was equivalent to the total annual energy consumption of 300 US households. This project aims to co-design the scaling algorithm and the ViT architecture to achieve hardware-, modality-, and energy-conscious computing for ViT foundation models. We anticipate that the proposed training approaches will significantly improve energy efficiency and reduce carbon footprint while also improving computing efficiency and scalability, fostering an accelerated AI development cycle.