Presentation
An Efficient Checkpointing System for Large Machine Learning Model Training
Session: Communication, I/O, and Storage at Scale on Next-Generation Platforms – Scalable Infrastructures
Description: As machine learning models rapidly increase in size and complexity, the cost of checkpointing during ML training has become a bottleneck in both storage and performance (time). For example, the latest GPT-4 model has approximately 1.76 trillion parameters. Frequently writing checkpoints containing more than a trillion floating-point values to storage is extremely time- and storage-consuming. This work aims to understand and mitigate this problem. First, we characterize the checkpointing interface in a collection of representative large machine learning/language models with respect to storage consumption and performance overhead. Second, we propose two optimizations: i) a periodic cleaning strategy that removes outdated checkpoints to reduce the storage burden; and ii) a data staging optimization that coordinates checkpoints between local and shared file systems to improve performance.
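The abstract does not detail either optimization, so the sketch below is only a rough illustration of the two ideas under stated assumptions: a keep-last-N cleanup policy combined with writing checkpoints to node-local storage before staging them to a shared file system. All names (save_checkpoint, local_dir, shared_dir, keep_last) and the on-disk layout are hypothetical and are not taken from the presented system.

```python
import os
import shutil


def save_checkpoint(state_bytes: bytes, step: int,
                    local_dir: str, shared_dir: str,
                    keep_last: int = 3) -> None:
    """Illustrative sketch: write a checkpoint to fast local storage,
    stage it to the shared file system, and prune outdated checkpoints."""
    os.makedirs(local_dir, exist_ok=True)
    os.makedirs(shared_dir, exist_ok=True)

    # 1) Write to node-local storage first, so the training step is not
    #    blocked on the (slower, contended) shared parallel file system.
    local_path = os.path.join(local_dir, f"ckpt_{step:08d}.bin")
    with open(local_path, "wb") as f:
        f.write(state_bytes)

    # 2) Stage the checkpoint to the shared file system. A real system
    #    would likely do this asynchronously (background thread/process)
    #    so training can resume immediately after the local write.
    shutil.copy2(local_path, os.path.join(shared_dir, os.path.basename(local_path)))

    # 3) Periodic cleanup: keep only the newest `keep_last` checkpoints
    #    on the shared file system to bound storage consumption.
    ckpts = sorted(
        name for name in os.listdir(shared_dir)
        if name.startswith("ckpt_") and name.endswith(".bin")
    )
    for old in ckpts[:-keep_last]:
        os.remove(os.path.join(shared_dir, old))


if __name__ == "__main__":
    # Toy usage with dummy checkpoint data and hypothetical paths.
    for step in range(5):
        save_checkpoint(b"\x00" * 1024, step, "/tmp/ckpt_local", "/tmp/ckpt_shared")
```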