Presentation
Eve: Less Memory, Same Might
Description
Adaptive optimizers, which adjust the learning rate for individual parameters, have become the standard for training deep neural networks. AdamW is a popular adaptive method that maintains two optimizer state values (momentum and variance) per parameter, doubling the model’s memory usage during training. Many proposed memory-efficient optimizers claim to match AdamW’s performance but lack its desirable qualities, such as robustness to learning rate changes. This quality is especially desirable when pre-training LLMs, where experimenting with different hyperparameters is infeasible. We propose Eve, a Memory Efficient AdaptiVe Moment Estimation algorithm that saves memory by reducing the variance term while preserving AdamW’s desirable properties across different training settings. We fine-tune Llama 2 70B on 64 GPUs and show memory savings of 20% compared to AdamW. We also compare our method to Adam-mini, a recent, well-received memory-efficient optimizer, and demonstrate better training stability across various learning rates.
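The abstract does not spell out how Eve compresses the variance term, so the sketch below is only a hypothetical illustration of the general idea it describes: keep AdamW's full per-parameter first moment (momentum) but store a reduced second moment (variance), so optimizer state shrinks relative to AdamW. The class name ReducedVarianceAdamW and the specific reduction used here (a single second-moment scalar per parameter tensor) are assumptions for illustration, not the authors' algorithm.

```python
# Hypothetical sketch: AdamW-style update with the second moment ("variance")
# compressed to one scalar per parameter tensor. This is NOT Eve itself; the
# abstract only says Eve "saves memory by reducing the variance term".
import torch


class ReducedVarianceAdamW(torch.optim.Optimizer):
    """Keeps a full first moment per parameter (as in AdamW) but only a
    scalar second moment per tensor, roughly halving optimizer state."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.01):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)  # full momentum tensor
                    # Reduced variance: a single scalar for the whole tensor.
                    state["v"] = torch.zeros((), device=p.device, dtype=p.dtype)
                state["step"] += 1
                t = state["step"]

                # Decoupled weight decay (the "W" in AdamW).
                p.mul_(1 - group["lr"] * group["weight_decay"])

                # First moment: identical to AdamW.
                state["m"].mul_(beta1).add_(p.grad, alpha=1 - beta1)

                # Second moment: EMA of the mean squared gradient of the
                # whole tensor instead of a per-element variance.
                state["v"].mul_(beta2).add_(p.grad.pow(2).mean(),
                                            alpha=1 - beta2)

                # Bias-corrected update.
                m_hat = state["m"] / (1 - beta1 ** t)
                v_hat = state["v"] / (1 - beta2 ** t)
                p.add_(m_hat / (v_hat.sqrt() + group["eps"]),
                       alpha=-group["lr"])


if __name__ == "__main__":
    # Minimal usage example.
    model = torch.nn.Linear(16, 16)
    opt = ReducedVarianceAdamW(model.parameters(), lr=1e-3)
    model(torch.randn(4, 16)).sum().backward()
    opt.step()
```

Related memory-efficient optimizers shrink the second moment in comparable ways: Adam-mini shares one second-moment value across a block of parameters and Adafactor stores a factored approximation. The scalar-per-tensor choice above is just the simplest stand-in for that family of reductions.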

Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
Time
Tuesday, 19 November 2024, 12pm-5pm EST

Location
B302-B305