Presentation
LM-Offload: Performance Model-Guided Generative Inference of Large Language Models with Parallelism Control
Description
Large language models (LLMs) have achieved remarkable success in various natural language processing tasks. However, LLM inference is highly compute- and memory-intensive, posing severe deployment challenges. Tensor offloading, combined with tensor quantization and asynchronous task execution, offers a feasible solution by using host memory to enable large-scale LLM inference with a limited number of GPUs. However, existing approaches struggle to fully utilize all available computational and memory resources because they do not consider (1) whether quantization should be applied and (2) how to manage thread-level parallelism within and across tasks. As a result, these approaches provide suboptimal solutions. In this paper, we introduce LM-Offload, a framework that addresses the above challenges by leveraging performance modeling and parallelism control. Experimental results demonstrate that LM-Offload outperforms FlexGen and ZeRO-Inference, two state-of-the-art systems for LLM inference, by up to 2.95× (2.34× on average) and 2.88× (1.57× on average), respectively, in inference throughput.
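To illustrate the kind of decision the abstract describes, the sketch below shows a minimal, hypothetical performance-model-guided search over two knobs: whether to quantize offloaded weights and how many CPU threads to devote to dequantization versus other CPU-side work. All names, constants, and the analytical cost formulas are illustrative assumptions, not LM-Offload's actual model or API.

```python
# Hypothetical sketch of performance-model-guided parallelism control for
# offloaded LLM inference. Every constant and formula here is an assumption
# for illustration; it is NOT the paper's actual performance model.
from dataclasses import dataclass
from itertools import product


@dataclass
class HardwareProfile:
    pcie_gbps: float        # host-to-GPU transfer bandwidth (GB/s)
    gpu_tflops: float       # GPU compute throughput (TFLOP/s)
    cpu_core_gflops: float  # per-core CPU throughput (GFLOP/s)
    cpu_cores: int          # total CPU cores available


def predict_token_latency(layer_bytes, layer_flops, quantize, dequant_threads, hw):
    """Toy analytical model of one offloaded transformer layer per decoded token.

    Quantization (4-bit assumed) shrinks the host-to-GPU transfer but spends
    CPU threads on dequantization; the remaining cores handle other CPU-side
    tasks. Transfer, GPU compute, and CPU work are assumed to overlap
    asynchronously, so the slowest stage dominates.
    """
    bytes_moved = layer_bytes * (0.25 if quantize else 1.0)
    transfer = bytes_moved / (hw.pcie_gbps * 1e9)
    gpu_compute = layer_flops / (hw.gpu_tflops * 1e12)
    dequant = (layer_bytes * 2 / (dequant_threads * hw.cpu_core_gflops * 1e9)
               if quantize else 0.0)
    cpu_side = layer_flops * 0.1 / (
        max(hw.cpu_cores - dequant_threads, 1) * hw.cpu_core_gflops * 1e9)
    return max(transfer + dequant, gpu_compute, cpu_side)


def choose_config(layer_bytes, layer_flops, hw):
    """Enumerate (quantize, thread-split) settings and keep the fastest one."""
    best = None
    for quantize, threads in product([False, True], range(1, hw.cpu_cores)):
        t = predict_token_latency(layer_bytes, layer_flops, quantize, threads, hw)
        if best is None or t < best[0]:
            best = (t, quantize, threads)
    return best


if __name__ == "__main__":
    hw = HardwareProfile(pcie_gbps=16, gpu_tflops=150,
                         cpu_core_gflops=50, cpu_cores=32)
    latency, quantize, threads = choose_config(layer_bytes=800e6,
                                               layer_flops=1.6e9, hw=hw)
    print(f"quantize={quantize}, dequant threads={threads}, "
          f"predicted {latency * 1e3:.2f} ms/layer")
```

In this toy setting, the enumerated configuration space is small enough for exhaustive search; the point is only that a cheap analytical model can steer the quantization and thread-allocation choices rather than fixing them statically.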

Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
Time
Tuesday, 19 November 2024, 12pm - 5pm EST
Location
B302-B305