BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250626T234541Z
LOCATION:B302-B305
DTSTART;TZID=America/New_York:20241119T120000
DTEND;TZID=America/New_York:20241119T170000
UID:submissions.supercomputing.org_SC24_sess487_post144@linklings.com
SUMMARY:LM-Offload: Performance Model-Guided Generative Inference of Large
  Language Models with Parallelism Control
DESCRIPTION:Jianbo Wu (University of California, Merced); Jie Ren (College
  of William & Mary); Shuangyan Yang (University of California, Merced); Ko
 nstantinos Parasyris, Giorgis Georgakoudis, and Ignacio Laguna (Lawrence L
 ivermore National Laboratory (LLNL)); and Dong Li (University of Californi
 a, Merced)\n\nLarge language models (LLMs) have achieved remarkable succes
 s in various natural language processing tasks. However, LLM inference is 
 highly computation- and memory-intensive, creating extreme deployment cha
 llenges. Tensor offloading, combined with tensor quantization and asynchro
 nous task execution, provides a feasible solution by utilizing host memory
  to enable large-scale LLM inference with a limited number of GPUs. Howeve
 r, existing approaches struggle to fully utilize all available computation
 al and memory resources due to a lack of consideration of (1) whether to
  use quantization and (2) how to manage thread-level parallelism within an
 d across tasks. As a result, these approaches provide suboptimal so
 lutions. In this paper, we introduce LM-Offload, a framework that addresse
 s the above challenges by leveraging performance modeling and parallelism 
 control. Experimental results demonstrate that LM-Offload outperforms Flex
 Gen and ZeRO-Inference, two state-of-the-art systems for LLM inference, by
  up to 2.95× (2.34× on average) and 2.88× (1.57× on average), respectively
 , in inference throughput.\n\nRegistration Category: Tech Program Reg Pass
 , Exhibits Reg Pass\n\nSession Chairs: Ayesha Afzal (Friedrich-Alexander U
 niversity, Erlangen-Nuremberg; Erlangen National High Performance Computin
 g Center); Sally Ellingson (University of Kentucky); and Alan Sussman (Uni
 versity of Maryland)\n\n
END:VEVENT
END:VCALENDAR
