BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250626T233532Z
LOCATION:B302-B305
DTSTART;TZID=America/New_York:20241121T100000
DTEND;TZID=America/New_York:20241121T170000
UID:submissions.supercomputing.org_SC24_sess534_post251@linklings.com
SUMMARY:Uncover the Overhead and Resource Usage for Handling KV Cache Over
 flow in LLM Inference
DESCRIPTION:Jie Ye (Illinois Institute of Technology), Bogdan Nicolae (Arg
 onne National Laboratory (ANL)), and Anthony Kougkas and Xian-He Sun (Illi
 nois Institute of Technology)\n\nLLM inference consists of two phases: a p
 refill phase and a decode phase. The prefill phase processes all input tok
 ens in parallel to generate the first token. The decode phase then generat
 es subsequent tokens one at a time until a termination token is emitted o
 r the maximum length is reached. To avoid recomputation, the Key-Value (K
 V) cache has become the standard approach for storing previously compute
 d keys and values. Throughout LLM inference, the KV cache grows linearl
 y with context length and batch size, easily exhausting the GPU memory o
 f an instance. State-of-the-art (SOTA) LLM inference systems typically us
 e recomputation or swapping to handle KV cache overflow, and both strateg
 ies introduce overhead. However, the overhead of these strategies and th
 e resource utilization over time during LLM inference have not been explo
 red. This work aims to fill this gap by quantifying the overhead of recom
 putation and swapping, and by analyzing resource utilization during LLM i
 nference to derive insights.\n\nRegistration Category: Tech Program Reg Pas
 s, Exhibits Reg Pass\n\nSession Chairs: Ayesha Afzal (Friedrich-Alexander U
 niversity, Erlangen-Nuremberg; Erlangen National High Performance Computin
 g Center); Sally Ellingson (University of Kentucky); and Alan Sussman (Un
 iversity of Maryland)\n\n
END:VEVENT
END:VCALENDAR
