Presentation
Uncover the Overhead and Resource Usage for Handling KV Cache Overflow in LLM Inference
Description
LLM inference consists of two phases: a prefill phase and a decode phase. The prefill phase processes all input tokens in parallel to generate the first output token. The decode phase then generates subsequent tokens one at a time until a termination condition is met or the maximum length is reached. To avoid recomputation, the Key-Value (KV) cache has become the standard approach for storing previously computed keys and values. Throughout LLM inference, KV cache memory grows linearly with context length and batch size, and can easily exhaust the GPU memory of an instance. State-of-the-art LLM inference systems typically handle KV cache overflow by recomputation or swapping, and both strategies introduce overhead. However, the overhead of these strategies and the resource utilization over time during LLM inference have not been explored. This work aims to fill this gap by quantifying the overhead of recomputation and swapping, and by analyzing resource utilization during LLM inference to derive insights.
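
As a rough illustration of the linear growth described above, the sketch below gives a back-of-envelope estimate of KV cache size from model dimensions. The model configuration (32 layers, 32 heads, head dimension 128, FP16) and workload are hypothetical example values chosen for illustration, not figures from this work.

```python
# Back-of-envelope KV cache size: 2 tensors (K and V) per layer,
# each of shape [batch, num_heads, seq_len, head_dim].
def kv_cache_bytes(batch_size: int, seq_len: int, num_layers: int,
                   num_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    return 2 * num_layers * batch_size * num_heads * seq_len * head_dim * bytes_per_elem

# Hypothetical 7B-class model: 32 layers, 32 heads, head_dim 128, FP16 (2 bytes/element).
gib = kv_cache_bytes(batch_size=16, seq_len=4096, num_layers=32,
                     num_heads=32, head_dim=128) / 2**30
print(f"KV cache: {gib:.1f} GiB")  # ~32 GiB for this batch/context alone,
                                   # on top of the model weights -- overflow is easy to hit.
```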

Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
Time
Tuesday, 19 November 2024, 12pm - 5pm EST
Location
B302-B305