Presentation
KVSort: Drastically Improving LLM Inference Performance via KV Cache Compression
Description
Large language model (LLM) deployment requires high inference throughput to meet the growing demand for text generation. To accelerate inference, the prefill mechanism avoids repeated computation by introducing a KV cache stored in HBM. However, the KV cache grows with the length of the input and generated text, leading to insufficient GPU memory and slow KV fetching. To address these issues, existing approaches compress the KV cache with pruning-based mechanisms that keep only the important KV vectors. Their compression ratio is limited, however, because inference accuracy must be preserved in the accuracy-compression tradeoff. To improve the compression ratio, we introduce KVSort, a novel framework that applies error-bounded lossy compression to sorted KV vectors. Our evaluation shows that KVSort achieves up to a 52x compression ratio and a 6.8x end-to-end inference performance improvement, compared to a state-of-the-art approach that achieves a 20x compression ratio and a 5.5x end-to-end inference throughput improvement.
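The abstract does not detail the algorithm, but the core idea, error-bounded lossy compression applied to sorted KV vectors, can be illustrated with a minimal sketch. The sort key (per-vector L2 norm) and the uniform quantizer standing in for an error-bounded lossy compressor are assumptions for illustration only and are not taken from the poster.

```python
import numpy as np

def kvsort_compress(kv, error_bound=1e-2):
    """Hypothetical sketch: sort KV vectors so similar vectors become adjacent,
    then apply a simple error-bounded uniform quantizer as a stand-in for an
    error-bounded lossy compressor."""
    # kv: (num_tokens, head_dim) matrix of key or value vectors.
    # Assumption: sorting rows by L2 norm groups similar vectors together,
    # which tends to make the stream easier to compress.
    order = np.argsort(np.linalg.norm(kv, axis=1))
    sorted_kv = kv[order]

    # Error-bounded uniform quantization: every reconstructed value differs
    # from the original by at most `error_bound` (absolute error bound).
    codes = np.round(sorted_kv / (2 * error_bound)).astype(np.int32)

    # The permutation must be kept so the original token order can be restored.
    return codes, order

def kvsort_decompress(codes, order, error_bound=1e-2):
    """Reconstruct KV vectors within the error bound and undo the sort."""
    sorted_kv = codes.astype(np.float32) * (2 * error_bound)
    kv = np.empty_like(sorted_kv)
    kv[order] = sorted_kv
    return kv
```

In practice, the quantized codes would be further entropy-coded, and the error bound would be chosen to keep inference accuracy within an acceptable range; this sketch only shows where sorting fits relative to the lossy compression step.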

Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
Time
Tuesday, 19 November 2024, 12pm - 5pm EST
Location
B302-B305