BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250626T233526Z
LOCATION:B302-B305
DTSTART;TZID=America/New_York:20241120T100000
DTEND;TZID=America/New_York:20241120T170000
UID:submissions.supercomputing.org_SC24_sess533_post189@linklings.com
SUMMARY:KVSort: Drastically Improving LLM Inference Performance via KV Cac
 he Compression
DESCRIPTION:Baixi Sun (Indiana University); Dingwen Tao (Institute of Comp
 uting Technology, Chinese Academy of Sciences); Xiaodong Yu (Stevens Insti
 tute of Technology); and Fengguang Song (Indiana University)\n\nLarge lang
 uage model (LLM) deployment necessitates high inference throughput due to 
 the increasing demand for text generation. To accelerate inference, the pr
 efill mechanism avoids repeated computation by introducing a KV cache (i
 n HBM). However, the KV cache size increases with the input and genera
 ted te
 xt length, causing insufficient GPU memory and slow KV fetching. To addres
 s these issues, existing approaches compress the KV cache using pruning-ba
 sed mechanisms that keep only the important KV vectors in the cache. Ho
 wever
 , their compression ratio is limited because it is necessary to preserve i
 nference accuracy in the accuracy-compression ratio tradeoff. To improve t
 he compression ratio, we introduce KVSort, a novel framework that utilizes
  error-bounded lossy compression on sorted KV vectors. The evaluation show
 s that KVSort achieves up to 52x compression ratio and 6.8x end-to-end inf
 erence performance improvement, compared to a state-of-the-art approach th
 at achieves 20x compression ratio and 5.5x end-to-end inference throug
 hput improvement.\n\nRegistration Category: Tech Program Reg Pass, Exhib
 its Reg Pass\n\nSe
 ssion Chairs: Ayesha Afzal (Friedrich-Alexander University, Erlangen-Nurem
 berg; Erlangen National High Performance Computing Center); Sally Ellingso
 n (University of Kentucky); and Alan Sussman (University of Maryland)\n\n
END:VEVENT
END:VCALENDAR
