BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250626T234542Z
LOCATION:B308
DTSTART;TZID=America/New_York:20241120T103000
DTEND;TZID=America/New_York:20241120T110000
UID:submissions.supercomputing.org_SC24_sess399_pap536@linklings.com
SUMMARY:PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined
  Speculation
DESCRIPTION:Branden Butler and Sixing Yu (Iowa State University), Arya Maz
 aheri (Technical University Darmstadt), and Ali Jannesari (Iowa State Univ
 ersity)\n\nInference of Large Language Models (LLMs) across computer clust
 ers has become a focal point of research in recent times, with many accele
 ration techniques taking inspiration from CPU speculative execution. These
  techniques reduce bottlenecks associated with memory bandwidth, but also
  increase end-to-end latency per inference run, requiring high speculation
  acceptance rates to improve performance. As a remedy, we propose PipeInfe
 r, a pipelined speculative acceleration technique to reduce inter-token la
 tency and improve system utilization for single-request scenarios, while a
 lso improving tolerance to low speculation acceptance rates and low-bandwi
 dth interconnects. PipeInfer exhibits up to a 2.15x improvement in generat
 ion speed over standard speculative inference. PipeInfer achieves its impr
 ovement through Continuous Asynchronous Speculation and Early Inference Ca
 ncellation, the former improving latency and generation speed by running s
 ingle-token inference simultaneously with several speculative runs, and th
 e latter improving speed and latency by skipping the computation of invali
 dated runs, even in the middle of inference.\n\nTag: Accelerators, Artific
 ial Intelligence/Machine Learning, Cloud Computing, Distributed Computing,
  Heterogeneous Computing, Performance Optimization\n\nRegistration Categor
 y: Tech Program Reg Pass\n\nSession Chair: Dong Li (University of Californ
 ia, Merced)\n\n
END:VEVENT
END:VCALENDAR
