Presentation
PIMnast: Balanced Data Placement for GEMV Acceleration with Processing-In-Memory
Description
With unprecedented demand for GenAI inference, acceleration of the primitives that dominate GenAI, such as GEMV, is receiving considerable attention. A challenge with GEMV is the high memory bandwidth this primitive demands. Multiple memory vendors have proposed commercially viable processing-in-memory (PIM) prototypes that attain a bandwidth boost over the processor by augmenting memory banks with compute capabilities and broadcasting the same command to all banks. While the proposed PIM designs stand to accelerate GEMV, we observe that a key impediment to harnessing PIM acceleration is deducing the optimal data placement for the matrix across memory banks. To this end, we tease out the factors that impact data placement and propose PIMnast which, like a gymnast, balances these factors to identify data placements that deliver GEMV acceleration. Across a spectrum of GenAI models, PIMnast, along with additional orchestration knobs we identify, delivers up to a 6.86x speedup for GEMVs (of the available 7x roofline speedup), which translates to up to a 5x speedup in per-token latency.
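As a rough illustration of why GEMV is bandwidth-bound and how close the reported speedup comes to the roofline, the following minimal sketch computes the arithmetic intensity of a GEMV and the fraction of the roofline attained. The function names, the 4096 x 4096 matrix size, and the 2-byte (16-bit) element width are assumptions chosen for illustration; they are not taken from the presentation and do not represent the PIMnast method itself.

```python
def gemv_arithmetic_intensity(rows, cols, bytes_per_element=2):
    """FLOPs per byte for y = A @ x, assuming each operand is moved once.

    A is rows x cols, x has `cols` elements, and y has `rows` elements.
    The element width (2 bytes) is an illustrative assumption.
    """
    flops = 2 * rows * cols  # one multiply and one add per matrix element
    bytes_moved = (rows * cols + cols + rows) * bytes_per_element
    return flops / bytes_moved


def roofline_fraction(achieved_speedup, roofline_speedup):
    """Fraction of the available (roofline) speedup that was attained."""
    return achieved_speedup / roofline_speedup


if __name__ == "__main__":
    # Hypothetical matrix shape; the matrix traffic dominates, so the
    # intensity stays below 1 FLOP per byte for 2-byte elements.
    print(gemv_arithmetic_intensity(4096, 4096))
    # Figures quoted in the abstract: 6.86x achieved of a 7x roofline.
    print(roofline_fraction(6.86, 7.0))
```

Because every matrix element is read once and used for only two floating-point operations, the intensity cannot exceed roughly one FLOP per byte at this element width, which is why added bandwidth (rather than added compute) is the lever that PIM offers for this primitive.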