
Unlocking High-Performance with Low-Bit NPUs and CPUs for Highly Optimized HPL-MxP on Cloud Brain II
Description
Mixed-precision computation is crucial for artificial intelligence and scientific computing applications. However, as novel chips with innovative architectures emerge, harnessing their computational capabilities presents significant challenges. While existing algorithms for the HPL-MxP LU factorization excel on homogeneous systems, they often encounter difficulties on specialized heterogeneous architectures. This deficiency arises from inadequate optimization of computation, memory access, and communication, which hinders effective mixed-precision acceleration. This work introduces an algorithm-hardware co-optimization approach for LU factorization on specialized NPUs and CPUs, leveraging their unique architectures. A novel multi-iteration fusion method for general matrix multiplication is introduced, strategically designed to maximize on-chip L1 buffer utilization and thereby overcome the notorious "memory wall". Additionally, a multi-stage, multi-level heterogeneous pipeline for LU factorization in an accelerator-CPU cloud environment is presented, in which compute-intensive matrix multiplications are offloaded to NPUs while CPUs handle the remaining tasks. The co-optimization approach fosters deep collaboration between CPUs and accelerators, thereby unlocking enhanced performance.
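The abstract does not give the details of the multi-iteration fusion method, but the underlying idea of maximizing on-chip buffer utilization is the classic one behind cache-blocked GEMM: keep tiles of A, B, and C resident in a small fast buffer and reuse them across many multiply-accumulate iterations before touching slow memory again. The sketch below is a generic illustration of that tiling principle in plain Python, not the paper's NPU implementation; the `tile` parameter stands in for whatever tile size the L1 buffer capacity would dictate.

```python
def tiled_matmul(A, B, tile=2):
    """Cache-blocked matrix multiply: C = A @ B.

    A is n x k, B is k x m, given as lists of lists. The three nested
    tile loops (i0, j0, p0) choose sub-blocks that are assumed to fit
    together in the fast on-chip buffer; the inner loops then reuse
    those blocks many times, amortizing the cost of loading them.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Work entirely within the current A, B, C tiles.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = C[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            s += A[i][p] * B[p][j]
                        C[i][j] = s
    return C
```

On real accelerators the same structure is expressed with explicit DMA transfers into the L1 buffer and fused accumulation across iterations, which is where the paper's co-design with the NPU architecture comes in.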
Event Type
Paper
Time
Thursday, 21 November 2024, 10:30am - 11am EST
Location
B309
Tags
Accelerators
Artificial Intelligence/Machine Learning
Codesign
State of the Practice
System Administration
Registration Categories
TP