Presentation
M3XU: Achieving High-Precision and Complex Matrix Multiplication with Low-Precision MXUs
Description
Beyond the high-profile artificial intelligence and machine learning (AI/ML) workloads, the demand for high-performance matrix operations on standard and complex floating-point numbers remains strong but underserved. However, the widely adopted low-precision matrix processing units (MXUs) serve only AI/ML workloads and sit underutilized or idle when running applications outside their target domains.
This paper presents M3XU, multi-mode matrix processing units that support IEEE 754 single-precision and complex 32-bit floating-point numbers. M3XU does not rely on more precise but costly multipliers. Instead, M3XU uses a multi-step approach that extends existing MXUs designed for AI/ML workloads. The resulting M3XU can seamlessly upgrade existing systems without programmer effort while maintaining the bandwidth demand on existing memory subsystems. This paper evaluates M3XU with full-system emulation and hardware synthesis. M3XU achieves a 3.89x speedup for 32-bit matrix multiplications and a 3.8x speedup for complex number operations compared with conventional vector processing units.
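The abstract does not spell out the multi-step scheme, so the following is only an illustrative sketch of the general idea, not M3XU's actual datapath. It assumes a bfloat16-style low-precision multiplier, splits each single-precision operand into three low-precision pieces, and accumulates the partial products in a wider format. It also shows how a complex matrix product can be built from four real products. All function names (`to_bf16`, `split`, `mul_multistep`, `complex_matmul`) are hypothetical.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Truncate to a bfloat16-like value: keep the top 16 bits of the FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def split(x, pieces=3):
    """Split an FP32 value into low-precision pieces that sum to the FP32 value."""
    parts, rest = [], to_f32(x)
    for _ in range(pieces):
        p = to_bf16(rest)   # next 8 significant bits
        parts.append(p)
        rest -= p           # residual still fits in FP32
    return parts

def mul_multistep(x, y):
    """Assumed scheme: low-precision partial products, wide accumulation."""
    return sum(a * b for a in split(x) for b in split(y))

def matmul(A, B, mul):
    """Triple-loop matrix multiply using the supplied scalar multiplier."""
    return [[sum(mul(A[i][k], B[k][j]) for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def complex_matmul(Ar, Ai, Br, Bi, mul):
    """(Ar + i*Ai)(Br + i*Bi) from four real matrix products."""
    rr, ii = matmul(Ar, Br, mul), matmul(Ai, Bi, mul)
    ri, ir = matmul(Ar, Bi, mul), matmul(Ai, Br, mul)
    real = [[p - q for p, q in zip(r1, r2)] for r1, r2 in zip(rr, ii)]
    imag = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(ri, ir)]
    return real, imag
```

In this sketch a single low-precision pass, `to_bf16(x) * to_bf16(y)`, loses most of the significand, whereas `mul_multistep(x, y)` recovers the product of the two single-precision inputs by spending several passes on the same narrow multiplier.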