
Presentation

Performance evaluation and modelling of single-precision matrix multiplication on Cerebras CS-2
Description
Although recent supercomputers continue to improve in raw computational performance, achieving performance that scales with the number of nodes is difficult due to long inter-node communication latency. Many attempts have been made to hide communication latency and maintain strong scalability, even for dense matrix multiplication, which is an ideal candidate for benchmarking supercomputer performance. The Cerebras CS-2 system is a deep-learning accelerator built around the world's largest chip, the Wafer-Scale Engine 2 (WSE-2). The WSE-2 can be viewed as a distributed-memory system with 745,500 processing elements connected in a low-latency 2D mesh topology. This paper presents the maximum performance and the weak- and strong-scaling behavior of single-precision matrix multiplication on the CS-2, and proposes a performance model for it. We observed a maximum performance of 349.0 TFlop/s (matrix size: 33,000 × 33,000) and a weak-scaling efficiency of 1.00. The mean absolute percentage error of the model was 4.7%.
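As context for the figures quoted above: a dense single-precision multiplication of two n×n matrices performs roughly 2n³ floating-point operations, and the model-accuracy figure is a mean absolute percentage error (MAPE). A minimal sketch of both calculations (the model-vs-measurement numbers below are made up for illustration, not taken from the paper):

```python
def matmul_flops(n: int) -> int:
    # Dense n x n matrix multiplication: n^2 output elements, each a
    # length-n dot product (n multiplies + n adds) -> ~2*n^3 flops.
    return 2 * n ** 3

def mape(predicted, measured) -> float:
    # Mean absolute percentage error between model predictions and
    # measured values, in percent.
    assert len(predicted) == len(measured)
    return 100.0 * sum(abs(p - m) / abs(m)
                       for p, m in zip(predicted, measured)) / len(measured)

# Flop count for the abstract's peak case, a 33,000 x 33,000 matmul:
print(f"{matmul_flops(33_000):.3e} flops")  # ~7.187e13

# Hypothetical model-vs-measurement comparison (illustrative values):
print(f"MAPE = {mape([1.05, 2.1, 2.9], [1.0, 2.0, 3.0]):.1f}%")
```

Dividing the flop count by the measured runtime gives the sustained TFlop/s rate reported in the abstract.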
Event Type
Workshop
Time
Sunday, 17 November 2024, 4:20pm - 4:30pm EST
Location
B310
Tags
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
Registration Categories
W