Presentation

Stalls and Memory Analysis on Fujitsu A64FX and NVIDIA Grace
Description
ARM-based multicore CPUs such as NVIDIA Grace and Fujitsu A64FX dominate contemporary HPC; they feature 32-256 cores with cache hierarchies and up to 1 TB/s of memory bandwidth. While benchmarks like STREAM show similar performance across these systems, diverse applications, particularly graph and nearest-neighbor workloads (e.g., stencils), reveal distinct performance profiles. Analyzing these profiles with low-level performance data can uncover system bottlenecks. We propose a template that focuses on stalls and memory accesses to identify bottlenecks efficiently by studying key CPU and memory performance events collected with Linux perf. Our approach engages all cores (144 for Grace, 48 for A64FX) and uses platform-specific compilers (ARMClang 24.04 for Grace, Fujitsu 4.10 for A64FX). By analyzing stalls and memory accesses, the method categorizes application scenarios and enables quick identification of corner cases.
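As a rough illustration of the kind of stall analysis described above, the following minimal sketch parses counter output in the CSV layout produced by `perf stat -x,` and computes the fraction of cycles spent in frontend and backend stalls. It is not the authors' template. The event names `cpu_cycles`, `stall_frontend`, and `stall_backend` are assumed here and may be exposed under different names on a given system, the helper names `parse_perf_csv` and `stall_fractions` are hypothetical, and the sample counts are made up purely for illustration.

```python
import csv
import io

# Made-up sample in the `perf stat -x,` CSV layout:
# value, unit, event name, counter run time, percent of time running, ...
# The event names below are assumptions; check `perf list` on the target machine.
SAMPLE = """\
1000000,,cpu_cycles,1000,100.00,,
150000,,stall_frontend,1000,100.00,,
600000,,stall_backend,1000,100.00,,
"""


def parse_perf_csv(text):
    """Return {event_name: count} for rows that carry a numeric count."""
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines, comments, and rows such as "<not counted>".
        if len(row) < 3 or not row[0].strip().isdigit():
            continue
        counts[row[2]] = int(row[0])
    return counts


def stall_fractions(counts):
    """Express frontend and backend stall cycles as a share of all cycles."""
    cycles = counts["cpu_cycles"]
    return {
        "frontend": counts["stall_frontend"] / cycles,
        "backend": counts["stall_backend"] / cycles,
    }


if __name__ == "__main__":
    fractions = stall_fractions(parse_perf_csv(SAMPLE))
    for kind, share in fractions.items():
        print(f"{kind} stalls: {share:.1%} of cycles")
```

In a sketch like this, a high backend share would suggest looking at memory-access events next, while a high frontend share would point toward instruction fetch; the actual events and thresholds used in the poster are not stated in this description.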
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
Time
Tuesday, 19 November 2024, 12pm - 5pm EST
Location
B302-B305
Registration Categories
TP
XO/EX