Close

Presentation

Algorithmic and Optimization Techniques for Graph Applications in Heterogeneous Systems at Scale
DescriptionAs heterogeneity becomes commonplace in HPC systems, algorithmic and optimization techniques are needed to address the challenges that come with it, especially for irregular applications. This includes workload balancing, scheduling, latency tolerance, and memory utilization and contention, among others. This showcase covers three works addressing key questions in running complex irregular graph applications on heterogeneous systems: programmability, performance portability, memory efficiency, load balancing, and scalability.

The first work explores the efficacy of utilizing commercial high-level synthesis tools to accelerate two different graph sampling methods on FPGAs. We achieve up to a 40x speedup compared to the baseline OpenCL kernel, and identify key areas for toolchain improvements, such as memory subsystems and latency tolerance.

The second work focuses on improving breadth-first probabilistic traversals (BPTs), as they dominate runtime in some applications. By identifying and exploiting redundancies in edge accesses, we achieve an average of 75x and 135x speedups when deployed on two different frameworks. We also demonstrate strong scaling up to 4,096 nodes on OLCF Frontier enabled by CPU-GPU heterogeneous workload balancing.

The third work is currently in progress, exploring the use of lossy compression to enable training on graph neural networks. We have promising preliminary results, showing a compression ratio of between 6x-20x with minimal accuracy loss on both GCN and GAT. We identify future directions and use cases for this method with an emphasis on systems integration such as larger batch sizes in mini-batch training, compressing feature vector caches, and adaptive compression methods for heterogeneous and dynamic GNNs.