Search Program
Organizations
Contributors
Presentations
Art of HPC
Posters
TP
W
TUT
XO/EX
Description: The recorded visualization was built with JavaScript using the D3 and Anime.js libraries. Historical run data from the Kestrel supercomputer was queried using SQL from NREL's internal sys admin database and bundled into a JSON file for use by the JavaScript code. The JSON file was organized by minute-long time-steps, each signifying the state of the jobs in a particular case on the supercomputer at a particular point in time.
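As an illustration only, the sketch below shows one way such minute-resolution time-step records could be structured and consumed; the field names are assumptions, not the schema of the actual artifact.

```python
# Hypothetical layout of minute-long time-step records; the actual JSON
# schema used by the visualization is not reproduced here.
timesteps = [
    {
        "timestamp": "2024-03-01T12:00:00Z",
        "jobs": [
            {"job_id": "1234567", "nodes": 64, "state": "RUNNING"},
            {"job_id": "1234568", "nodes": 8, "state": "PENDING"},
        ],
    },
    {
        "timestamp": "2024-03-01T12:01:00Z",
        "jobs": [
            {"job_id": "1234567", "nodes": 64, "state": "RUNNING"},
            {"job_id": "1234568", "nodes": 8, "state": "RUNNING"},
        ],
    },
]

# An animation would advance one record per frame, drawing each job's state.
for step in timesteps:
    running = sum(j["nodes"] for j in step["jobs"] if j["state"] == "RUNNING")
    print(step["timestamp"], "nodes in use:", running)
```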
The video itself is a screen recording of the visualization being run on a MacBook Pro with an Intel i9 2.3 GHz 8-core processor and an AMD Radeon Pro 5500 graphics card. The screen recording was captured with the default QuickTime software and edited for length using Adobe Premiere Pro.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
Description: In today’s HPC landscape and especially in tomorrow’s even more complex systems, performance optimization and portability are critical for maximizing computational efficiency, minimizing energy consumption, and ensuring that applications can seamlessly adapt to rapidly evolving heterogeneous architectures. This talk will discuss challenges and solutions for performance optimization and portability of applications in modern HPC systems featuring increasingly heterogeneous architectures. Drawing from recent experiences in optimizing legacy applications, new simulation frameworks, and complex data analysis pipelines, we will examine approaches to effectively leveraging multiple levels of parallelism—both within nodes and across nodes—while maintaining performance portability. Topics will include scheduling libraries and autotuning, scalable domain decomposition, and runtime scheduling of workflows integrating AI, data management, and simulations. The discussion will conclude with recommendations for exploiting the multilevel parallelism and heterogeneity of next-generation accelerated HPC systems.
Art of HPC
Posters
TP
W
TUT
XO/EX
Description: This artifact was created with Pandas, Matplotlib, and NetworkX. Data was gathered from the Kestrel cluster at the National Renewable Energy Laboratory with Slurm via the sacct and sinfo commands.
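For readers who want to reproduce a similar pipeline, a minimal sketch of pulling Slurm accounting data into Pandas is shown below; the sacct field list is an assumption rather than the exact query used for the artifact.

```python
import io
import subprocess
import pandas as pd

# Pull recent job records from Slurm's accounting database; the field list
# and time window here are illustrative assumptions.
fields = "JobID,Partition,AllocNodes,Elapsed,State"
raw = subprocess.run(
    ["sacct", "--allusers", "--parsable2", "--noheader",
     f"--format={fields}", "--starttime=2024-01-01"],
    capture_output=True, text=True, check=True,
).stdout

jobs = pd.read_csv(io.StringIO(raw), sep="|", names=fields.split(","))
print(jobs.groupby("Partition")["AllocNodes"].sum())
```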
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
Description: Edge computing, the notion of moving computational tasks from servers to the data-generating network edge, is an increasingly popular model for data processing. 5G wireless technologies offer an opportunity to enable complex distributed edge computing workflows by minimizing the overhead incurred in transmitting data to peer devices. In this work, we demonstrate the use and performance of edge devices in distributed computation workloads using Hadoop MapReduce on a cluster of six 5G-connected Raspberry Pis. Specifically, we first determine the network capabilities (i.e., latency and throughput) across millimeter wave (mmWave) 5G links and then analyze the scalability and performance of our cluster. Our experiment uses 5G radios at the Agricultural and Rural (ARA) Wireless Living Lab, spanning over six miles in diameter.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
Task Parallelism
W
Description: One of the most mathematically challenging tasks within the aerospace industry is the design of the aircraft aerodynamics. Indeed, the various aerodynamic physical models involve non-linear partial differential equations of all types: elliptic, parabolic and hyperbolic. These are solved using Computational Fluid Dynamics, and require High-Performance Computing when solving for millions or billions of unknowns. The presentation will cover advances in the fidelity of aerodynamic models and applications to aerothermodynamic models towards ice accretion. In particular, it will highlight the use of the Chapel language in the main holistic solver within Prof. Laurendeau’s aerodynamic laboratory.
Workshop
Codesign
Data Movement and Memory
Facilities
W
Description: Understanding the performance potential and data placement challenges in Non-Uniform Memory Access (NUMA) architectures is crucial for optimizing High-Performance Computing (HPC) systems. We will present a quantitative approach, using simulations and models, that provides essential insights into how system architecture impacts microbenchmarks and real-world applications. We model a NUMA architecture with ARMv8 Neoverse V1 processors, leveraging the gem5 and VPSim simulation platforms. Combining these tools enables us to optimize simulation speed during early-stage exploration while preserving the accuracy necessary to evaluate design performance in later stages. We will present case studies that examine the performance implications of different NUMA node configurations, SLC (System Level Cache) group assignments, and Network-on-Chip (NoC) settings. These case studies reveal critical design trade-offs, offering valuable input for the co-design process, where HPC SoC architects and system integrators collaborate. This work is conducted within the European Processor Initiative (EPI) framework, focusing on developing new, energy-efficient hardware architectures for future exascale systems.
Workshop
Artificial Intelligence/Machine Learning
W
Description: Training large language models is becoming increasingly complex due to the rapid expansion in their size, resulting in significant computational costs. To address this challenge, various model growth methodologies have been proposed to leverage smaller pre-trained models to incrementally build larger models and reduce computational requirements. These methods typically involve mapping parameters from small models to large ones using either static functions or learned mappings. Although these approaches have demonstrated effectiveness, there is a lack of comprehensive comparative evaluations in the literature. Additionally, combining different methodologies could potentially yield superior performance. This study provides a uniform evaluation of multiple state-of-the-art model growth techniques and their combinations, revealing that efficient combination techniques can reduce the training cost (in TFLOPs) of individual methods by up to 80%.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
Description: We are designing an automatic ticket-answering service for computing centers such as the Texas Advanced Computing Center (TACC), the National Center for Supercomputing Applications (NCSA), and the San Diego Supercomputer Center (SDSC). In this work, we investigate the capability and feasibility of open-source large language models (LLMs) for the ticket-answering task. We compare four open-source LLMs (OPT-6.7B, Falcon-7B, Llama 2-7B, and Llama 3.1-8B) by fine-tuning them on a curated dataset of over 110,000 historical question/answer pairs. Our results show that fine-tuned LLMs are capable of generating reasonable answers. Llama-7B has a lower validation loss and perplexity than OPT-6.7B and Falcon-7B. We also observe that fine-tuning with LoRA introduces non-trivial generalization loss compared with dense fine-tuning. We will design an evaluation dataset and perform a quantitative evaluation of these LLMs in the future.
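As a rough illustration of the LoRA fine-tuning setup described above, using the Hugging Face PEFT library with a small stand-in model rather than the models and data from the poster:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Model name and LoRA hyperparameters are illustrative assumptions.
model_name = "facebook/opt-125m"   # small stand-in for the 7B-class models
base = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only the low-rank adapters are trained
```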
Paper
Algorithms
Artificial Intelligence/Machine Learning
Graph Algorithms
Linear Algebra
TP
Description: Exploiting matrix symmetry to halve the memory footprint offers an opportunity for accelerating memory-bound computations like Sparse Matrix-Vector Multiplication (SpMV). However, symmetric SpMV incurs data conflicts when concurrently writing the output vector. Previous approaches fail to address this issue efficiently. This paper proposes DCS-SpMV, a Divide-and-Conquer (DC) algorithm for efficient Symmetric SpMV. The key idea is to recursively divide the matrix-induced conflict graph into independent subgraphs for parallel execution, and to construct separate subgraphs to avoid data conflicts. Our DC algorithm transforms the input matrix into a low-conflict part and a high-conflict part, which motivates us to design a conflict-aware hybrid solution that executes these two parts using DCS-SpMV and traditional SpMV respectively.
We develop a machine learning model to predict an optimal hybrid implementation for a given matrix and architecture. We evaluate our work on both X86 and ARM CPUs, demonstrating significant performance improvement over the state-of-the-art.
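A toy sketch of why symmetric SpMV is attractive yet conflict-prone: storing only the upper triangle halves memory, but the mirrored update writes into rows owned by other threads. This is illustrative only and is not the DCS-SpMV algorithm itself.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

# Symmetric SpMV over the stored upper triangle. In a parallel setting, the
# "y[j] += ..." update below is the source of the write conflicts that
# DCS-SpMV resolves by partitioning the conflict graph.
rng = np.random.default_rng(0)
A = sparse_random(6, 6, density=0.4, random_state=0)
A = csr_matrix(np.triu((A + A.T).toarray()))   # keep only the upper triangle
x = rng.standard_normal(6)
y = np.zeros(6)

for i in range(A.shape[0]):
    for idx in range(A.indptr[i], A.indptr[i + 1]):
        j, a = A.indices[idx], A.data[idx]
        y[i] += a * x[j]            # row-local update, conflict-free
        if i != j:
            y[j] += a * x[i]        # symmetric update, conflicts across rows

print(y)
```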
Paper
Accelerators
Energy Efficiency
Facilities
Resource Management
State of the Practice
TP
Description: We present ExaDigiT, an open-source framework for developing comprehensive digital twins of liquid-cooled supercomputers. It integrates three main modules: (1) a resource allocator and power simulator, (2) a transient thermo-fluidic cooling model, and (3) an augmented reality model of the supercomputer and central energy plant. The framework enables the study of "what-if" scenarios, system optimizations, and virtual prototyping of future systems. Using Frontier as a case study, we demonstrate the framework's capabilities by replaying six months of system telemetry for systematic verification and validation. Such a comprehensive analysis of a liquid-cooled exascale supercomputer is the first of its kind. ExaDigiT elucidates complex transient cooling system dynamics, runs synthetic or real workloads, and predicts energy losses due to rectification and voltage conversion. Throughout our paper, we present lessons learned to benefit HPC practitioners developing similar digital twins. We envision the digital twin will be a key enabler for sustainable, energy-efficient supercomputing.
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
Description: As new compute systems are developed, there is still a need to compile Fortran for execution on leading edge systems.
In order to achieve this, compilers are continuously under development.
Though the Fortran specification is extensive, it is helpful to prioritize the features required by the applications of interest, so that those applications can be executed as soon as possible.
Identifying key features is largely done by querying software experts, who then manually report on which key features are present.
This is both time-consuming and error-prone.
To automate this, we present a compiler plugin to Flang that operates on a program's parse tree representation and detects key features.
We show the result of our tool on four applications.
We show the discrepancies between our tool and the manual characterization of three of the applications, as well as generate a characterization for an application not yet profiled.
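A heavily simplified stand-in for the idea is sketched below; the real tool walks Flang's parse tree, whereas this sketch only pattern-matches raw source text for a few hypothetical "key features".

```python
import re

# Grossly simplified feature detection over Fortran source text. The actual
# plugin operates on the parse tree; these regexes are illustrative only.
FEATURES = {
    "coarrays":          re.compile(r"\[\s*\*\s*\]|\bsync\s+all\b", re.I),
    "do_concurrent":     re.compile(r"\bdo\s+concurrent\b", re.I),
    "derived_types":     re.compile(r"^\s*type\s*(::|\s+\w)", re.I | re.M),
    "openmp_directives": re.compile(r"^\s*!\$omp", re.I | re.M),
}

def characterize(source: str) -> dict:
    return {name: bool(rx.search(source)) for name, rx in FEATURES.items()}

example = """
program demo
  !$omp parallel do
  do concurrent (i = 1:n)
    a(i) = b(i) + c(i)
  end do
end program demo
"""
print(characterize(example))
```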
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
Description: There are significant differences between emerging AI and data analytics workloads and traditional HPC workloads with regard to storage and programming frameworks. We extend DAOS with a queryable global shared low-latency/high-bandwidth cache and a resilient runtime that intercepts calls to popular analytics frameworks and offloads them to worker processes running on the HPC system. The result is a solution that offers bandwidth and latency benefits over vanilla DAOS and that enables ordinary programmers to interactively use popular programming frameworks like Python to solve huge problems on HPC systems without stranding resources.
Workshop
State of the Practice
System Administration
W
Description: Accurate wait time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the accuracy needed for confident predictions, and many were developed before the rise of deep learning.
In this work, we investigate and develop a neural network-based model to accurately predict wait times for jobs submitted to the Anvil HPC cluster. Data was taken from the Slurm Workload Manager on the cluster and transformed before performing additional feature engineering from jobs’ priorities, partitions, and states. We developed a hierarchical model that classifies job queue times into bins before applying regression, outperforming traditional methods. The model was then integrated into a CLI tool for queue time prediction. This study explores which queue time prediction methods are most applicable for modern HPC systems and shows that deep learning-based prediction models are viable solutions.
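A minimal sketch of the bin-then-regress structure on synthetic data is shown below; the actual features, bin edges, and models used for Anvil are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor

# Synthetic stand-in data: engineered job features and wait times in hours.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 5))
wait = np.exp(X[:, 0] + rng.normal(0, 0.5, 2000))
bins = np.digitize(wait, [0.5, 2.0, 8.0])      # short / medium / long / very long

# Hierarchical model: classify a job into a wait-time bin, then regress within it.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, bins)
regressors = {b: GradientBoostingRegressor().fit(X[bins == b], wait[bins == b])
              for b in np.unique(bins)}

def predict_wait(x):
    b = clf.predict(x.reshape(1, -1))[0]                 # pick a bin first...
    return regressors[b].predict(x.reshape(1, -1))[0]    # ...then regress within it

print(predict_wait(X[0]))
```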
Paper
Algorithms
Data Movement and Memory
I/O, Storage, Archive
Performance Optimization
Scientific and Information Visualization
Visualization
TP
Description: Multi-resolution methods such as Adaptive Mesh Refinement (AMR) can enhance storage efficiency for HPC applications generating vast volumes of data. However, their applicability is limited; they cannot be universally deployed across all applications. Furthermore, integrating lossy compression with multi-resolution techniques to further boost storage efficiency encounters significant barriers. To this end, we introduce an innovative workflow that facilitates high-quality multi-resolution data compression for both uniform and AMR simulations. Initially, to extend the usability of multi-resolution techniques, our workflow employs a compression-oriented Region of Interest (ROI) extraction method, transforming uniform data into a multi-resolution format. Subsequently, to bridge the gap between multi-resolution techniques and lossy compressors, we optimize three distinct compressors, ensuring their optimal performance on multi-resolution data. Lastly, we incorporate an advanced uncertainty visualization method into our workflow to understand the potential impacts of lossy compression. Experimental evaluation demonstrates that our workflow achieves significant compression quality improvements.
Workshop
Performance Optimization
Programming Frameworks and System Software
W
Description: There has been a healthy growth of heterogeneous programming models that cover different paradigms in the HPC space.
Selecting an appropriate programming model for new projects is challenging: how does one select a model that is both productive and performant?
The same applies for existing projects aiming to leverage heterogeneous offload capabilities.
While characterisation of programming model performance has been abundant and comprehensive, productivity metrics are often reduced to basic measures like Source Line of Code (SLOC).
This study introduces a novel model divergence measure to objectively evaluate productivity.
We cover common aspects of productivity, including syntax, semantics, and optimisation overhead.
We present a productivity analysis framework supporting GCC and Clang, covering models for C/C++ and Fortran.
We evaluate our metric using this framework on mini-apps from SPEChpc and other established mini-apps, and propose a combined productivity and performance probability visualisation for a comprehensive picture.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Description: Microservices architecture is a promising approach for developing reusable scientific workflow capabilities for integrating diverse resources, such as experimental and observational instruments and advanced computational and data management systems, across many distributed organizations and facilities.
In this paper, we describe how the INTERSECT Open Architecture leverages federated systems of microservices to construct interconnected science ecosystems, review how the INTERSECT software development kit eases microservice capability development, and demonstrate the use of such capabilities for deploying an example multi-facility INTERSECT ecosystem.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
Description: Federated learning is a privacy-preserving machine learning approach. It allows numerous geographically distributed clients to collaboratively train a large model while maintaining local data privacy. In heterogeneous device settings, limited network bandwidth is a major bottleneck that constrains system performance. In this work, we propose a novel gradient compression method for federated learning that aims to achieve communication efficiency and a low error floor by estimating the prototype of gradients on both the server and client sides and sending only the difference between the real gradient and the estimated prototype. This approach further reduces the total bits required for model updates. Additionally, the memory requirement will be lighter on the client side but heavier on the server side compared to traditional error feedback methods. Experiments on training neural networks show that our method is more communication-efficient with little impact on training and test accuracy.
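A toy sketch of the prototype-difference idea follows; the quantizer and the prototype-update rule are invented for illustration and are not the poster's method.

```python
import numpy as np

# Both sides keep a running "prototype" of the gradient; only the quantized
# difference from it is transmitted, which needs far fewer bits than the
# full-precision gradient.
def compress(grad, prototype, levels=16):
    diff = grad - prototype
    scale = np.abs(diff).max() / (levels // 2) + 1e-12
    q = np.round(diff / scale).astype(np.int8)     # small-integer payload
    return q, scale

def decompress(q, scale, prototype):
    return prototype + q.astype(np.float32) * scale

rng = np.random.default_rng(0)
prototype = np.zeros(1000, dtype=np.float32)
for step in range(5):
    grad = rng.standard_normal(1000).astype(np.float32) * 0.1 + prototype
    q, scale = compress(grad, prototype)           # client sends q and scale only
    recovered = decompress(q, scale, prototype)    # server reconstructs the gradient
    prototype = 0.9 * prototype + 0.1 * recovered  # both sides update the prototype
    print(step, float(np.abs(grad - recovered).mean()))
```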
Workshop
Algorithms
Heterogeneous Computing
W
Description: Domain scientists in the field of computational science often face challenges in developing optimized code for high-performance computing, especially GPUs. Considering the increasing heterogeneity within the nodes of HPC facilities, there is a demand for performance portable solutions for the core computation kernels in a scientific library. We demonstrate a performance portable multi-GPU solution for an implementation of the Euler equations using ProtoX and IRIS. ProtoX is a domain-specific language that uses a partial differential equation library called Proto as its front end and SPIRAL, a code generation system, as its back end to generate optimized kernels for different architectures. These kernels are orchestrated through the intelligent runtime system IRIS to provide portability. Two levels of optimization within IRIS, namely DAG and task fusion, are explored to efficiently utilize computing resources in a multi-GPU environment. Performance improvement through these optimizations is showcased on AMD and NVIDIA GPUs.
ACM Gordon Bell Climate Modeling Finalist
TP
Description: Ocean general circulation models (OGCMs) are indispensable for studying multi-scale oceanic processes and climate change. High-resolution ocean simulations require immense computational power and thus become a challenge in climate science. We present LICOMK++, a performance-portable OGCM using Kokkos, to facilitate global kilometer-scale ocean simulations. The breakthroughs include:
(1) We enhance cutting-edge Kokkos with the Sunway architecture, enabling LICOMK++ to become the first performance-portable OGCM on diversified architectures, i.e., Sunway processors, CUDA/HIP-based GPUs, and ARM CPUs.
(2) LICOMK++ overcomes the one simulated-years-per-day (SYPD) performance challenge for global realistic OGCM at 1-km resolution. It records 1.05 and 1.70 SYPD with a parallel efficiency of 54.8% and 55.6% scaling on almost the entire new Sunway supercomputer and two-thirds of the ORISE supercomputer.
(3) LICOMK++ is the first global 1-km-resolution realistic OGCM to generate scientific results. It successfully reproduces mesoscale and submesoscale structures that have considerable climate effects.
Paper
Accelerators
Artificial Intelligence/Machine Learning
Codesign
State of the Practice
System Administration
TP
Description: Modern scientific software in high performance computing is often complex, and many parallel applications and libraries depend on several other software or libraries. Developers and users of such complex software often use package managers for building them. Package managers depend on humans to codify package constraints, and the dependency graph of a software package can often become large. In this paper, we propose a methodology that uses historical build results to assist a package manager in selecting the best versions of package dependencies with an aim to improve the likelihood of a successful build. We train a machine learning (ML) model to predict the probability of build outcomes of different configurations of packages in the Spack package manager. When evaluated on common scientific software stacks, this ML model-based approach is able to achieve a 13% higher success rate in building packages than the default version selection mechanism in Spack.
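A small sketch of the underlying idea, training a classifier on invented historical build outcomes; this is not Spack's actual data schema or the paper's model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented build history: which dependency versions and compilers built
# successfully in the past. Column names and rows are illustrative only.
history = pd.DataFrame({
    "package":   ["hdf5", "hdf5", "petsc", "petsc", "hdf5", "petsc"],
    "dep_mpi":   ["openmpi@4.1", "mpich@4.0", "openmpi@4.1", "mpich@3.4", "mpich@3.4", "openmpi@5.0"],
    "compiler":  ["gcc@12", "gcc@11", "clang@15", "gcc@12", "gcc@12", "gcc@11"],
    "succeeded": [1, 1, 0, 1, 0, 1],
})
X = pd.get_dummies(history[["package", "dep_mpi", "compiler"]])
y = history["succeeded"]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# A version-selection step could rank candidate configurations by predicted
# probability of a successful build.
print(model.predict_proba(X)[:, 1])
```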
Paper
Accelerators
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Distributed Computing
Graph Algorithms
Heterogeneous Computing
Tensors
TP
Best Student Paper Finalist
Description: FIRAL is a recently proposed deterministic active learning algorithm for multiclass classification using logistic regression. It was shown to outperform the state-of-the-art in terms of accuracy and robustness and comes with theoretical performance guarantees. However, its scalability suffers when dealing with datasets featuring a large number of points $n$, dimensions $d$, and classes $c$, due to its $\mathcal{O}(c^2d^2+nc^2d)$ storage and $\mathcal{O}(c^3(nd^2 + bd^3 + bn))$ computational complexity, where $b$ is the number of points to select. To address these challenges, we propose an approximate algorithm with storage requirements reduced to $\mathcal{O}(n(d+c) + cd^2)$ and a computational complexity of $\mathcal{O}(bncd^2)$. Additionally, we present a parallel implementation on GPUs. We demonstrate the accuracy and scalability of our approach using MNIST, CIFAR-10, Caltech101, and ImageNet. The accuracy tests reveal no deterioration compared to FIRAL. We report strong and weak scaling tests on up to 12 GPUs for a three-million-point synthetic dataset.
Workshop
Artificial Intelligence/Machine Learning
W
Description: AI-based foundation models such as FourCastNet and GraphCast are revolutionizing weather and climate predictions but are not yet ready for operational use. Their limitation lies in the absence of a data assimilation system to incorporate real-time Earth system observations, crucial for accurately forecasting events like tropical cyclones. To overcome these obstacles, we introduce a generic real-time data assimilation framework and demonstrate its end-to-end performance on the Frontier supercomputer. This framework comprises two primary modules: an ensemble score filter (EnSF), which significantly outperforms the state-of-the-art data assimilation method, and a vision transformer-based surrogate capable of real-time adaptation through the integration of observational data. We demonstrate both the strong and weak scaling of our framework up to 1024 GPUs on the exascale supercomputer Frontier. Our results not only illustrate the framework's exceptional scalability on high-performance computing systems, but also demonstrate the importance of supercomputers in real-time data assimilation for weather and climate predictions.
Workshop
Applications and Application Frameworks
Algorithms
Performance Evaluation and/or Optimization Tools
W
Description: Generative artificial intelligence extends beyond its success in image/text synthesis, proving itself a powerful uncertainty quantification (UQ) technique through its capability to sample from complex high-dimensional probability distributions. However, existing methods often require a complicated training process, which greatly hinders their applications to real-world UQ problems. To alleviate this challenge, we developed a scalable, training-free score-based diffusion model for high-dimensional sampling. We incorporate a parallel-in-time method into our diffusion model to use a large number of GPUs to solve the backward stochastic differential equation and generate new samples of the target distribution. Moreover, we also distribute the computation of the large matrix subtraction used by the training-free score estimator onto multiple GPUs available across all nodes. We showcase the remarkable strong and weak scaling capabilities of the proposed method on the Frontier supercomputer, as well as its uncertainty reduction capability in hurricane predictions when coupled with AI-based foundation models.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Description: Scientific workflows and provenance are two sides of the same coin. While the former addresses the coordinated execution of multiple tasks over a set of computational resources, the latter relates to the historical record of data from its original sources. This paper highlights the importance of tracking multi-level provenance metadata in complex, AI-based scientific workflows as a way to (i) foster and (ii) expand documentation of experiments, (iii) enable reproducibility, (iv) address interpretability of the results, (v) facilitate performance bottleneck diagnosis, and (vi) advance provenance exploration and analysis opportunities.
Workshop
Architecture
Embedded and/or Reconfigurable Systems
Performance Optimization
Resource Management
W
Description: Quantum computers are making their way into High Performance Computing centers in the form of accelerators. Due to their physical implementation as mostly large appliances in separate racks, their number in typical data centers is significantly lower than the number of nodes offloading work to them, unlike the case with GPU accelerators. As a consequence, they form large-scale disaggregated infrastructures that pose a number of integration challenges due to their diverse implementation technologies and their need to be used as a shared resource for optimal utilization. Running hybrid High Performance Computing-Quantum Computing (HPCQC) applications in HPC environments, where the quantum portion is offloaded to the quantum processing units, requires sophisticated resource management strategies to optimize resource utilization and performance. In this paper, we present the Munich Quantum Software Stack (MQSS), a Just-In-Time (JIT) compilation and execution software stack tailored for integrating disaggregated quantum accelerators into traditional HPC workflows.
Posters
TP
Description: Knowledge graph (KG) learning offers a powerful framework for generating new knowledge and making inferences. Training KG embeddings can take a significant amount of time, especially for larger datasets. Our analysis shows that the gradient computation of embedding and vector normalization are the dominant functions in the KG embedding training loop. We address this issue by replacing the core embedding computation with SpMM (Sparse-Dense Matrix Multiplication) kernels. This allows us to unify multiple scatter (and gather) operations as a single operation, reducing training time and memory usage. Applying this sparse approach in training the TransE model results in up to 5.7x speedup on the CPU and up to 1.7x speedup on the GPU. Distributing this algorithm on 64 GPUs, we observe up to 3.9x overall speedup in each epoch. Our proposed sparse approach can also be extended to accelerate other translation-based models such as TransR and TransH.
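A small sketch showing how per-triple scatter updates can be collapsed into a single sparse-dense matrix multiplication (toy sizes, NumPy/SciPy only; the poster's kernels use optimized SpMM libraries on CPU and GPU).

```python
import numpy as np
from scipy.sparse import csr_matrix

# The scatter accumulation in embedding training can be expressed as one SpMM:
# an incidence matrix (entities x triples) times the per-triple gradients.
num_entities, num_triples, dim = 5, 8, 4
rng = np.random.default_rng(0)
heads = rng.integers(0, num_entities, num_triples)
triple_grads = rng.standard_normal((num_triples, dim))

# Scatter-style accumulation: many small updates.
scatter = np.zeros((num_entities, dim))
for t, h in enumerate(heads):
    scatter[h] += triple_grads[t]

# Same result as a single sparse-dense multiplication.
incidence = csr_matrix((np.ones(num_triples), (heads, np.arange(num_triples))),
                       shape=(num_entities, num_triples))
spmm = incidence @ triple_grads

print(np.allclose(scatter, spmm))   # True
```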
Paper
Algorithms
Artificial Intelligence/Machine Learning
Graph Algorithms
Linear Algebra
TP
Description: Multiplying two sparse matrices (SpGEMM) is a common computational primitive used in many areas including graph algorithms, bioinformatics, algebraic multigrid solvers, and randomized sketching. Distributed-memory parallel algorithms for SpGEMM have mainly focused on sparsity-oblivious approaches that use 2D and 3D partitioning. Sparsity-aware 1D algorithms can theoretically reduce communication by not fetching nonzeros of the sparse matrices that do not participate in the multiplication.
Here, we present a distributed-memory 1D SpGEMM algorithm and implementation. It uses MPI RDMA operations to mitigate the cost of packing/unpacking submatrices for communication, and it uses a block fetching strategy to avoid excessive fine-grained messaging. Our results show that our 1D implementation outperforms state-of-the-art 2D and 3D implementations within CombBLAS for many configurations, inputs, and use cases, while remaining conceptually simpler.
Invited Talk
TP
Description: Performance. Power. Startup. Sculpture. Music. This might seem like a disparate set of topics to describe one person's research in HPC. PowerPack. The Green500. grano.la. SeeMore. The CSGenome. Do these artifacts help? Maybe not. If life is a journey, so is the story of my research. I can assure you that all the research topics and artifacts I've studied and created along the way have common roots in HPC's sustainability. Another commonality is that early on, some in our community deemed these topics or artifacts a waste of time and resources, a non-problem, soon to be made irrelevant, or further evidence that this researcher is losing his grip on reality. Impact. Impact. Impact. In this talk, I will share key research findings and outcomes that have proven the naysayers wrong over time. I will also describe the inspiration and genesis of the work and the connections among these seemingly incongruous research projects. The goal of every endeavor so far — and the journey is far from over — has been to create sustained change in the way HPC computes and to ensure broad audiences understand the importance of what we do and how it connects to their everyday lives. And, perhaps surprisingly to some, I owe much of the success to the arts.
Workshop
A Study of a Deterministic Networking Framework for Latency Critical Large Scientific Data Transfers
Architecture
Data-Intensive
Network
Performance Optimization
System Administration
W
Description: Scientific workflows often involve large data transfers, which increasingly require completion-time guarantees. To support these time-sensitive flows, the Energy Sciences Network (ESnet) has implemented on-demand circuits with packet priority, allowing the circuit to be utilized by other traffic when the deadline-sensitive flow is inactive. We explore a deterministic networking framework designed to support large scientific data transfers with completion guarantees. We consider an ideal network where all nodes are time-synchronized and utilize Cyclic Queueing and Forwarding (CQF) to achieve reliable low-latency data transfers. Our results show that the deterministic network architecture achieves performance comparable to the dynamic bandwidth reservation scheme. We believe that a more optimized version of the time-sensitive networking protocol that exploits multi-path routing could offer better completion guarantees than traditional network reservation options while improving overall network bandwidth utilization.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
Description: Programs like Girls Who Code (GWC) are pivotal in working to inspire and equip young women with the skills and confidence needed to pursue careers in computing. Understanding the impact of such initiatives is particularly important for addressing the decline in interest among girls aged 13 to 17, a critical period for career decision-making. By evaluating the effectiveness of employing a GWC club at our university, this research aims to uncover strategies that can successfully attract and retain women in computer science (CS) in our region. The goal is to not only reverse the trend of declining female participation but also to sustain their interest in the field of computing.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
Description: Persistent Memory (PM) is a promising next-generation storage device, combining features of both volatile memory (like DRAM) and non-volatile memory (like SSDs). Many studies use PM to optimize training to advance deep learning technology. However, these studies have not addressed the issue of multiple copies of training data during deep learning, leading to reduced training efficiency. In this study, we first analyze the characteristics of PM and mainstream file systems. We then explore PM's byte addressability to manage metadata and data efficiently. This approach minimizes multiple I/O operations of tasks involving repeated read-write data accesses, such as machine learning datasets, enabling zero-copy data handling and significant speedups of read-and-write operations.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
Description: Automatic code optimization is a complex process that typically involves the application of multiple discrete algorithms that modify the program structure irreversibly. However, the design of these algorithms is often monolithic, and they require repetitive implementation to perform similar analyses due to the lack of cooperation. To address this issue, modern optimization techniques, such as equality saturation, allow for exhaustive term rewriting at various levels of inputs, thereby simplifying compiler design.
In this paper, we propose equality saturation to optimize sequential codes utilized in directive-based programming for GPUs. Our approach simultaneously realizes less computation, less memory access, and high memory throughput. Our fully-automated framework constructs single-assignment forms from inputs to be entirely rewritten while keeping dependencies and extracts optimal cases. Through practical benchmarks, we demonstrate a significant performance improvement on several compilers. Furthermore, we highlight the advantages of computational reordering and emphasize the significance of memory-access order for modern GPUs.
Paper
Accelerators
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Modeling and Simulation
Numerical Methods
TP
Description: Simulating emerging resistive switching memory devices, such as memristors, requires modeling frameworks that can treat the motion of point defects across nanoscale domains. Field-driven Kinetic Monte Carlo (d-KMC) methods that simulate the discrete structural evolution of atomic coordinates in the presence of external potential and heat fields can be used for this purpose. While physically similar to conventional KMC methods, field-driven approaches present different computational motifs and introduce global communication. Here, we develop the first scalable d-KMC code for resistive memory arrays at atomistic resolution. We accelerate this latency-sensitive simulation on the GPU partition of the LUMI Supercomputer, exploiting the high-speed interconnects between GPUs on the same node. Applied to the technologically relevant HfOx material stack, our code enables the first atomistic simulation of 3x3 arrays of resistive switching memory cells with more than 1 million atoms, matching the dimensions of fabricated structures.
Tutorial
Accelerators
Emerging Technologies
Numerical Methods
Parallel Programming Methods, Models, Languages and Environments
Quantum Computing
TUT
Description: GPU-accelerated quantum simulations are increasingly being adopted in hybrid quantum-classical algorithm development to speed up algorithm run-time, to test and implement future parallel QPU workflows, to scale up the size of quantum research, and to deploy workflows where QPUs and GPUs are tightly coupled. This tutorial guides attendees through examples simulated on their laptops to GPUs on NVIDIA Quantum Cloud. We then focus on running industry-relevant quantum research problems on HPC systems. The tutorial begins with an interactive Jupyter notebook demonstrating parallel quantum simulation using open-source CUDA-Q (introductory material). Next, the tutorial enables attendees to deploy quantum software on large scale HPC clusters like Perlmutter to run, for example, a 30,000 term Hamiltonian using 100 GPUs across multiple nodes (intermediate and advanced material). The tutorial ends with a presentation on QuEra machines and their capabilities along with a hands-on example setting up Quantum Reservoir Models on QuEra’s platform (intermediate and advanced material).
This is the software to be used: https://nvidia.github.io/cuda-quantum/latest/index.html
This is the Docker image to be used (or extended to be optimal on Perlmutter): https://catalog.ngc.nvidia.com/orgs/nvidia/teams/quantum/containers/cuda-quantum
Workshop
Algorithms
Heterogeneous Computing
W
Description: Methods to mitigate the kernel launch overhead, one of the drawbacks of GPUs, were implemented in an overhead-sensitive atmospheric model using OpenACC and CUDA and were evaluated. OpenACC enables kernels to run asynchronously in either one or multiple GPU queues. Moreover, CUDA allows different loops to be collocated in one kernel by branching operations based on block indices. While the default synchronous execution on an A100 GPU lagged behind the A64FX CPU in strong scaling, the single-queue asynchronous execution reduced the total model runtime by 37%, and the kernel fusion of the core application component further accelerated the entire model by approximately 10%. In overhead-sensitive applications, the single-queue asynchronous execution is recommended because it can be easily implemented and maintained. If a small number of kernels are executed particularly frequently, it would be worth the effort to eliminate synchronizations and introduce CUDA Graphs, or bundle kernels using CUDA.
Paper
Artificial Intelligence/Machine Learning
Distributed Computing
Heterogeneous Computing
Performance Optimization
TP
Description: DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications. The large size of DLRM models, however, necessitates the use of multiple devices/GPUs for efficient training. A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices. To mitigate this, we introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training. We develop a novel error-bounded lossy compression algorithm, informed by an in-depth analysis of embedding data features, to achieve high compression ratios. Moreover, we introduce a dual-level adaptive strategy for error-bound adjustment, spanning both table-wise and iteration-wise aspects, to balance the compression benefits with the potential impacts on accuracy. We further optimize our compressor for PyTorch tensors on GPUs, minimizing compression overhead. Evaluation shows that our method achieves a 1.38X training speedup with a minimal accuracy impact.
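A minimal sketch of an absolute-error-bounded quantizer follows, illustrating the kind of guarantee such a compressor provides; the paper's actual algorithm, which also exploits embedding-data features and adapts the bound per table and per iteration, is considerably more sophisticated.

```python
import numpy as np

# Uniform quantization with a user-specified absolute error bound: each
# reconstructed value differs from the original by at most `error_bound`.
def quantize(values, error_bound):
    return np.round(values / (2.0 * error_bound)).astype(np.int32)

def dequantize(codes, error_bound):
    return codes.astype(np.float32) * (2.0 * error_bound)

rng = np.random.default_rng(0)
emb = rng.standard_normal(1_000_000).astype(np.float32)
eb = 1e-2
codes = quantize(emb, eb)          # small integers compress far better than floats
recon = dequantize(codes, eb)
assert np.abs(emb - recon).max() <= eb + 1e-6
print(codes.min(), codes.max())
```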
Doctoral Showcase
Posters
TP
Description: Advances in networks, accelerators, and cloud services encourage programmers to reconsider where to compute — such as when fast networks make it cost-effective to compute on remote accelerators despite added latency. Workflow and cloud-hosted serverless computing frameworks can manage multi-step computations spanning federated collections of cloud, high-performance computing, and edge systems, but passing data among computational steps remains a challenge when applications are compositions of multiple distinct software components with differing communication patterns.
This work introduces a new programming paradigm that decouples data flow from control flow by extending the pass-by-reference model to distributed applications. ProxyStore, developed here, implements this paradigm through object proxies that act as wide-area object references with just-in-time resolution. The proxy model enables producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. This decoupling enables the dynamic selection of different data movement methods, depending on what data are moved, where data are moved, or when data are moved — a longstanding challenge in distributed applications.
The efficacy of the proxy paradigm is further understood through four high-level proxy-based programming patterns applied to real-world computational science applications. These high-level patterns — distributed futures, streaming, ownership, and stateful actors — make the power of the proxy paradigm accessible for more complex and dynamic distributed program structures. ProxyStore is evaluated through standardized benchmark suites, introduced here, and meaningful science applications, spanning bioinformatics, federated learning, and molecular design, in which substantial improvements in runtime, throughput, and memory usage are demonstrated.
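A minimal usage sketch following ProxyStore's documented connector/store pattern is shown below; module paths and parameters may differ across versions, and the local file connector is used purely for demonstration.

```python
from proxystore.connectors.file import FileConnector
from proxystore.store import Store

# Store name and directory are arbitrary; FileConnector is the simplest
# backend for a local, single-node demonstration.
store = Store('demo', FileConnector(store_dir='/tmp/proxystore-demo'))

data = {'values': list(range(1000))}
proxy = store.proxy(data)       # producer hands out a lightweight reference

# The proxy resolves transparently on first use by the consumer,
# so the consumer code looks identical to working with the real object.
print(sum(proxy['values']))

store.close()
```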
Paper
Accelerators
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Distributed Computing
Graph Algorithms
Heterogeneous Computing
Tensors
TP
Description: Deep Learning Recommendation Models (DLRMs) face challenges due to the high memory needs of embedding tables and significant communication overhead in distributed settings. Traditional methods, like Tensor-Train (TT) decomposition, compress these tables effectively but add computational load. Furthermore, existing frameworks for distributed training are inadequate due to the excessive data exchange requirements.
We introduce EcoRec, an advanced library that boosts DLRM training by integrating TT decomposition with distributed training. EcoRec innovates with a unique computation pattern to streamline TT operations and an optimized multiplication approach, drastically cutting computation time. It implements a novel micro-batching method using sorted indices to slash memory use without extra computation. Moreover, EcoRec employs a pioneering pipeline for embedding layers, promoting even data spread and communication efficiency. Built on PyTorch and CUDA, tested on a 32 GPU cluster, EcoRec dramatically surpasses EL-Rec, delivering up to 3.1× faster training and reducing memory needs by 38.5%.
Doctoral Showcase
Posters
TP
Description: Modern high-performance computing (HPC) workflows produce massive datasets, often exceeding 100 TB per day, driven by instruments collecting data at gigabytes per second. These workflows, executed on advanced HPC systems with heterogeneous storage devices, high-performance microprocessors, accelerators, and interconnects, are increasingly complex and often involve non-deterministic computations. In this context, thousands of processes share computing resources using synchronization for consistency. The intricate process interaction and existing non-deterministic operations challenge explorations of workflow behaviors to ensure reproducibility, optimize performance, and reason about what happens when processes compete for resources. Existing reproducibility analysis frameworks are not well-suited to identify the sources and locations of non-determinism and performance variations, as they often focus on the final workflow results and general statistics about workflow performance.
We address these challenges by introducing scalable techniques that accelerate intermediate workflow results' comparison using variation-tolerant hashing of floating-point datasets, thus improving result reproducibility. We also capture workflow performance profiles and benchmark various queries to analyze workflow performance reproducibility. We also identify opportunities to optimize the loading process and indexing of performance data to ensure minimal initialization and querying overhead. Using collected performance data, we propose a cache-aware staggering technique that leverages workflow I/O profiles to reduce bottlenecks and resource contention, particularly in workflows that share the same input data. Our evaluations across molecular dynamics, cosmology, and deep learning workflows demonstrate significant speedup in intermediate results reproducibility analyses compared to state-of-art baselines and our ability to propose workflow execution strategies that maximize cache reuse and minimize execution makespan.
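A minimal sketch of variation-tolerant hashing: quantizing floating-point values to a tolerance before hashing, so that runs differing only within the tolerance produce identical digests. The tolerance and hash choice here are assumptions, not the thesis implementation.

```python
import hashlib
import numpy as np

# Quantize to the tolerance grid, then hash the integer codes: values that
# differ by less than the tolerance map to the same digest.
def tolerant_digest(arr, tolerance=1e-6):
    quantized = np.round(np.asarray(arr, dtype=np.float64) / tolerance).astype(np.int64)
    return hashlib.sha256(quantized.tobytes()).hexdigest()

a = np.array([1.0000001, 2.0, 3.0])
b = np.array([1.0000004, 2.0, 3.0])   # differs below the tolerance
c = np.array([1.1, 2.0, 3.0])         # differs above the tolerance
print(tolerant_digest(a) == tolerant_digest(b))   # True
print(tolerant_digest(a) == tolerant_digest(c))   # False
```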
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
Description: Inspired by the success of the first TPU for ML inference deployed in 2015, Google has developed multiple generations of machine learning supercomputers for efficient ML training and serving, enabling near linear scaling of ML workloads. In this talk, we will present how the TPU works as a machine learning supercomputer to benefit a growing number of Google services, including Gemini and Ads. Furthermore, we will take a deep dive into our full-stack co-design methodology that spans the model, software, and hardware layers, and how it turns accelerator concepts into reality.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
PAW-Full
Task Parallelism
W
Description: In this paper, we propose using Partitioned Global Address Space (PGAS) GPU one-sided asynchronous small messages to replace the widely used collective communication calls for sparse input multi-GPU embedding retrieval in deep learning recommendation systems. This GPU PGAS communication approach achieves (1) better communication and computation overlap, (2) smoother network usage, and (3) reduced overhead (due to the data unpack and rearrangement steps associated with collective communication calls). We implement a CUDA embedding retrieval backend for PyTorch that supports the proposed PGAS communication scheme and evaluate it on deep learning recommendation inference passes. Our backend outperforms the baseline using NCCL collective calls, achieving 1.97x speedup for the weak scaling test and 2.63x speedup for the strong scaling test in a 4 GPU NVLink-connected system.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
Description: Applications are increasingly written as dynamic workflows underpinned by an execution framework that manages asynchronous computations across distributed hardware. However, execution frameworks typically offer one-size-fits-all solutions for data flow management, which can restrict performance and scalability. ProxyStore, a middleware layer that optimizes data flow via an advanced pass-by-reference paradigm, has been shown to be an effective mechanism for addressing these limitations. Here, we investigate integrating ProxyStore with Dask Distributed, one of the most popular libraries for distributed computing in Python, with the goal of supporting scalable and portable scientific workflows. Dask provides an easy-to-use and flexible framework, but is less optimized for scaling certain data-intensive workflows. We investigate these limitations and detail the technical contributions necessary to develop a robust solution for distributed applications and demonstrate improved performance on synthetic benchmarks and real applications.
Exhibitor Forum
Accelerating Scientific Computing: GPU Optimization Strategies, Challenges, and Performance Outcomes
Accelerators
Software Engineering
TP
XO/EX
Description: GPUs are transforming scientific computing by delivering substantial speedups and energy savings. This presentation outlines the development of GPU computing strategies for a leading CFD software. Initially, we identified and optimized computational bottlenecks suitable for GPU acceleration, using an offload model. To overcome limitations imposed by Amdahl's law, we developed a GPU-native solver architecture with streamlined APIs, ensuring seamless integration with existing workflows.
Our optimization strategy also accounts for diverse GPU platforms, implementing platform-specific enhancements for NVIDIA, AMD, and Intel architectures. Scalability was achieved through advanced load-balancing algorithms and improved inter-GPU communication, enabling efficient parallelization for large-scale simulations.
We present results demonstrating significant speedups and energy savings compared to CPU-based methods, highlighting the transformative potential of GPUs in enabling faster, more complex simulations.
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
DescriptionThe RISC-V Instruction Set Architecture (ISA) has enjoyed phenomenal growth in recent years; however, it has yet to gain popularity in HPC. Whilst adopting RISC-V CPU solutions in HPC might be some way off, RISC-V-based PCIe accelerators offer a middle ground where vendors benefit from the flexibility of RISC-V yet fit into existing systems.
In this paper we focus on the Tenstorrent Grayskull PCIe RISC-V-based accelerator which, built upon Tensix cores, decouples data movement from compute. Using the Jacobi iterative method as a vehicle, we explore the suitability of stencils on the Grayskull e150. We explore best practices in structuring these codes for the accelerator and demonstrate that the e150 delivers performance similar to a Xeon Platinum CPU (albeit at BF16 rather than FP32 precision) while using around five times less energy. Across four e150s we obtain around four times the CPU performance, again at around five times less energy.
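For readers unfamiliar with the kernel, the sketch below shows the generic Jacobi 5-point stencil used as the benchmark; this is plain NumPy, not the Grayskull implementation, which expresses the same computation with Tensix data-movement and BF16 compute primitives.

```python
# Generic Jacobi iteration on a 2D grid: each interior point becomes the
# average of its four neighbours; boundary values are held fixed.
import numpy as np

def jacobi(grid, iterations):
    new = grid.copy()
    for _ in range(iterations):
        new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                                  grid[1:-1, :-2] + grid[1:-1, 2:])
        grid, new = new, grid                 # swap buffers for the next sweep
    return grid

grid = np.zeros((128, 128))
grid[0, :] = 1.0                              # hot top boundary
print(jacobi(grid, 100)[1:4, 60:64])
```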
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionIn this position paper we argue for standardizing how we share and process data in scientific workflows at the network level, to maximize step reuse and workflow portability across platforms and networks in pursuit of a foundational workflow stack. We look to evolve workflows from steps connected point-to-point in a directed acyclic graph (DAG) to steps connected via shared channels in a message system implemented as a network service. To start this evolution, we contribute a preliminary reference model, an architecture, and open tools to implement the architecture today. Our goal is to improve the deployment and operation of complex workflows by decoupling data sharing from data processing in workflow steps. We seek the workflow community's input on the merit of this approach, related research to explore, and initial requirements to inform future research.
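The toy sketch below illustrates the shift in wiring, with in-process queues standing in for the proposed network-level message service: steps publish to and consume from named channels rather than being bound to a specific successor, so a step can be reused or replaced without rewiring a DAG. Channel and step names are purely illustrative.

```python
# Illustrative only: workflow steps coupled through shared, named channels.
import queue

channels = {"raw-images": queue.Queue(), "features": queue.Queue()}

def acquire_step(n):
    for i in range(n):
        channels["raw-images"].put({"id": i, "pixels": [float(i)] * 16})

def feature_step():
    while not channels["raw-images"].empty():
        img = channels["raw-images"].get()
        channels["features"].put({"id": img["id"], "mean": sum(img["pixels"]) / 16})

acquire_step(3)    # producer publishes to a channel, not to a specific consumer
feature_step()     # any step subscribed to "raw-images" could do this work
print([channels["features"].get() for _ in range(3)])
```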
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionTraditional scientific visualization pipelines transfer entire data arrays from storage to client nodes for processing into displayable graphics objects. However, this full data transfer is often unnecessary, as many visualization filters operate on only small subsets of a data array. With the rise of computational storage, smart NICs, and smart devices enabling offloaded processing, this paper examines a case where a visualization pipeline is divided into pre-filters that run near the data and post-filters that execute on the client side. Pre-filters preprocess the data where it resides on storage nodes, reducing data volumes before transfer based on downstream pipeline needs, while post-filters complete the processing on the client node. Experiments on two real-world simulation datasets demonstrate that this approach can significantly reduce network transfer volumes, cutting visualization pipeline data load times by up to 2.8X compared to traditional methods, and up to 11.9X when combined with data compression techniques.
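A minimal sketch of the split, assuming a simple threshold filter, is shown below: the pre-filter keeps only the points the downstream pipeline needs, so only a small fraction of the array crosses the network before the post-filter finishes the job on the client. Function and field names are illustrative, not the paper's API.

```python
# Hedged sketch of a pre-filter (near data) / post-filter (client) split.
import numpy as np

def pre_filter_near_data(array, threshold):
    """Runs on the storage node: select only the values the pipeline needs."""
    mask = array > threshold
    return np.flatnonzero(mask), array[mask]

def post_filter_on_client(indices, values):
    """Runs on the client node: finish processing on the reduced payload."""
    return {"count": int(values.size), "mean": float(values.mean())}

full = np.random.rand(1_000_000)                 # lives on the storage node
idx, vals = pre_filter_near_data(full, 0.99)     # ~1% of the data is transferred
print(post_filter_on_client(idx, vals))
```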
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
DescriptionGraph Neural Networks (GNNs) have been used to solve complex problems in drug discovery, social media analysis, and other domains. Meanwhile, GPUs have become the dominant accelerators for improving deep neural network performance. However, due to the characteristics of graph data, it is challenging to accelerate GNN-type workloads with GPUs alone. GraphSAGE is one representative GNN workload that uses sampling to improve GNN learning efficiency. Profiling GraphSAGE with the PyG library reveals that the sampling stage on the CPU is the bottleneck. Hence, we propose a heterogeneous system architecture in which the sampling algorithm is accelerated on customizable accelerators (FPGAs) and the sampled data is fed into GPU training through a PCIe Peer-to-Peer (P2P) communication flow. With FPGA acceleration, for the sampling stage alone, we achieve a speed-up of 2.38X to 8.55X compared with sampling on the CPU.
For end-to-end latency, compared with the traditional flow, we achieve a speed-up of 1.24X to 1.99X.
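The sketch below shows the baseline flow being profiled, using PyG's NeighborLoader for GraphSAGE-style neighbour sampling; dataset choice, fan-outs, and batch size are illustrative. The sampling loop runs on the CPU, which is exactly the stage the paper offloads to the FPGA.

```python
# Hedged sketch of CPU-side GraphSAGE sampling with PyG's NeighborLoader.
import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

data = Planetoid(root="/tmp/cora", name="Cora")[0]

loader = NeighborLoader(
    data,
    num_neighbors=[25, 10],          # fan-out per GNN layer, as in GraphSAGE
    batch_size=128,
    input_nodes=data.train_mask,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for batch in loader:                 # neighbour sampling happens here, on the CPU
    batch = batch.to(device)         # sampled subgraph is then copied to the GPU
    # ... a GraphSAGE forward/backward pass would run here ...
    break
```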
Paper
Accelerators
Energy Efficiency
Facilities
Resource Management
State of the Practice
TP
DescriptionGPUs have emerged as the go-to accelerator for HPC workloads; however, their power consumption has become a major limiting factor for further scaling HPC systems. An accurate understanding of GPU power consumption is essential for further improving energy efficiency and, consequently, reducing the associated carbon footprint. Despite limited documentation and a lack of understanding of its internals, NVIDIA GPUs' built-in power sensor is widely used in energy-efficient computing research. Our study seeks to elucidate the internal mechanisms behind the power readings provided by nvidia-smi and assess the accuracy of the measurements. We evaluated over 70 different GPUs across 12 architectural generations and identified several unforeseen problems that can lead to drastic under- or overestimation of the energy consumed; for example, on the A100 and H100 GPUs only 25% of the runtime is sampled. We propose several mitigations that could reduce the energy measurement error by an average of 35% in the test cases we present.
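To make the measurement pathway concrete, the sketch below polls the built-in sensor through NVML and integrates power into energy; the polling rate and the assumption that samples represent the whole runtime are precisely the details the paper shows can bias such estimates.

```python
# Hedged sketch: naive energy estimation from nvidia-smi's underlying NVML sensor.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
t_end = time.time() + 5.0                            # measure for 5 seconds
while time.time() < t_end:
    mw = pynvml.nvmlDeviceGetPowerUsage(handle)      # instantaneous reading, milliwatts
    samples.append((time.time(), mw / 1000.0))
    time.sleep(0.01)                                 # 100 Hz polling; the sensor may update far less often

# Trapezoidal integration of power over time gives energy in joules.
energy_j = sum((t2 - t1) * (p1 + p2) / 2
               for (t1, p1), (t2, p2) in zip(samples, samples[1:]))
print(f"Estimated energy: {energy_j:.1f} J from {len(samples)} samples")
pynvml.nvmlShutdown()
```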
Workshop
Distributed Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
W
DescriptionInterconnection networks are key components that determine the performance of today's large data center and supercomputer systems. Both topology and routing are critical aspects that must be carefully considered for a competitive system network design. Moreover, when daily failures are expected, this combination must be resilient and robust. Low-diameter networks, including HyperX, are cheaper than typical fat trees, but to be truly competitive they must employ advanced routing algorithms that both balance traffic and tolerate failures.
In this paper, SurePath, an efficient fault-tolerant routing mechanism for HyperX topology, is introduced and evaluated. SurePath leverages routes provided by standard routing algorithms and a deadlock avoidance mechanism based on an Up/Down escape subnetwork. This mechanism not only prevents deadlock but also allows for a fault-tolerant solution for these networks. SurePath is thoroughly evaluated in the paper under different traffic patterns, showing no performance degradation under extremely faulty scenarios.
Workshop
Data Movement and Memory
Emerging Technologies
W
DescriptionThe need to support a large volume of transactions on shared data is increasing to meet explosive growth in worldwide data and processing demands. Emerging memory architectures such as CXL are increasing in popularity; CXL allows for dynamic demand-sensitive resizing of aggregated memory, support for heterogeneous memory types, and sharing of data amongst supported processors and devices. However, while this new memory architecture alleviates many concerns in datacenter and HPC architectures, data integrity when using memory-based transactions over CXL faces many challenges.
To address these challenges, we describe a novel solution for providing ACID (Atomicity, Consistency, Isolation, Durability) transactions in a CXL-based memory architecture. We call this solution Transactional CXL, or TCXL. It requires no changes to existing processor microarchitectures and is implemented in a software library with a back-end controller that can be embedded in a CXL controller, deployed as a stand-alone CXL device, or implemented on the host.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionActive learning algorithms, which integrate machine learning, quantum computing, and optics simulation in an iterative loop, offer a promising approach to optimizing metamaterials. However, these algorithms can face difficulties in optimizing highly complex structures due to computational limitations. High-performance computing (HPC) and quantum computing (QC) integrated systems can address these issues by enabling parallel computing. In this study, we develop an active learning algorithm that runs on HPC-QC integrated systems. We evaluate the performance of the optimization stages within active learning (i.e., training a machine learning model, problem-solving with quantum computing, and evaluating optical properties through wave-optics simulation) for highly complex metamaterial cases. Our results show that utilizing multiple cores on the integrated system can significantly reduce computational time, thereby enhancing the efficiency of the optimization process. We therefore expect that leveraging HPC-QC integrated systems can help tackle large-scale optimization challenges more broadly.
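The skeleton below sketches the loop being parallelized, with mocked stand-ins for the three stages (train_surrogate, quantum_optimize, and simulate_optics are hypothetical names, not the study's code); the point is simply that candidate evaluations can be farmed out across the cores of the integrated system.

```python
# Generic active-learning loop with the simulation stage run in parallel.
import random
from concurrent.futures import ProcessPoolExecutor

def simulate_optics(design):                 # stand-in for the wave-optics simulation
    return -sum((x - 0.5) ** 2 for x in design)

def train_surrogate(dataset):                # stand-in for the ML surrogate
    return max(dataset, key=lambda pair: pair[1])[0]

def quantum_optimize(model):                 # stand-in for the QC proposal step
    return [[x + random.uniform(-0.1, 0.1) for x in model] for _ in range(8)]

def run_active_learning(initial_designs, n_rounds=5, n_workers=4):
    dataset = [(d, simulate_optics(d)) for d in initial_designs]
    for _ in range(n_rounds):
        model = train_surrogate(dataset)                    # ML stage
        candidates = quantum_optimize(model)                # QC stage
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            results = list(pool.map(simulate_optics, candidates))  # parallel HPC stage
        dataset.extend(zip(candidates, results))
    return max(dataset, key=lambda pair: pair[1])

if __name__ == "__main__":
    seeds = [[random.random() for _ in range(4)] for _ in range(4)]
    print(run_active_learning(seeds))
```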
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
DescriptionArtificial Intelligence, combined with simulations and experiments, has great potential to accelerate scientific discovery, yet bridging the gap between simulations and experiments remains challenging due to disparities in time and scale. Our research addresses this issue by developing a deep kernel-based surrogate model that learns from microscopic images to map structural features to energy differences arising from defect formation. We begin with full training on simulated images to establish optimal settings and create a baseline for active learning. Active learning is then employed to predict structures along simulation trajectories based on uncertainty and stability, reducing data requirements and computational costs. The model shows a low average error margin of approximately 0.03 meV. An autoencoder-decoder was developed as an additional surrogate to enhance feature extraction and reconstruction, achieving a reconstruction loss of around 0.2 and facilitating precise comparisons between simulations and experiments. This approach advances real-time experimental guidance through computational simulations.
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
DescriptionA Fine-grained Asynchronous Bulk Synchronous Parallel (FA-BSP) model is an extended version of the existing BSP model that facilitates fine-grained asynchronous point-to-point messages with automatic message aggregation.
While there are many large irregular applications written with the FA-BSP model that demonstrate promising performance, no existing profiler is aware of the profile-worthy portions of an FA-BSP program or visualizes the results in an intuitive way. This is understandable because an FA-BSP program relies on multiple external libraries, and the runtime frequently switches between different portions of the program, which makes it difficult for well-established profilers such as Score-P, TAU, CrayPat, VTune, and HPCToolkit to profile and visualize these portions in an FA-BSP-friendly manner.
This paper designs and implements a profiling and visualization framework called ActorProf. The framework enables 1) asynchronous point-to-point message-aware profiling with hardware performance counters, 2) overall performance breakdown that is aware of FA-BSP execution, and 3) visualization of these profiling results.
Paper
Accelerators
Algorithms
Artificial Intelligence/Machine Learning
Graph Algorithms
Performance Optimization
TP
DescriptionAttention-based models are proliferating in the space of image analytics, including segmentation. The standard method of feeding images to transformer encoders is to divide the images into patches and then feed the patches to the model as a linear sequence of tokens. For high-resolution images, e.g., microscopic pathology images, the quadratic compute and memory cost prohibits the use of an attention-based model. Existing solutions either use complex multi-resolution models or approximate attention schemes. We take inspiration from Adaptive Mesh Refinement (AMR) methods and adaptively patch the images based on image detail, reducing the number of patches fed to the model. This method has negligible overhead and works seamlessly as a pre-processing step with any attention-based model. We demonstrate superior segmentation quality over widely used segmentation models for real-world pathology datasets while gaining a geomean speedup of 6.9x for resolutions up to 64K^2.
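A minimal sketch of the patching idea, under an assumed variance criterion, is given below: a patch is split into four only where the image shows detail, so flat regions are covered by a few large patches and the token sequence fed to the transformer shrinks. The thresholds and the detail measure are illustrative choices, not the paper's exact scheme.

```python
# Hedged sketch of AMR-inspired adaptive patching via a quadtree split.
import numpy as np

def adaptive_patches(img, x, y, size, min_size=16, var_threshold=1e-3):
    """Return (x, y, size) tuples covering the image, refining detailed regions."""
    tile = img[y:y + size, x:x + size]
    if size <= min_size or tile.var() < var_threshold:
        return [(x, y, size)]                 # flat region or smallest patch: stop
    half = size // 2
    patches = []
    for dy in (0, half):
        for dx in (0, half):
            patches += adaptive_patches(img, x + dx, y + dy, half, min_size, var_threshold)
    return patches

img = np.zeros((256, 256))
img[96:160, 96:160] = np.random.rand(64, 64)  # detail only in the centre
patches = adaptive_patches(img, 0, 0, 256)
print(len(patches), "adaptive patches vs.", (256 // 16) ** 2, "uniform 16x16 patches")
```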
Birds of a Feather
TP
XO/EX
DescriptionTestbeds play a vital role in assessing the readiness of novel architectures for upcoming exascale and post-exascale supercomputers. These testbeds also act as co-design hubs, enabling the collection of application operational requirements while identifying critical gaps that must be addressed for an architecture to become viable for HPC. Various research centers are actively deploying testbeds, and our aim is to build a community that facilitates the sharing of information, encouraging collaboration and understanding of the available evaluation resources. This BoF will facilitate the exchange of best practices, including testbed design, benchmarking, system evaluation, and availability.