Posters
Tuesday, 19 November 2024, 10am-5pm EST, Room B301

Investigating Clustering Behavior in Fuels
Session: Art of HPC Display
Description: Simulation data was computed on resources of the Argonne Leadership Computing Facility, and rendered using the VMD software package. No ML tools were leveraged in the rendering.

Collatz Kaleidoscope
Session: Art of HPC Display
Description: The image is generated wholly from code written in Python (version 3.11.5), using the visualization library Matplotlib (version 3.8.3) to programmatically and procedurally define the design. The underlying code, along with the generated output design, has been open-sourced under the CC BY 4.0 license and is available in the public GitHub repository "high-res-art" under the artist's personal space (https://github.com/sadielbartholomew/high-res-art/blob/main/collatz_hi_res.py). High-performance computing was indispensable for refining the parameters that encode the precise design, in particular the translation factors and the set of hex-triplet color codes defining the background and marker colors. Configurations of these parameters, starting from exploratory values, were batch-processed using the Slurm workload manager on the UK's JASMIN supercomputer; the generated outputs were inspected and promising parameter sets honed over several iterations until this design emerged as a visual favorite.
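As an aside for readers who want to experiment, a minimal, hypothetical sketch of this kind of procedural Matplotlib art (not the artist's published script, which is linked above) could look like:

    import matplotlib.pyplot as plt

    def collatz_orbit(n):
        """Return the Collatz trajectory starting at n."""
        orbit = [n]
        while n != 1:
            n = 3 * n + 1 if n % 2 else n // 2
            orbit.append(n)
        return orbit

    # The facecolor and marker color stand in for the hex-triplet parameters
    # that the description says were tuned via HPC batch runs.
    fig, ax = plt.subplots(figsize=(8, 8), facecolor="#101020")
    for start in range(2, 400):
        orbit = collatz_orbit(start)
        ax.plot(range(len(orbit)), orbit, lw=0.3, alpha=0.4, color="#e0b040")
    ax.set_axis_off()
    fig.savefig("collatz_sketch.png", dpi=300)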

Collatz Residuals
Session: Art of HPC Display
Description: The image is generated wholly from code written in Python (version 3.11.5), using the visualization library "matplotlib" (version 3.8.3) to programmatically and procedurally define the design. The underlying code, along with the generated output design, has been open-sourced under the CC BY 4.0 license and is available in the public GitHub repository "high-res-art" under the artist's personal space (https://github.com/sadielbartholomew/high-res-art/blob/main/collatz_hi_res.py). High-performance computing was indispensable for refining the parameters that encode the precise design, in particular the translation factors and the set of hex-triplet color codes defining the background and marker colors. Configurations of these parameters, starting from exploratory values, were batch-processed using the Slurm workload manager on the UK's JASMIN supercomputer; the generated outputs were inspected and promising parameter sets honed over several iterations until this design emerged as a visual favorite.

Arctic Wind Paintings
Session: Art of HPC Display
Description: The data used comes from E3SM-MPAS, a global climate model run at Los Alamos National Laboratory. ParaView, an open-source scientific visualization system, was used to transform the raw data to renderable geometry. Artifact-Based Rendering (www.sculpting-vis.org), a research system developed by a collaboration between the University of Minnesota and the Texas Advanced Computing Center at the University of Texas at Austin, was then used to add artist-made data-driven visual attributes and render the scene. No ML programs were used.

Sunset on Wind Turbines
Session: Art of HPC Display
Description: The two CFD simulations used to compute the wake from the wind turbines were performed using the high-order finite-difference flow solver XCompact3D (https://www.incompact3d.com). The simulations were run on ARCHER2, the UK's national HPC service (https://www.archer2.ac.uk). A precursor simulation was run to generate the neutral atmospheric boundary layer, and the wind turbine simulations were then run with a billion mesh points for 20,000 iterations before collecting the flow field.
We used ParaView for the initial postprocessing of the CFD simulation's velocity data to compute the Q-criterion. The final video was made using Blender, including the simulation of the ocean. The representation of the sky uses an HDRi from Poly Haven, "The Sky Is On Fire" by Greg Zaal and Rico Cilliers (see https://polyhaven.com/a/the_sky_is_on_fire).
The video has a total of 590 frames and took 14 hours to render on a workstation with an AMD Ryzen Threadripper PRO 7985WX (64 cores), six NVIDIA GeForce RTX 4090 GPUs, 512GB of DDR5 system memory, and four Samsung 990 PRO 4TB NVMe SSDs.
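For context on the postprocessing step, the Q-criterion is commonly defined as Q = 1/2 (||Omega||^2 - ||S||^2), where S and Omega are the symmetric and antisymmetric parts of the velocity gradient tensor. A minimal NumPy sketch of that computation on a uniform grid (illustrative only; the authors used ParaView's built-in tools):

    import numpy as np

    def q_criterion(u, v, w, dx=1.0):
        """Q = 0.5 * (||Omega||^2 - ||S||^2) on a uniform grid of spacing dx."""
        # J[i, j] = d(u_i)/d(x_j), via central differences.
        J = np.array([np.gradient(f, dx) for f in (u, v, w)])
        S = 0.5 * (J + J.transpose(1, 0, 2, 3, 4))  # strain-rate tensor
        O = 0.5 * (J - J.transpose(1, 0, 2, 3, 4))  # rotation tensor
        return 0.5 * (np.sum(O**2, axis=(0, 1)) - np.sum(S**2, axis=(0, 1)))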

Cosmic Breeze
Session: Art of HPC Display
Description: This animation consists of three primary elements. The first is the evolving render of the 3D structure of the coronal magnetic field. It is created from 776 sequences from the "Live Prediction" HPC simulation developed by Predictive Science Inc. as part of their scientific research and outreach for the April 8, 2024 eclipse. See predsci.com/eclipse2024 for more details on the eclipse simulation, which was created using the MAS code (predsci.com/mas) running on several thousand processors. The volume rendering is used for scientific purposes and is generated using parallelized Fortran tools, which map millions of magnetic field lines from 3D model data and compute a scene by integrating a complexity indicator (the squashing factor) along parallel lines of sight. The only differences between this and the pure science version are the resolution and a very light unsharp masking to enhance contrast. The second element is an image of the moon obtained using the LROC Quickmap tool (quickmap.lroc.asu.edu), based on public NASA/LRO data. The third is a slowly moving starfield cropped from a publicly available observation image from the ESA/Hubble archive [Credit: NASA, ESA and Jesús Maíz Apellániz (Instituto de Astrofísica de Andalucía, Spain), esahubble.org/images/heic1011a]. A gamma correction is applied to the original image data. The three elements are then composited using "screen" blending. The background music is "Fireflies" by Ambient Boy, obtained from uppbeat.io.
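For reference, "screen" blending brightens by inverting, multiplying, and inverting again: result = 1 - (1 - a)(1 - b). A tiny NumPy illustration (the production compositing details are not specified beyond the blend mode):

    import numpy as np

    def screen_blend(a, b):
        """Screen blend of two images with float channels in [0, 1]."""
        return 1.0 - (1.0 - a) * (1.0 - b)

    rng = np.random.default_rng(0)
    corona = rng.random((4, 4, 3))           # stand-in for the field-line render
    stars = rng.random((4, 4, 3))            # stand-in for the starfield layer
    composite = screen_blend(corona, stars)  # never darker than either input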

Virtual Expedition through the Ice of Greenland's Glaciers
Session: Art of HPC Display
Description: This analysis requires a large amount of raw data: detailed models of the 3D topography of the sea floor, satellite imagery to monitor glacier change over time, and physical samples of the ice and sediment. These datasets need to be mapped to a single consistent geographic framework for comparison. For example, lower-resolution (100 m) satellite-derived data of the entire glacier research site needed to be superimposed onto the higher-resolution (5 m) multibeam (sonar) bathymetry of the ocean bottom from ship tracks close to the glacier terminus. Physical samples of ice and sediment will also be collected during the expedition and will need to be represented in this common georeferenced space.
The conglomerate topography/bathymetry of Greenland and the glacier terminus endpoints over time were imported into a 3D visualization application (ParaView). The coverage of the high-resolution data is only a fraction of the total area, so care must be taken to represent the data appropriately, and color maps must be constructed to represent the various categories (ocean, land, ice). The portions of interest are exported for 3D exploration in Unity via the Artifact-Based Rendering (ABR) plugins running in Unity and ParaView. The ABR engine allows the complex scientific visualizations from ParaView to be piped to the Unity engine in real time and gives the user control over key color and textural representations. The hand-tracking/gestural navigation developed in Unity enables users to easily explore the data and approach areas of scientific interest for closer inspection. Users can export the exploration path as an animation for collaboration.

Learning from the Sky Using the Hardware/Hybrid Accelerated Cosmology Code (HACC)
Session: Art of HPC Display
Description: The image was created using ParaView, from data computed on the Aurora supercomputer by the HACC collaboration.

Visualization of a CM1 Cloud Simulation
Session: Art of HPC Display
Description: The simulation data was generated from a three-hour simulation at 7.5 m grid spacing (domain size of 10 km x 10 km x 8 km) of a precipitating cumulus congestus cloud using Cloud Model 1 (CM1) with Lagrangian microphysics on NSF NCAR's Derecho supercomputer. Model output, in the form of NetCDF files, was converted to OpenVDB volume files, which were then read directly into the Blender 3D animation software.
Materials, lighting, and camera motion for the pan/zoom sequence were all applied in Blender. The animation was rendered on an NVIDIA GPU using NVIDIA OptiX via the Blender Cycles rendering engine. The Derecho supercomputer model was also rendered in Blender from a textured mesh. Post-production, including adding text and logos, was performed in Adobe Premiere.
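A hedged sketch of the NetCDF-to-OpenVDB conversion step, using the netCDF4 and pyopenvdb packages (the file and variable names are assumptions; CM1 output layout varies by configuration):

    import numpy as np
    import pyopenvdb as vdb
    from netCDF4 import Dataset

    nc = Dataset("cm1out_000001.nc")
    # Hypothetical variable: cloud water mixing ratio at the first output time.
    qc = np.asarray(nc.variables["qc"][0], dtype=np.float32)

    grid = vdb.FloatGrid()
    grid.copyFromArray(qc)   # dense array -> sparse VDB tree
    grid.name = "density"    # the grid name a Blender volume shader looks up
    vdb.write("cloud_000001.vdb", grids=[grid])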

Challenges of Finding Cyber Attacks and Remediating Network Issues
Session: Art of HPC Display
Description: The packets were captured at a network border with 400 Gbps links. The raw packet data were preprocessed in Python to generate bipartite (pair-wise) TCP connections, which Gephi then assembled into a graph and visualized. The original picture was 16K resolution (15360 x 8640), containing a snapshot of 2.29 M TCP connections visualized by Gephi using the Yifan Hu graph layout algorithm. Further processing was done in PowerPoint to highlight the area of interest. Several people were involved: data collection (Alex Withers), rendering (Bach Hoang), leading the process (Phuong Cao), and concept and feedback (Ravi Iyer, Zbigniew Kalbarczyk).
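A minimal sketch of the Python preprocessing stage (column names are assumptions; the capture format is not specified), producing the edge-list CSV form that Gephi imports directly:

    import pandas as pd

    pkts = pd.read_csv("packets.csv", usecols=["src_ip", "dst_ip"])

    # Collapse raw packets into pair-wise connections with a weight per pair.
    edges = (pkts.groupby(["src_ip", "dst_ip"]).size()
                 .reset_index(name="Weight")
                 .rename(columns={"src_ip": "Source", "dst_ip": "Target"}))
    edges.to_csv("tcp_edges.csv", index=False)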

The Heart of HPC
Session: Art of HPC Display
Description: This image was created using a Canon EOS 7, a Google Pixel phone, Adobe Photoshop, Adobe Premiere, NCAR stock video, and ThingLink.

NCSA Delta Wrap Design
Session: Art of HPC Display
Description: The design was created with Adobe Creative Suite and purchased design assets from iStock. It was designed by NCSA staff with no AI or ML use.

48 Hours of Kestrel Jobs
Session: Art of HPC Display
Description: This artifact was created with Pandas, Matplotlib, and NetworkX. Data was gathered from the Kestrel cluster at the National Renewable Energy Laboratory with Slurm, via the sacct and sinfo commands.
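A hedged sketch of the gathering step (the sacct flags shown are standard Slurm options, but the exact fields and time window used for the artwork are not documented):

    import subprocess
    from io import StringIO
    import pandas as pd

    # Pull 48 hours of job records in machine-parseable form.
    out = subprocess.run(
        ["sacct", "--allusers",
         "--starttime", "2024-09-01T00:00:00",
         "--endtime", "2024-09-03T00:00:00",
         "--format=JobID,Partition,Start,End,NNodes,State",
         "--parsable2", "--noheader"],
        capture_output=True, text=True, check=True).stdout

    jobs = pd.read_csv(StringIO(out), sep="|",
                       names=["JobID", "Partition", "Start", "End", "NNodes", "State"])
    print(jobs.head())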

Impressions of PIConGPU Using Watercolors
Session: Art of HPC Display
Description: The image was painted using Arteza watercolors and brushes on cold-pressed watercolor paper, and was captured and uploaded via an iPhone.

Blood Flow through a Microaneurysm
Session: Art of HPC Display
Description: We used the Coreform Cubit software to create an Exodus-II tri-mesh with 11,196 points for the blood vessel walls. Red blood cells are placed randomly within the mesh bounds, and the algorithm from RBC3D, a spectral boundary integral solver for cell-scale flows, then initiates Stokes flow through the vessel. This algorithm is parallelized via MPI, and running the simulation to 10,000 timesteps took 192 CPU cores for eight hours. To visualize the simulation data, we used Kitware's ParaView software, running the OSPRay path tracer from ParaView's ray-tracing tools on two NVIDIA RTX 6000 GPUs. Georgia Tech's PACE Phoenix cluster provided access to the CPU and GPU nodes under Spencer Bryngelson's allocation; the ray-tracing step took 16 hours on these nodes. Finally, we combined images of the simulation from each timestep into a video using FFmpeg.
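The final frame-assembly step typically amounts to a single FFmpeg invocation; a hedged example (the frame naming and frame rate are assumptions):

    import subprocess

    # Stitch numbered frames (frame_0001.png, ...) into an H.264 video.
    subprocess.run(
        ["ffmpeg", "-framerate", "30", "-i", "frame_%04d.png",
         "-c:v", "libx264", "-pix_fmt", "yuv420p", "aneurysm.mp4"],
        check=True)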

HPC Creates Community: Women in HPC
Session: Art of HPC Display
Description: This image was created utilising publicly available photography from the SC21, SC22, and SC23 workshops and networking events. Images were digitally cut and arranged in Canva, with additional graphics support made in Adobe Illustrator.

Enhancing Brain Flow Visualization
Session: Art of HPC Display
Description: The visualization pipeline was developed to process the IFF data using ParaView. A blend of surfacic approximation and volumetric multi-scattering was used to generate the two volumes of the brain and the tumor. Particles passing through the voxels of interest were represented as spheres, using the Point Gaussian representation, oriented by the velocity vector and colored by the path-line density value.
The visualization required MPI and HPC resources in order to produce the volume rendering of the image stacks. The framework was established using EGL ParaView in a server-client setup on a Dell PowerEdge R7525 (2U) with dual AMD 7502 (32C/64T) CPUs, 512GB RAM (16x32GB), one NVIDIA Ampere A40 GPU (48 GB VRAM), and about 3.2TB of local storage.

Undulations in Rotation
Session: Art of HPC Display
Description: The image is generated wholly from code written in Python (version 3.11.5), using the visualization library "matplotlib" (version 3.8.3) to programmatically and procedurally define the design. The underlying code, along with the generated output design, has been open-sourced under the CC BY 4.0 license and is available in the public GitHub repository "high-res-art" under the artist's personal space (https://github.com/sadielbartholomew/high-res-art/blob/main/inspired_by_le_parc_hi_res.py). High-performance computing was indispensable for refining the parameters that encode the precise design, in particular the radii of the two aligned wedges forming the element under repeated rotation, the number of patches per side, and the rotational array. The Slurm workload manager was used on the UK's JASMIN supercomputer to batch-process configurations of these parameters starting from exploratory values; the generated outputs were inspected and promising parameter sets honed over several iterations until this design emerged as a visual favorite.

Complex Plasma Crystal Rose
Session: Art of HPC Display
Description: The simulation was written in C/C++ and CUDA. The computationally intensive portions of the code were offloaded to NVIDIA GPUs for acceleration. The graphics were rendered using OpenGL.

Magnetized Bipolar Jet
Session: Art of HPC Display
Description: The simulation data was produced using the AthenaPK code (https://github.com/parthenon-hpc-lab/athenapk) on the Frontier supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The data is a time snapshot from an AthenaPK simulation of a galaxy cluster with a virial mass of ~6.6e14 solar masses and a central supermassive black hole of ~1.1e9 solar masses. The data was then visualized on the Andes cluster at OLCF using VisIt (https://github.com/visit-dav/visit), which generated an isosurface of the jet and the streamlines of the magnetic field. Both the isosurfaces and the streamlines were exported to OBJ files, which were further visualized on Frontier using Blender (https://www.blender.org/): Blender imported both OBJ files and made the final PNG render of the objects using the Cycles render engine.
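A condensed sketch of that final Blender stage, scripted through Blender's Python API (the paths are placeholders, and the OBJ import operator shown is the one from Blender 3.2+; earlier versions used bpy.ops.import_scene.obj):

    import bpy

    # Start from an empty scene and import the exported geometry.
    bpy.ops.wm.read_factory_settings(use_empty=True)
    bpy.ops.wm.obj_import(filepath="jet_isosurface.obj")
    bpy.ops.wm.obj_import(filepath="field_streamlines.obj")

    # Render a still with the Cycles engine.
    scene = bpy.context.scene
    scene.render.engine = 'CYCLES'
    scene.render.image_settings.file_format = 'PNG'
    scene.render.filepath = "//jet_render.png"
    bpy.ops.render.render(write_still=True)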

Preparing to Battle Cancer at the Exascale
Session: Art of HPC Display
Description: Simulation data was computed on resources of the Argonne Leadership Computing Facility, and rendered using ParaView. No ML tools were leveraged in the rendering.

Connections in Rotation
Session: Art of HPC Display
Description: The image is generated wholly from code written in Python (version 3.11.5), using the visualization library Matplotlib (version 3.8.3) to programmatically and procedurally define the design. The underlying code, along with the generated output design, has been open-sourced under the CC BY 4.0 license and is available in the public GitHub repository "high-res-art" under the artist's personal space (https://github.com/sadielbartholomew/high-res-art/blob/main/inspired_by_le_parc_hi_res.py). High-performance computing was indispensable for refining the parameters that encode the precise design, in particular the radius, width, and transparency of the individual patch forming the element under repeated rotation; the number of patches per side; and the rotational start and end points forming the pair of rotational arrays. The Slurm workload manager was used on the UK's JASMIN supercomputer to batch-process configurations of these parameters starting from exploratory values; the generated outputs were inspected and promising parameter sets honed over several iterations until this design emerged as a visual favorite.

Birth of a Neutron Star from a 25-Solar-Mass Star
Session: Art of HPC Display
Description: Simulation data was computed on resources of the Argonne Leadership Computing Facility, and rendered using ParaView. No ML tools were leveraged in the rendering.

Currents in the Gulf
Session: Art of HPC Display
Description: The data is processed in ParaView and then transferred to Artifact-Based Rendering, a custom-built visualization system designed for artists that enables one to apply custom artifacts to large multivariate volumetric data. Details about Artifact-Based Rendering can be found at www.sculpting-vis.org.
Prior work within Artifact-Based Rendering has focused on colormaps and glyphs. Here we have turned our attention to the potential of line textures to distinguish different categories of streamlines and enrich their visual impact.
All the visual encodings of the data were hand-generated and applied by the artist via the Artifact-Based Rendering interface. No ML was used to create this image.

"What's Going On in There?" A View into NREL's Kestrel Supercomputer
Session: Art of HPC Display
Description: The recorded visualization was built with JavaScript using the D3 and Anime.js libraries. Historical run data from the Kestrel supercomputer was queried with SQL from NREL's internal sysadmin database and bundled into a JSON file for use by the JavaScript code. The JSON file was organized into minute-long time-steps, each capturing the state of the jobs on the supercomputer at a particular point in time.
The video itself is a screen recording of the visualization running on a MacBook Pro with an Intel i9 2.3 GHz 8-core processor and an AMD Radeon Pro 5500 graphics card. The recording was captured with the default QuickTime software and edited for length using Adobe Premiere Pro.

Carbon Catcher
Session: Art of HPC Display
Description: Using GFlowNets, we generate porous reticular materials, such as metal-organic frameworks (MOFs) and covalent organic frameworks, for applications in carbon dioxide capture. We introduce a new Python package (matgfn) to train and sample GFlowNets. We use matgfn to generate the matgfn-rm dataset of novel and diverse reticular materials with gravimetric surface area above 5000 m² g⁻¹. We calculate single- and two-component gas adsorption isotherms for the top 100 candidates in matgfn-rm. These candidates are novel compared to the state-of-the-art ARC-MOF dataset and rank in the 90th percentile for working capacity compared to the CoRE2019 dataset. We identify 13 materials with CO2 working capacity outperforming all materials in CoRE2019. After further analysis and structural relaxation, two outperforming materials remain (https://pubs.rsc.org/en/content/articlelanding/2024/dd/d4dd00020j).
Once the .xyz files were created, they were handed over to the visualisation team, who imported them into VMD. In VMD, the van der Waals graphical representation was chosen, and the result was then imported into Blender, where the lighting, materials, textures, and environment were all adjusted to show the MOFs in an artistic way.

Complexity Revealed
Session: Art of HPC Display
Description: The coronal calculation was performed with the open-source POT3D code (github.com/predsci/pot3d), using high-resolution surface magnetic field observations from the Helioseismic and Magnetic Imager as the lower boundary condition. The solution was computed on the Stampede2 supercomputer at the Texas Advanced Computing Center using 128 48-core compute nodes. Over 100 million magnetic field lines were then traced through the 6.6-billion-cell 3D solution using the open-source MapFL code (github.com/predsci/mapfl) to calculate the "magnetic squashing factor" (an indicator of magnetic structure) and to determine which magnetic field lines were open (extending out into the heliosphere) or closed (falling back to the Sun). The image is a layered composite of six images of three signed quantities (squashing factor in purple [+] and green [-], magnetic field strength in orange [+] and cyan [-], and whether the field is open [red] or closed [dark blue]). The six images are blended using alpha transparency mapping but are otherwise not altered from the original raw quantities (no sharpening or artistic enhancements are applied).

NCSA Granite Wrap Design
Session: Art of HPC Display
Description: Created with Adobe Creative Suite and purchased design assets from iStock. Designed by NCSA staff. No AI or ML use.

Destroy!
Session: Art of HPC Display
Description: This image was created on the Ohio Supercomputer Center's Pitzer cluster using the open-source Stable Diffusion Automatic1111 web UI with the SD-XL checkpoint and the following parameters. Prompt: "Destruction, (one artist:1.3) pulling strings from walls, war, dark, good quality, masterpiece"; negative prompt: "painting". Steps: 50. Sampler: DPM++ 2M SDE Karras. CFG scale: 7. Size: 1920x1080. Model hash: 31e35c80fc. Model: sd_xl_base_1.0. Version: v1.5.1.

Tuesday, 19 November 2024, 12pm-5pm EST, Rooms B302-B305

Interactive and Tool-Agnostic ML-Driven Workflow for Automated HPC Performance Modeling
Description: This work presents an automated, reproducible, ML-based performance modeling workflow for HPC systems. The proposed workflow fully automates data generation, preprocessing, ML model training, and validation. Since the proposed approach is generic and not tailored to a specific application, our workflow can be utilized for performance modeling across a wide range of performance domains. The prototype implementation is based on the JUBE workflow environment, through which a user-friendly interactive console is realized. The effectiveness of the automated workflow is demonstrated with a case study on I/O bandwidth modeling and prediction.

ORCHA: A Performance Portability System for Flash-X — A Multiphysics Application Software
Description: Heterogeneous platforms demand tailored data structures, algorithm mappings, and efficient execution, often leading to numerous hard-to-maintain variants of source code. ORCHA, our performance portability orchestration system, addresses these challenges through abstractions and code generation, streamlining application development across diverse hardware.
Designed to adapt the FLASH multiphysics software for heterogeneous HPC platforms, ORCHA reduces code duplication and maintenance burdens by separating data management and parallelism from arithmetic logic. Key tools include CG-Kit for optimizing implementations, a macroprocessor for flexible arithmetic specialization, and Milhoja, a runtime for efficient graph execution.
This poster highlights performance evaluations of a shock hydrodynamics application across various hardware configurations, showcasing significant GPU performance improvements with ORCHA. We will also outline ongoing efforts to extend ORCHA's compatibility to other physics solvers, aiming to provide broader flexibility and enhanced performance in diverse computational environments.

Improving Polyhedral-Based Optimizations with Dynamic Coordinate Descent
Description: Polyhedral optimizations have been a cornerstone of kernel optimization for many years. These techniques use a geometric model of loop iterations to enable transformations like tiling, fusion, and fission. The elegance of this approach lies in its ability to produce highly efficient code through fully static optimizations. However, modern kernel schedulers typically avoid the polyhedral model, opting instead for dynamic sampling techniques, such as evolutionary searches, to generate efficient code. The polyhedral model is often bypassed because, being entirely static, it struggles to adapt to the fine details of hardware. In this work, we demonstrate that it is possible to overcome this limitation by combining the polyhedral model with a post-optimization phase based on dynamic coordinate descent, which uses minimal sampling while still achieving excellent performance.

Performance Engineering and Mesoscale-Microscale Coupling for Wind Energy Simulations
Description: Wind farm simulations require data from mesoscale atmospheric simulations as initial and boundary conditions for the microscale turbine environments. The Energy Research and Forecasting (ERF) code bridges this scale gap and provides an efficient GPU-enabled parallel implementation with adaptive mesh refinement through the underlying AMReX framework. This poster outlines strategies that reduce the communication overhead among parallel processes through runtime settings or systemic changes in memory management. These include using shared-memory parallelism over CPUs, enabling direct GPU-GPU data transfers, and implementing a separate memory pool on the GPU for communication buffers. I will present the performance scaling and improvements from these techniques as part of the poster. We are currently developing an in-memory coupling of the compressible-flow ERF code with the incompressible turbine solver ExaWind, which will allow holistic wind farm simulations.

Establishing Best Practices for Applying Inline Compressed Arrays to Improve Performance in HPC
Description: HPC applications require massive amounts of memory to process large datasets. While data compression is used to avoid bottlenecks in transmission and storage, it is still necessary to decompress the data into memory to use it. Inline compressed arrays (ICA) are a technique that keeps data compressed in application memory, decompressing blocks of data as needed. The goal is to reduce the memory footprint of big-data applications, allowing them to run on more abundant HPC nodes with less DRAM. This research uses matrix multiplication as a lens for analyzing the effects of various ICA parameters on runtime and memory usage. We construct a model for the minimum number of compressor calls needed to complete the computation, and show how careful tuning of ICA parameters achieves this minimum. Finally, we briefly discuss how our lessons learned impact other computational kernels.

Stalls and Memory Analysis on Fujitsu A64FX and NVIDIA Grace
Description: ARM-based multicore CPUs, such as NVIDIA Grace and Fujitsu A64FX, dominate contemporary HPC, featuring 32-256 cores with cache hierarchies and up to 1 TB/s memory bandwidth. While benchmarks like STREAM show similar performance across these systems, diverse applications, particularly graph and nearest-neighbor workloads (e.g., stencils), reveal distinct performance profiles. Analyzing these profiles with low-level performance data can uncover system bottlenecks. We propose a template focusing on stalls and memory accesses to identify bottlenecks efficiently by studying key CPU/memory performance events using Linux perf. Our approach engages all cores (144 for Grace, 48 for A64FX) with platform-specific compilers (ARMClang 24.04 for Grace, Fujitsu 4.10 for A64FX). This method effectively categorizes application scenarios by analyzing stalls and memory accesses, enabling quick identification of corner cases.

FortranX: Harnessing Code Generation, Portability, and Heterogeneity in Fortran
Description: Due to its historical popularity, Fortran was used to implement many important scientific applications. The complexity of these applications, along with the transition to modern high-performance languages like C++, has made modernization and optimization challenging. Significant development time is incurred to understand and optimize key algorithms and to leverage new accelerator systems. To reduce this development effort, we propose FortranX, a compiler framework that discovers and optimizes key algorithms in Fortran applications without source code modification. FortranX uses a compiler pass to recognize key algorithms, a code generation system to produce architecturally optimized kernels, and a heterogeneous runtime system to execute those kernels on various hardware platforms. We describe the design of FortranX and show initial performance results for a cyclic convolution kernel used in Poisson solvers for partial differential equations (PDEs).

Hardware-Independent Sampling Library for CPUs and (Multi-)GPUs: hws
Description: To be energy efficient and fully utilize modern hardware, it is important to gain as much insight as possible into the performance and efficiency of an application. Especially in the age of artificial intelligence, it becomes increasingly important to track, for example, the total energy consumption of an application. However, gathering this hardware information in a vendor-independent and portable way is far from trivial.
Therefore, we propose "hws", a small, easy-to-use hardware sampling library for Python and C++ that makes it extremely easy to gather hardware information such as CPU/GPU utilization, clock frequencies, power and memory consumption, and temperatures for CPUs as well as GPUs from NVIDIA, AMD, and Intel.
We further demonstrate the usefulness of our sampling library on the example of PLSSVM, a (multi-)GPU LS-SVM implementation.

Fault-Tolerant Numerical Iterative Algorithms at Scale
Description: Numerical iterative algorithms are struck by multiple error types when deployed on large-scale HPC platforms: fail-stop errors (failures) and silent errors, the latter striking both as computation errors and as memory bit-flips. Our novel approach provides efficient fault-tolerant algorithms capable of detecting and correcting them simultaneously; previous works never addressed all the error types simultaneously.
We introduce a hierarchical periodic pattern combining various general-purpose and application-specific techniques and optimize its shape in order to minimize the expected time per iteration. The derivation is intricate because optimizing a resilience period for one error type depends upon other errors possibly striking and slowing down execution progress.
A case study with the preconditioned conjugate gradient algorithm (PCG) demonstrates the good performance and flexibility of our approach, which easily adapts to different application and fault-tolerance parameter costs (e.g. iteration, verification, checkpoint, etc.).
Future work: extension to include more case studies.

Exploration of Super-Resolution Techniques for Image Compression
Description: TEZIP is a (de)compression framework leveraging PredNet, a deep neural network designed for video prediction tasks, to exploit temporal locality in time-evolving data. This study evaluates video super-resolution (VSR) models, which enhance low-resolution images by reconstructing high-resolution ones, under various compression and size-reduction techniques. Specifically, we evaluate the VRT and BasicVSR++ models across various compression techniques, including H.264 and H.265, applied to the Vimeo90K dataset. Our results, evaluated using common super-resolution image quality metrics, indicate that the VRT model consistently outperforms BasicVSR++, particularly with H.264 and H.265 compression. We observe that larger file sizes and lower compression ratios correlate with higher PSNR and SSIM values, highlighting the trade-offs between compression techniques and quality metrics in generating high-resolution images. These findings emphasize the balance needed between compression efficiency and image quality in VSR applications.
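For reference, PSNR, one of the metrics named above, compares a reconstruction against its reference via mean squared error; a small NumPy sketch:

    import numpy as np

    def psnr(reference, reconstructed, peak=255.0):
        """Peak signal-to-noise ratio in dB (peak=255 for 8-bit images)."""
        mse = np.mean((reference.astype(np.float64)
                       - reconstructed.astype(np.float64)) ** 2)
        if mse == 0:
            return float("inf")  # identical images
        return 10.0 * np.log10(peak**2 / mse)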

Seesaw: Elastic Scaling for Task-Based Distributed Programs
Description: Modern batch schedulers in HPC environments enable the shared use of available computational resources by provisioning discrete sets of resources matching user requirements. The lack of elasticity in such scenarios is often addressed using a pilot-job model in which multiple separate requests are pooled. In this work, we explore computational elasticity in a popular Python-based workflow system, Parsl. We identify limitations in the existing scaling logic and propose a new resource-aware scheduler. We show a significant improvement in the efficiency of compute resources consumed, with minimal loss in time to solution.

SanQus: Staleness and Quantization-Aware Full-Graph Decentralized Training in GNNs
Description: Graph neural networks (GNNs) have demonstrated significant success in modeling graphs; however, they encounter challenges in efficiently scaling to large graphs. To address this, we propose the SanQus system, advancing our previous work, Sancus. SanQus reduces the need for expensive communication among distributed workers by utilizing staleness- and quantization-aware broadcasting. SanQus manages embedding staleness, skips unnecessary broadcasts, and treats decentralized GNN processing as sequential matrix operations. To further reduce communication, SanQus caches historical embeddings and performs quantization-aware broadcast. Theoretically, SanQus demonstrates bounded approximation errors and optimal convergence rates. Extensive experiments on big graphs with common GNN models show that SanQus reduces communication by up to 86% and triples throughput without sacrificing accuracy, outperforming state-of-the-art systems.

Parallel Verification of Neural Networks Applied to Medical Imaging
Description: Neural network verification provides model robustness guarantees in the presence of noise. We generate verification specifications for medical imaging models based on the U-Net architecture and solve pixel-by-pixel verification problems on a massive scale. The efficiency of solving this NP-complete problem is studied using α,β-CROWN. We implement per-pixel parallelization and demonstrate orders-of-magnitude speedup, allowing faster characterization or increased timeout values for greater solving capability.

Power Patterns: Understanding the Energy Dynamics of I/O for Parallel Storage Configurations
Description: As HPC applications become more I/O intensive, understanding their power consumption patterns is necessary to develop energy-saving solutions. Here, we evaluate the energy consumption of I/O operations on two popular HPC parallel file systems: Lustre and DAOS. We develop models to predict the energy usage of sequential writes and evaluate their accuracy against our gathered benchmarks. Our models can be used to enhance the accuracy of energy-predicting frameworks by allowing them to consider storage configuration when estimating total energy consumption.

Meteorologic Real-Time Extreme Learning Machine for Pressure Prediction
Description: Significant advances in weather prediction have stemmed primarily from combining observational data, sophisticated modeling techniques, and analysis of simulated or historical weather data. Additionally, the applicability of machine learning-based applications on edge devices has expanded, addressing a variety of use cases, including weather prediction. This study builds on prior research using an Extreme Learning Machine (ELM) approach to detect weather anomalies in real time, enhancing existing predictive systems. Our model is implemented on IBIS, an adaptable edge computing framework for multi-sensor data collection. The ELM model, applied to detect real-time weather anomalies, offers fast training and operational efficiency. Data on atmospheric phenomena, including pressure and wind, was generated and stored in a time-series database, and the model was trained on 80% of 550,000 records. Our experiments demonstrated an R² score of 92%, supporting the model's effectiveness. Our work within IBIS represents a cost-effective and scalable solution for collecting, monitoring, and predicting hazardous atmospheric conditions.
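The ELM recipe itself is compact: a fixed random hidden layer followed by a least-squares solve for the output weights, which is what makes training fast. A minimal NumPy sketch (the layer size and activation are assumptions, not the study's configuration):

    import numpy as np

    rng = np.random.default_rng(42)

    def elm_fit(X, y, hidden=128):
        """Fit an ELM: random hidden layer, least-squares output weights."""
        W = rng.normal(size=(X.shape[1], hidden))  # fixed random input weights
        b = rng.normal(size=hidden)                # fixed random biases
        H = np.tanh(X @ W + b)                     # hidden-layer activations
        beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # only trained parameters
        return W, b, beta

    def elm_predict(X, W, b, beta):
        return np.tanh(X @ W + b) @ beta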

On the Accuracy and Efficiency of Approximate Triangle Counting via Randomized Numerical Linear Algebra
Description: We study two algorithmic approaches to approximate triangle counting and compare their accuracy and efficiency. The first is based on randomized matrix-matrix multiplication, which can be faster, simpler, and more parallelizable on modern processors. The second is based on trace estimation, which produces estimates with lower variance and greater accuracy.
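Both approaches rest on the identity that the triangle count of a graph with symmetric adjacency matrix A equals tr(A^3)/6. A hedged NumPy sketch of the trace-estimation flavor, here with a Hutchinson-style estimator (not necessarily the authors' exact scheme):

    import numpy as np

    def triangles_exact(A):
        """Triangle count via tr(A^3) / 6 for a symmetric 0/1 adjacency matrix."""
        return np.trace(A @ A @ A) / 6

    def triangles_estimate(A, samples=200, seed=0):
        """Hutchinson estimator: E[z^T A^3 z] = tr(A^3) for random +/-1 z."""
        rng = np.random.default_rng(seed)
        n, total = A.shape[0], 0.0
        for _ in range(samples):
            z = rng.choice([-1.0, 1.0], size=n)
            Az = A @ z
            total += Az @ (A @ Az)  # z^T A^3 z without forming A^3
        return total / samples / 6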

Optimal Client Selection Algorithms for Federated Learning
Description: Due to the heterogeneity of resources and data, client selection plays a paramount role in the efficacy of Federated Learning (FL) systems: the time taken by a training round is determined by the slowest client, and energy consumption and carbon footprint are primary concerns. In this context, we propose two optimal time- and energy-aware client selection algorithms for FL: MEC and ECMTC. To the best of our knowledge, this work is the first to propose algorithms that make an optimal selection of clients with heterogeneous resources by jointly optimizing execution time and energy consumption while defining how much data each client should use locally.
During the presentation, I will discuss the challenges of selecting clients in FL systems, present our approach through an illustrative example, and then show the experimental evaluation carried out on an HPC platform and the takeaways of our investigation.

Machine Learning Applications for Early-Stage Ovarian Cancer Diagnosis
Description: Ovarian cancer (OC) significantly impacts women's health and, despite its prevalence, remains without a definitive cure. Early detection is crucial for improving treatment outcomes and reducing mortality rates and healthcare system costs. Leveraging advancements in machine learning, our study seeks to empower physicians with tools for more confident and timely diagnosis. This study introduces a novel machine learning approach to enhance early-stage OC diagnosis: the Data Driven Diagnosis Framework (DDD), a new feature extraction and ensemble method that improves classification accuracy. Using models such as Random Forest, Logistic Regression, Decision Tree, Gradient Boosting Machine, Extreme Gradient Boosting Machine, and language models, our approach shows accuracy improvements of 14%-28% over state-of-the-art methods.

A Comparison Study of Open Source LLMs for HPC Ticket Answering
Description: We are designing an automatic ticket answering service for computing centers such as the Texas Advanced Computing Center (TACC), the National Center for Supercomputing Applications (NCSA), and the San Diego Supercomputer Center (SDSC). In this work, we investigate the capability and feasibility of open-source large language models (LLMs) for the ticket answering task. We compare four open-source LLMs (OPT-6.7B, Falcon-7B, Llama 2-7B, and Llama 3.1-8B) by fine-tuning them on a curated dataset of over 110,000 historical question/answer pairs. Our results show that fine-tuned LLMs are capable of generating reasonable answers. Llama 2-7B has a lower validation loss and perplexity than OPT-6.7B and Falcon-7B. We also observe that fine-tuning with LoRA introduces non-trivial generalization loss compared with dense fine-tuning. We will design an evaluation dataset and perform quantitative evaluation of these LLMs in the future.
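To illustrate the LoRA setup contrasted with dense fine-tuning, a hedged sketch using Hugging Face's peft library (the checkpoint, rank, and target modules are assumptions, not the study's configuration):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    # LoRA trains small low-rank adapters instead of updating all dense weights.
    config = LoraConfig(
        r=8,                                  # adapter rank (assumed)
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically well under 1% of all weights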

MatRIS: Performance Portable Math Library of IRIS Runtime for Multi-Device Heterogeneity
Description: We present recent efforts on MatRIS, the performance-portable math library of the IRIS runtime for multi-device heterogeneity. MatRIS provides dense linear algebra capabilities, BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), across the different back ends available in the IRIS runtime, enabling the same MatRIS code to run efficiently on multi-device heterogeneous targets. MatRIS provides standard BLAS/LAPACK APIs. The philosophy of MatRIS is: implement once, deploy anywhere. The algorithms implemented in MatRIS are serial-like and architecture-agnostic, elevating programming productivity on heterogeneous systems. While ensuring portability, MatRIS provides competitive or even better performance than state-of-the-art open-source and vendor solutions, such as the DPLASMA, Chameleon, and NVIDIA cuSolverMG libraries.

Performance of Inline Compression with Software Caching for Reducing the Memory Footprint in pySDC
Description: The volume of data required for high performance computing (HPC) jobs is growing faster than the memory available to store it, leading to performance bottlenecks. Hence the need for inline data compression, which reduces the amount of allocated memory by keeping all data in compressed form and decompressing/recompressing single variables as needed. We apply inline compression to the HPC application pySDC, a framework that solves collocation problems iteratively using parallel-in-time methods. We introduce a new version of pySDC with a compression manager that adds inline compression functionality, along with a software cache that stores the decompressed state of the most frequently used variables. We use the lossy compressor ZFP and test our model with varying software cache sizes. Results show that having no cache gives the best compression ratio, but a cache of size 16 improves the timing while also slightly improving the memory footprint.
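A toy sketch of the idea, using the zfpy bindings for ZFP plus a small LRU cache of decompressed variables (the cache policy and interface here are assumptions for illustration, not pySDC's actual compression manager):

    from collections import OrderedDict
    import numpy as np
    import zfpy  # Python bindings for the ZFP lossy compressor

    class CompressedStore:
        """Keep arrays compressed in memory; cache a few decompressed ones."""

        def __init__(self, cache_size=16, tolerance=1e-6):
            self.blobs, self.cache = {}, OrderedDict()
            self.cache_size, self.tolerance = cache_size, tolerance

        def put(self, name, arr):
            self.blobs[name] = zfpy.compress_numpy(arr, tolerance=self.tolerance)
            self.cache.pop(name, None)  # invalidate any stale cached copy

        def get(self, name):
            if name in self.cache:
                self.cache.move_to_end(name)   # mark as recently used
                return self.cache[name]
            arr = zfpy.decompress_numpy(self.blobs[name])
            self.cache[name] = arr
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)  # evict least recently used
            return arr

    store = CompressedStore()
    store.put("u", np.linspace(0.0, 1.0, 10_000))
    print(store.get("u").shape)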

GPU Compression (for Scientific Data) Done Right
Description: Error-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. Compared to CPU-based compressors, GPU-based compressors exhibit substantially higher throughput, fitting today's HPC applications better. To overcome the data challenge, GPU-based scientific lossy compressors have been created; notably, cuSZ has been proposed as an error-bounded compression framework and has become the design base of subsequent work. A plethora of derived work has been proposed, leading to discussion of optimality considering data quality, compression ratio, and data processing speed. This paper covers three new research directions: a compressibility study, a new encoding study, and an applicability study.

Design of Reliable and Efficient Syscall Hooking Library for a Parallel File System
Description: To provide file systems in user space, FUSE and syscall interception libraries have been used. However, FUSE suffers from performance degradation, and syscall interception libraries have reliability and portability problems.
This study proposes the design and implementation of a reliable and efficient syscall hooking library using zpoline, a syscall hooking mechanism based on binary rewriting. To support POSIX interfaces in user space, it completely replaces all required system calls with file system function calls. The proposed method achieved performance comparable to the native API and performed 5.3 to 6.4 times better than FUSE.

Computational Radiation Hydrodynamics with FleCSI
Description: Photons are essential in many matter-light interaction systems, with applications in physics (e.g., core-collapse supernovae) and engineering (e.g., neutron transport). Although the theory of radiating fluids has been established since the 1980s, developing robust and efficient numerical simulations remains an active research area.
Using the FleCSI framework, we developed a radiation hydrodynamics code called HARD (Hydrodynamics And Radiative Diffusion), which is scalable and portable across various HPC architectures and computing systems. FleCSI provides us with a task-based parallelism framework supported by different backends such as MPI, HPX, and Legion. The on-node parallelism is then handled through Kokkos targeting CPU and GPU architectures.
This poster presents both our achievements in the physics implementation and its adaptation to the task-based model, and also the benchmark of the different backends using FleCSI on Los Alamos National Laboratory supercomputers.

Characterizing the Performance of the GENE-X Code for Gyrokinetic Turbulence Simulations
Description: Simulating plasma turbulence in the edge region of a magnetic confinement fusion (MCF) device is crucial for identifying optimal operational scenarios for future fusion energy commercialization. GENE-X, an Eulerian electromagnetic gyrokinetic code, can simulate plasma turbulence throughout an MCF device, including the edge region. This work focuses on characterizing GENE-X's performance characteristics, such as the elapsed time during the time integration phase, memory usage, and file I/O. Two cases with different MPI decomposition schemes are analyzed using GENE-X's built-in profiler, along with profiling and monitoring tools such as IPM and Darshan. This study aims to provide a preliminary view of the HPC characteristics of the code to assist future optimization efforts.

Turbocharging Dask Apps: Accelerating Data Flow with ProxyStore
Description: Despite advancements in distributed computing libraries, performance challenges such as data serialization and transfer persist. We focus on understanding data-movement limitations within Dask, a versatile and popular Python library for distributed and parallel computing, and then investigate the pass-by-proxy paradigm implemented by ProxyStore as a way to address these inefficiencies. By integrating ProxyStore, we streamline data flow in Dask applications, reducing data serialization and scheduler overheads.
Our approach evaluates the impact of proxies on data transfer times and overall computational efficiency. We find that our integration reduces task overheads by 5-6x on a real machine learning application.
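To make the pattern concrete, here is a minimal, hypothetical sketch of pass-by-proxy with Dask (the Store and FileConnector calls follow ProxyStore's public API; the store name, path, and workload are invented for illustration):

    # Large inputs go into a ProxyStore store; Dask tasks receive
    # lightweight proxies that resolve to the real data lazily.
    import numpy as np
    from dask.distributed import Client
    from proxystore.connectors.file import FileConnector
    from proxystore.store import Store

    def column_means(array):
        # The proxy resolves to the real array on first use.
        return array.mean(axis=0)

    if __name__ == "__main__":
        client = Client(processes=False)  # local Dask cluster for the demo
        store = Store("demo", FileConnector("/tmp/proxystore-demo"))
        data = np.random.rand(10_000, 64)
        proxy = store.proxy(data)          # cheap to serialize and ship
        future = client.submit(column_means, proxy)
        print(future.result().shape)       # (64,)
        store.close()
        client.close()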

NetCDFaster: A Geospatial Cyberinfrastructure Enhancing Multi-Dimensional Scientific Dataset Access and Visualization Through Machine Learning Optimization
Description: This project introduces an enhanced solution for accessing and processing NetCDF data, a widely used standard in geosciences for storing multidimensional data. Existing tools often compromise on performance or lack full workflow support. The proposed system integrates machine learning, specifically a CatBoost classifier, with a modern web application to improve the speed and accuracy of data querying and visualization. It provides a user-friendly interface for uploading NetCDF files and extracting metadata efficiently. Experimental results demonstrate a 64% F1-score in selecting optimal parameters and up to 80% improvement in processing time, significantly aiding scientific analysis.
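As an illustration of the approach, here is a hedged sketch of training a CatBoost classifier to choose a processing strategy from file metadata; the feature names and labels below are invented for the example and are not the project's actual schema:

    # Train a classifier that maps NetCDF metadata to a processing strategy.
    from catboost import CatBoostClassifier

    # Each row: [n_dims, n_time_steps, grid_points, var_size_mb]
    X = [[3, 120, 64_800, 250.0],
         [4, 1_440, 1_036_800, 4_100.0],
         [3, 12, 16_200, 18.5]]
    y = ["chunked", "parallel", "in_memory"]   # hypothetical optimal strategy

    model = CatBoostClassifier(iterations=200, depth=4, verbose=False)
    model.fit(X, y)
    print(model.predict([[3, 365, 64_800, 900.0]]))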

Assessing the Impact of Real-Time Traffic Updates on Traffic Flow: A High-Performance Computing Perspective on Scalability and Demand
Description: In the dynamic landscape of smart cities, it is necessary to harness the potential of real-time traffic data and high-performance computing to optimize traffic flow through dynamic re-routing strategies. Our research assesses how real-time traffic optimization and alternative-route computation influence the overall improvement of traffic flow within cities. Our experimental scenarios cover various traffic modeling and computational conditions and deliver a scalability analysis and evaluation of several simulations on high-performance computing. Our approach involves simulations in which different portions of vehicles dynamically adjust routes based on real-time traffic information. Scalability tests with varying numbers of computational workers and nodes assess our traffic simulator's capacity for scaling. One of our main findings shows that informed management of live traffic data and selective alternative-route computation can have a significant impact on the overall driving time and traffic flow within a city.

PINE: Efficient Yet Effective Piecewise Linear Trees
Description: Decision trees are popular in statistics and machine learning. Piecewise linear trees, a type of model-based decision tree, employ linear models to evaluate splits and predict outcomes at the leaf nodes. While they can offer high accuracy, they are computationally expensive, and no scalable implementations currently exist that do not harm accuracy.
We introduce PINE, an efficient yet effective approach for training piecewise linear trees, incorporating various algorithmic and system optimizations. These optimizations enable fast training on multicore CPUs without sacrificing model accuracy. We also present PINEBoost, which applies gradient boosting to PINE, and compare its performance with existing frameworks. Experimental results demonstrate that PINE and PINEBoost achieve superior accuracy and faster convergence rates on general regression datasets compared to state-of-the-art gradient-boosted decision trees.
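For intuition, the following toy sketch (not PINE itself) fits a depth-one piecewise linear tree: it scans split thresholds on one feature and fits a separate linear model in each leaf, which is the structure that distinguishes these trees from constant-leaf decision trees:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(400, 1))
    y = np.where(X[:, 0] < 0, 1.0 + 3.0 * X[:, 0], 2.0 - 0.5 * X[:, 0])

    best = None
    for t in np.linspace(-1.5, 1.5, 31):          # scan candidate thresholds
        left, right = X[:, 0] < t, X[:, 0] >= t
        if left.sum() < 5 or right.sum() < 5:
            continue
        m_l = LinearRegression().fit(X[left], y[left])    # leaf model, left
        m_r = LinearRegression().fit(X[right], y[right])  # leaf model, right
        sse = (((m_l.predict(X[left]) - y[left]) ** 2).sum()
               + ((m_r.predict(X[right]) - y[right]) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, t)

    print(f"chosen split x < {best[1]:.2f}, SSE = {best[0]:.4f}")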

JACC: HPC Meta-Programming and Performance Portability Ecosystem for Julia Language
Description: We present JACC (Julia for ACCelerators), the first high-level, meta-programming, performance-portable model for the just-in-time, LLVM-based Julia language. JACC provides a unified and lightweight front end across the different back ends available in Julia, enabling the same Julia code to run efficiently on many CPU and GPU targets. We evaluated the performance of JACC for common HPC kernels as well as for the most computationally demanding kernels used in applications such as HARVEY, a blood flow simulator that assists in the diagnosis and treatment of patients suffering from vascular diseases. We carried out the performance analysis on the most advanced U.S. DOE supercomputers: Aurora, Frontier, and Perlmutter. Overall, we show that JACC has negligible overhead versus vendor-specific solutions, reporting GPU speedups over the CPU implementations at no extra cost.

Profiling the Impact of Hyper-Threading on Pagosa Hydrocodes
Description: Pagosa is a hydrodynamics code designed for massively parallel environments, operating within an Eulerian framework on a fixed Cartesian mesh. This project investigates the performance of Pagosa on Sapphire Rapids nodes, which feature many-core architectures and high-bandwidth memory. By conducting strong and weak scaling studies, we aim to evaluate the impact of hyper-threading and propose modifications to Pagosa's MPI environment, including a hybrid MPI and OpenMP parallel decomposition. The anticipated outcomes will inform strategies for optimizing Pagosa, ensuring it remains capable of tackling complex computational problems efficiently.

Neural Network Optimization and Performance Analysis for Real-Time Object Detection at the Edge
Description: Real-time object detection is an important and computationally intensive task that is gaining attention in the field of autonomous systems. Recently, a novel object detection algorithm called RT-DETR has emerged, demonstrating superior speed compared to the popular YOLO series. In recent years, many edge devices optimized for artificial intelligence have been developed, allowing for faster model inference. Our study uses NVIDIA TensorRT to optimize models for object tracking and detection on NVIDIA's Orin device. Our best-performing model is the FP16 model with DLA, with an average inference time of 19.9416 milliseconds and a throughput of 50.1465 frames per second. This is a five-fold improvement over the standard unoptimized PyTorch FP32 model, with practically no sacrifice in accuracy. Our study shows that applying TensorRT and quantization to object tracking and detection on NVIDIA's Orin device is effective in reducing prediction time, allowing for faster detection.
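A hedged sketch of the usual optimization path on such devices: export a PyTorch model to ONNX, then build an FP16 engine with NVIDIA's trtexec tool. The ResNet-18 stand-in and file names are placeholders; exporting RT-DETR involves model-specific details not shown here.

    import torch
    import torchvision

    # Placeholder network standing in for the detector.
    model = torchvision.models.resnet18(weights=None).eval()
    dummy = torch.randn(1, 3, 640, 640)
    torch.onnx.export(model, dummy, "model.onnx", opset_version=17,
                      input_names=["images"], output_names=["preds"])

    # On the Orin, an FP16 engine can then be built with the trtexec CLI:
    #   trtexec --onnx=model.onnx --fp16 --saveEngine=model_fp16.engine
    # DLA offload is requested with --useDLACore=0 --allowGPUFallback.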

Exploiting Data Compression and Low Precision for Exascale Fusion Turbulence Simulations
Description: Future exascale systems will feature unprecedented computing power, exceeding 10^18 FLOPS, provided by thousands of heterogeneous computing nodes. To fully harness this potential, applications must scale effectively and use these computing architectures efficiently.
Many performance losses on large HPC systems originate from inefficiencies in data movement. On exascale systems, data set sizes also increase, leading to challenges in (a) migrating data through memory hierarchies, (b) communicating data between distributed memory components, and (c) storing data in file systems.
This poster shows the ongoing effort in the DaREXA-F project to address these issues in the plasma turbulence code GENE through measures such as (a) mixed precision and novel data formats, (b) data compression, and (c) new network hardware. We present our current findings on mixed precision and lossy data compression for efficient computation without a reduction in accuracy.
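As a toy illustration of the two ideas (not the DaREXA-F code), the snippet below casts a field to half precision and applies error-bounded quantization, then checks that the bound holds, which is the basic contract of error-bounded lossy compression:

    import numpy as np

    field = np.random.rand(1_000_000)

    # Mixed precision: a half-precision copy of the field.
    half = field.astype(np.float16)
    print("fp16 max error:", np.abs(field - half.astype(np.float64)).max())

    # Error-bounded quantization against an absolute bound eps.
    eps = 1e-4
    quantized = np.round(field / (2 * eps)).astype(np.int32)
    restored = quantized.astype(np.float64) * (2 * eps)
    assert np.abs(field - restored).max() <= eps
    print("bound", eps, "respected with int32 codes")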

Enhancing HPC Resource Management to Integrate Quantum Workflows
Description: In recent years, quantum computing has demonstrated the potential to revolutionize specific algorithms and applications by solving problems exponentially faster than classical computers. However, its widespread adoption for general computing remains a future prospect. In this work, we demonstrate the integration of quantum computing within high-performance computing (HPC) environments. We developed a resource management framework that streamlines the use of quantum simulators and enhances the runtime performance and workflow efficiency of hybrid HPC/QC applications.

LM-Offload: Performance Model-Guided Generative Inference of Large Language Models with Parallelism Control
Description: Large language models (LLMs) have achieved remarkable success in various natural language processing tasks. However, LLM inference is highly compute- and memory-intensive, creating extreme deployment challenges. Tensor offloading, combined with tensor quantization and asynchronous task execution, provides a feasible solution by utilizing host memory to enable large-scale LLM inference with a limited number of GPUs. However, existing approaches struggle to fully utilize all available computational and memory resources because they do not consider (1) when to use quantization effectively, or (2) how to manage thread-level parallelism within and across tasks. As a result, they provide suboptimal solutions. In this work, we introduce LM-Offload, a framework that addresses these challenges by leveraging performance modeling and parallelism control. Experimental results demonstrate that LM-Offload outperforms FlexGen and ZeRO-Inference, two state-of-the-art systems for LLM inference, by up to 2.95× (2.34× on average) and 2.88× (1.57× on average), respectively, in inference throughput.
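For intuition, a minimal sketch of the underlying tensor-offloading idea (not LM-Offload's implementation): layer weights stay in pinned host memory and are copied to the GPU just before use, then evicted to free memory for the next layer.

    import torch

    def run_offloaded(layers, x, device):
        # Copy each layer in just before it is needed, then evict it.
        for layer in layers:
            layer.to(device, non_blocking=True)
            x = layer(x)
            layer.to("cpu")
        return x

    if torch.cuda.is_available():
        layers = [torch.nn.Linear(4096, 4096) for _ in range(8)]
        for layer in layers:               # pinned memory speeds H2D copies
            for p in layer.parameters():
                p.data = p.data.pin_memory()
        x = torch.randn(16, 4096, device="cuda")
        print(run_offloaded(layers, x, "cuda").shape)  # torch.Size([16, 4096])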

QDD: Multi-Node Implementation of Decision Diagram-Based Quantum Circuit Simulator with Ring Communication and Auto SWAP Insertion
Description: This poster introduces QDD, a multi-node implementation of a decision diagram (DD)-based quantum circuit simulator. DD-based simulators offer faster simulation of algorithms like Shor's compared to statevector (SV) simulators by compressing the quantum state into a graph representation. However, parallelizing DD-based simulators has been challenging due to their dynamic data structures.
QDD addresses this by distributing the quantum state across multiple nodes and using ring communication to minimize communication overhead. Automatic SWAP gate insertion further optimizes communication. Experiments show QDD significantly outperforms an SV-based simulator in simulating Shor's algorithm. With 256 nodes, QDD achieves up to 10x faster runtime compared to a single-node implementation. The experiments also examined the number of processes per node and found that one process per node was preferable unless rack-to-rack communication occurred.
The poster explains the background of quantum simulation and decision diagrams, and then presents the multi-node method and experimental results in detail.
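A toy mpi4py sketch of ring communication (illustrative only, not QDD's kernels): each rank holds a chunk of a distributed vector and forwards it around the ring, so every rank eventually sees every chunk using only neighbor-to-neighbor traffic:

    # Run with: mpiexec -n 4 python ring.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    right, left = (rank + 1) % size, (rank - 1) % size

    chunk = np.full(4, rank, dtype=np.float64)  # this rank's piece of state
    buf = np.empty_like(chunk)
    for step in range(size - 1):
        comm.Sendrecv(chunk, dest=right, recvbuf=buf, source=left)
        chunk, buf = buf, chunk                 # forward what we received
        # ... combine `chunk` into local amplitudes here ...
    comm.Barrier()
    if rank == 0:
        print("ring complete on", size, "ranks")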

PerfFlowAspect: A User-Friendly Performance Tool for Scientific Workflows
Description: In scientific computing, artificial intelligence and machine learning (AI/ML) appear increasingly often in scientific workflows on supercomputers, owing to their ability to solve more complex problems. The nature of these workflows creates new requirements for performance analysis tools, in particular incentivizing lower integration costs and support for more diverse codes.
In response, we introduce PerfFlowAspect, which approaches this problem with reduced instrumentation costs, support for C/C++ and Python code bases, and multiple trace formats covering multiple workflow components. To evaluate its effectiveness, we consider the use case of AMS, a complex application that simplifies the integration of machine learning surrogate models into HPC codes.
PerfFlowAspect is an open-source tool under active research and development. At the poster session, the author will walk through the data, text, and figures in detail.

Predicting Dataset Popularity for Improved Distributed Content Caching in High Energy Physics
Description: In High Energy Physics (HEP), large-scale experiments generate massive amounts of data that are distributed globally. To reduce redundant data transfers and improve analysis efficiency, a disk caching system named XCache is used to manage data accesses. By analyzing 11 months of access logs (4.5 million requests), we identified patterns in dataset usage and developed a predictive model to forecast the popularity of frequently accessed datasets.
Based on extensive exploratory data analysis, we found that pinning the most popular datasets in the cache could significantly improve access efficiency, and we implemented an LSTM model to predict dataset accesses and optimize cache policies.
The model demonstrates strong predictive performance, with a low mean relative error of 0.779 across training and test datasets. Future work will incorporate anomaly detection techniques to improve robustness. This study highlights the potential of LSTM models in optimizing distributed content caching in HEP.
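A hedged sketch of such a predictor: a small PyTorch LSTM mapping a window of past access counts to the next period's count. The window length, layer sizes, and random training data below are illustrative only.

    import torch
    import torch.nn as nn

    class PopularityLSTM(nn.Module):
        def __init__(self, hidden=32):
            super().__init__()
            self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                                batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, x):                 # x: (batch, window, 1)
            out, _ = self.lstm(x)
            return self.head(out[:, -1, :])   # predict next access count

    model = PopularityLSTM()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    history = torch.rand(64, 24, 1)           # 64 datasets, 24 past periods
    target = torch.rand(64, 1)
    for _ in range(100):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(history), target)
        loss.backward()
        opt.step()
    print("final MSE:", float(loss))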

Enhancing the Traditional Benchmarks for Parallel Computing Education
Description: This work introduces enhanced benchmark suites — NeoRodinia, RedOptBench, and CUDAMicroBench — specifically designed to enrich the educational landscape of parallel programming. By integrating practical examples and detailed optimization processes into traditional benchmarks, these suites could illuminate performance-limiting issues, identify inefficient patterns, and clarify the steps involved in optimization. They aim to demystify the complexities of parallel programming for beginners, fostering a deep and practical understanding of the subject. Serving dual roles as performance evaluators and comprehensive educational tools, these suites effectively demonstrate tangible performance improvements and optimization techniques, thereby enhancing both theoretical knowledge and practical skills in parallel programming.

FAS-GED: GPU-Accelerated Graph Edit Distance Computation
Description: Graph Edit Distance (GED) is a fundamental metric for assessing graph similarity, with critical applications across domains including bioinformatics, classification, and pattern recognition. However, the exponential computational complexity of GED has hindered its adoption for large-scale graph analysis. This poster presents FAS-GED, a GPU framework for fast and accurate GED computation. FAS-GED achieves significant performance gains by optimizing memory accesses and minimizing data transfer while maintaining high accuracy. FAS-GED shows up to a 300x speedup over its CPU-based implementation on a 48-core AMD EPYC system. Our approach surpasses existing methods in speed and precision, demonstrating up to a 55x speedup over the NetworkX library for small graphs and reaching optimal solutions in 94% of cases. FAS-GED is a step toward unlocking the potential of GED for large-scale graph analysis in real-world applications.
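For reference, exact GED on small graphs can be computed with NetworkX, the CPU baseline mentioned above; a minimal example:

    import networkx as nx

    G1 = nx.cycle_graph(5)
    G2 = nx.path_graph(5)
    print("exact GED:", nx.graph_edit_distance(G1, G2))

    # For larger graphs the exact search explodes combinatorially;
    # optimize_graph_edit_distance yields successively better upper bounds.
    approx = next(nx.optimize_graph_edit_distance(G1, G2))
    print("first upper bound:", approx)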

Scored Non-Deterministic Finite Automata Processor for Sequence Alignment
Description: With the surge in symbolic data across various fields, efficient pattern matching and regular expression processing have become crucial. Non-deterministic Finite Automata (NFA), commonly used for pattern matching, face memory bottlenecks on general-purpose platforms. This has driven interest in Domain-Specific Architectures (DSAs), such as FPGAs and ASICs, for their efficiency. Modern applications require identifying the best match path, as in DNA sequence alignment. This work enhances FPGA-based automata processors to report the best sequence alignment score by integrating weights into automata transitions. Challenges include increased state-space complexity and memory requirements. The proposed NAPOLY+ design incorporates score values and arithmetic components to manage scores, balancing performance and resource use. Evaluation on a Zynq UltraScale+ FPGA showed high device utilization and scalability, with preliminary results focusing on end-to-end design evaluation.

Large-Scale Randomized Program Generation with Large Language Models
Description: Large, diverse datasets of executable programs are required for training and running machine learning models that find insights in program performance. While many open-source code repositories exist freely on popular software development websites such as GitHub, the safety and executability of such programs cannot be guaranteed. To bridge this gap, this study proposes LLMRPG (Large Language Model-based Randomized Program Generator), a program generator that harnesses open-source large language models (LLMs) fine-tuned for code generation to produce error-free, executable, and human-like programs on demand. The performance of LLMRPG was evaluated across popular open-source LLMs using heuristics such as the semantic similarity between programs and the proportion of compilable and executable programs generated. Analysis of the generated programs demonstrates satisfactory compilability and executability, as well as high diversity.

Comparing Cache Utilization Trends for Regional Scientific Caches with Transfer Learning Models
Description: To enhance data sharing and reduce access latency in scientific collaborations, high energy physics (the LHC CMS experiment) employs regional in-network storage caches. Accurate predictions of cache utilization trends help design new caching policies and improve capacity planning. This study leverages the SoCal cache access trends to improve prediction for the newer caches in Chicago and Boston through transfer learning. We also investigate the impact of doubling the Chicago cache's storage capacity on its cache hit rate.

Analyzing Alltoall Algorithms with SST
Description: Alltoall collective operations in MPI are critical in several types of computation, including matrix multiplication, matrix transposition, and machine learning applications. As a result, these operations must be performant for large amounts of data. Meanwhile, dragonfly networks are becoming more common in state-of-the-art supercomputers, yet there has been little analysis of the performance of alltoall operations on these networks. The hierarchical and modular nature of dragonfly networks creates distinct challenges for alltoall operations, and typical alltoall algorithms fail to account for topology. In this poster, we analyze the performance of alltoall algorithms in four scenarios and discuss the conditions under which each algorithm performs best.
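For concreteness, a minimal alltoall in mpi4py (run under mpiexec): each rank sends a distinct block to every other rank, which is the traffic pattern whose routing behavior on dragonfly topologies is analyzed here.

    # Run with: mpiexec -n 4 python alltoall.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    block = 4                                      # elements sent per peer
    send = np.arange(size * block, dtype=np.int64) + rank * 1000
    recv = np.empty(size * block, dtype=np.int64)
    comm.Alltoall(send, recv)

    # recv now holds one block from every rank, ordered by source rank.
    print(rank, recv[:block], "...from rank 0")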

Enhancing Performance Reproducibility on HPC Workflows
Description: Performance reproducibility means achieving consistent performance across multiple runs of the same application in an identical computing environment: jobs should run for equal time, with equal performance, repeatedly. Without this, we see significant variations in performance that can undermine the reliability of scientific results. The complexity and scale of HPC workflows present unique challenges to achieving consistent performance across repeated runs. We seek to provide researchers with Findable, Accessible, Interoperable, and Reusable (FAIR) data; ensuring the “FAIRness” of (meta)data can reduce barriers to reproducibility by making this information easier to find, interpret, programmatically access, and reuse in new contexts. We are therefore exploring the process of analyzing performance data and seek to integrate our findings into the RECUP framework for reproducibility, covering data sources, a repository for saving intermediate results, and user analysis of performance and result reproducibility.

A Novel Gradient Compression Design with Ultra-High Compression Ratio for Communication-Efficient Federated Learning
Description: Federated learning is a privacy-preserving machine learning approach. It allows numerous geographically distributed clients to collaboratively train a large model while maintaining local data privacy. In heterogeneous device settings, limited network bandwidth is a major bottleneck that constrains system performance. In this work, we propose a novel gradient compression method for federated learning that achieves communication efficiency and a low error floor by estimating a prototype of the gradients on both the server and client sides and sending only the difference between the real gradient and the estimated prototype. This approach reduces the total bits required for model updates. Additionally, the memory requirement is lighter on the client side but heavier on the server side compared to traditional error-feedback methods. Experiments on training neural networks show that our method is more communication-efficient, with little impact on training and test accuracy.
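A NumPy sketch of the prototype-difference idea (the paper's estimator is more elaborate and is not reproduced here): both sides keep a shared prototype of the gradient, and only a cheaply encoded residual between the true gradient and the prototype is shipped.

    import numpy as np

    rng = np.random.default_rng(1)
    proto = np.zeros(1_000)                 # shared client/server estimate

    def compress_round(grad, proto, beta=0.9, scale=0.01):
        residual = grad - proto                        # usually small
        q = np.sign(residual).astype(np.int8)          # 1-bit payload + scale
        decoded = proto + scale * q                    # what the server sees
        proto = beta * proto + (1 - beta) * decoded    # both sides update
        return q, decoded, proto

    for step in range(5):
        grad = 0.1 * rng.standard_normal(1_000) + 0.05
        q, decoded, proto = compress_round(grad, proto)
        print(step, "decode error:", float(np.abs(grad - decoded).mean()))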

PipeInfer: Accelerating LLM Inference Using Asynchronous Pipelined Speculation
Description: Inference of large language models (LLMs) across computer clusters has become a focal point of recent research, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce memory bandwidth requirements, but also increase latency per inference run, requiring high speculation acceptance rates to improve performance. As a remedy, we propose PipeInfer, a pipelined speculative acceleration technique that reduces inter-token latency and improves system utilization while also improving tolerance to low speculation acceptance rates and low-bandwidth interconnects. PipeInfer exhibits up to a 2.15× improvement in generation speed over standard speculative inference. PipeInfer achieves its improvement through Continuous Asynchronous Speculation and Early Inference Cancellation; the former improves latency and generation speed by running single-token inference simultaneously with several speculative runs, while the latter improves speed and latency by skipping the computation of invalidated runs.

Breaking the Barriers to Effective Supercomputing: Web Dashboard for Job Accounting and Performance Metrics
Description: The NSF-funded Anvil supercomputer, built and maintained by the Purdue University Rosen Center for Advanced Computing, enables efficient research computing across a variety of scientific domains nationwide. In addition to the traditional terminal interface, Anvil uses the open-source web portal framework Open OnDemand to provide a low-barrier web interface to the system. In this poster we describe enhancements made to Anvil's Open OnDemand dashboard that provide a clean, well-structured, and extensible interface for researchers to visualize useful information about their utilization of Anvil without accessing the terminal. Our enhancements include apps that provide detailed statistics about user jobs and their performance metrics while also focusing on data-query performance. These apps will enable users to identify and debug noticeably resource-inefficient jobs, as well as improve the queue wait time and efficiency of their jobs in the future.

A Sparse Approach for Translation-Based Training of Knowledge Graph Embeddings
Description: Knowledge graph (KG) learning offers a powerful framework for generating new knowledge and making inferences. Training KG embeddings can take a significantly long time, especially for larger datasets. Our analysis shows that the gradient computation of embeddings and vector normalization are the dominant functions in the KG embedding training loop. We address this issue by replacing the core embedding computation with SpMM (sparse-dense matrix multiplication) kernels. This allows us to unify multiple scatter (and gather) operations as a single operation, reducing training time and memory usage. Applying this sparse approach to training the TransE model results in up to 5.7x speedup on the CPU and up to 1.7x speedup on the GPU. Distributing this algorithm across 64 GPUs, we observe up to 3.9x overall speedup in each epoch. Our proposed sparse approach can also be extended to accelerate other translation-based models such as TransR and TransH.
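To illustrate the SpMM trick with hypothetical shapes: gathering a batch of entity embeddings equals multiplying a sparse one-hot matrix by the dense embedding table, so many scatter/gather operations fuse into a single kernel call.

    import numpy as np
    import scipy.sparse as sp

    n_entities, dim, batch = 10_000, 64, 512
    emb = np.random.rand(n_entities, dim).astype(np.float32)

    ids = np.random.randint(0, n_entities, size=batch)
    onehot = sp.csr_matrix((np.ones(batch, dtype=np.float32),
                            (np.arange(batch), ids)),
                           shape=(batch, n_entities))
    gathered = onehot @ emb                  # SpMM replaces a gather loop
    assert np.allclose(gathered, emb[ids])

    # TransE scores triples (h, r, t) by || h + r - t ||.
    h, t = gathered[:3], gathered[3:6]
    r = np.random.rand(3, dim).astype(np.float32)
    print(np.linalg.norm(h + r - t, axis=1))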

Scalable Low-Latency Hardware Function Chaining with Chain Control Circuit
Description: To deliver advanced services with high performance and flexibility, we are developing a computing platform that integrates various hardware accelerators (HWAs). Our goal is to build customized services for each user by combining application functions. To achieve this, we propose Hardware Function Chaining (HFC), a technology that enables sharing a stateful function among multiple users and low-latency data transfer between HWAs. HFC uses a chain control circuit that allows the HWA to autonomously manage the destinations of multiple data flows (function chains), avoiding CPU bottlenecks. We compared our HFC-based system with a look-aside configuration, in which chain control is handled by the CPU, and evaluated performance with up to eight NOP functions in scenarios with multiple different function chains. The results show that our approach reduces latency to as little as 1/13 that of the look-aside configuration and maintains stable latency and throughput as the system scales.

HARVEST-2.0: High-Performance Vision Framework for End-to-End Preprocessing, Training, Inference, and Visualization
Description: Deep learning (DL) thrives on data; however, it inherits a major limitation: training and testing datasets must be fully annotated for supervised deep neural network (DNN) training. To address this challenge, we introduce HARVEST-2.0, a high-performance computer-vision framework for end-to-end data preprocessing, training, inference, and visualization of computer vision tasks. HARVEST-2.0 utilizes cutting-edge semi-supervised learning algorithms that require only a small subset of labeled data samples. It provides an intuitive web-based interface, enabling domain experts with no prior DL or HPC knowledge to preprocess data, geotag images, train DL models on HPC systems, perform inference, and visualize the results. Our evaluations demonstrate accuracies within 3% of fully supervised training while using fewer than 80 labeled samples per class, and a near-linear reduction in execution time. HARVEST-2.0 is an effort toward AI democratization, enabling end users to carry out preprocessing, interactive labeling, inference, and distributed training in a user-friendly and flexible manner.

Exploring Fine-Grained Memory Analysis for PIM Offloading
Description: A challenge in modern computing is the data bottleneck between the CPU and RAM: the CPU can process data faster than it can be accessed from RAM, and large amounts of RAM are less affordable than a powerful CPU. RAM's high cost creates a need for a cost-effective solution. Processing in Memory (PIM) offers a potential remedy by reducing data movement, thus alleviating bottlenecks. To make the best use of this new hardware, developers need to identify when to offload their programs to a PIM device. To address this need, we have developed a solution that lets developers run Python programs through our pipeline, which highlights the memory-intensive parts of their code.
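One plausible building block for such a pipeline (a sketch, not the project's actual tooling) is Python's built-in tracemalloc, which attributes allocations to source lines and so can flag memory-hot regions as offload candidates:

    import tracemalloc

    def memory_hot():
        return [i * i for i in range(1_000_000)]   # large allocation

    tracemalloc.start()
    data = memory_hot()
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:3]:
        print(stat)                                # top allocation sites
    tracemalloc.stop()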

Uncover the Overhead and Resource Usage for Handling KV Cache Overflow in LLM Inference
Description: LLM inference includes two phases: a prefill phase and a decode phase. The prefill phase processes all input tokens simultaneously to generate the first token. The decode phase generates the subsequent tokens one after another until the output either meets a termination condition or reaches the maximum length. To avoid recomputation, the key-value (KV) cache has become the standard approach for storing previously computed keys and values. Throughout LLM inference, KV cache memory grows linearly with context length and batch size, easily running out of the GPU memory of an instance. State-of-the-art LLM inference systems usually use recomputation or swapping to handle KV cache overflow, and both strategies introduce overhead. However, the overhead of these strategies and the resource utilization over time during LLM inference have not been explored. This work aims to fill this gap by quantifying the overhead of recomputation and swapping, and by analyzing resource utilization during LLM inference to derive insights.
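A back-of-envelope calculation shows why overflow is routine; the model shape below is a 7B-parameter-class assumption, not a figure from the poster:

    # KV cache footprint in FP16: keys + values, per layer, per token.
    layers, heads, head_dim = 32, 32, 128
    bytes_fp16 = 2

    def kv_bytes(batch, context_len):
        return 2 * layers * heads * head_dim * context_len * batch * bytes_fp16

    for batch, ctx in [(1, 4096), (8, 4096), (32, 4096)]:
        print(batch, ctx, f"{kv_bytes(batch, ctx) / 2**30:.1f} GiB")
    # Batch 32 at 4k context already needs ~64 GiB of KV cache alone.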

Scalable Performance and Accuracy Analysis for Distributed and Extreme-Scale Systems
Description: The Scalable Performance and Accuracy analysis for Distributed and Extreme-scale systems (SPADE) project focuses on advancing monitoring, optimization, evaluation, and decision-making capabilities for extreme-scale systems. This poster presents efforts targeting several advanced monitoring capabilities, including support for AMD's new RocProfiler SDK to enable the analysis of hardware performance counters on AMD APUs, such as the MI300, which will be integrated into El Capitan. Another effort extends the PAPI library for heterogeneous CPU support, allowing users to simultaneously monitor chips that combine high-end and low-end processors, enabling more effective tuning across the various cores. Additionally, the project includes a Python version of PAPI (cyPAPI) for use with frameworks and tools being developed for Python in HPC environments; this effort extends to exploring beta versions of cyPAPI with PyTorch to advance decision-making for mixed-precision tuning of machine learning applications.

Improving the Performance of Proof-of-Space in Blockchain Systems
Description: Blockchain technologies enable the success of digital currencies by providing security, decentralization, and trustless operation. Two dominant consensus algorithms, Bitcoin's Proof-of-Work (PoW) and Ethereum's Proof-of-Stake (PoS), balance security, scalability, and energy efficiency, though PoW is energy-intensive and PoS faces centralization risks. Chia's Proof-of-Space (PoSpace) offers a middle ground, using storage instead of computation for validation in the network while maintaining decentralization. PoSpace turns the compute-intensive problem into a data-intensive one known as plotting. However, Chia's plotting process stresses hardware, requiring expensive setups and shortening the lifespan of solid-state drives. This work takes a clean-slate approach to implementing an efficient PoSpace system that is lightweight and runs on anything from small nodes (e.g., Raspberry Pis with 4 cores and 2 GB RAM) to large systems (an HPC server with 192 cores, 790 GB RAM, and multiple NVMe storage devices). Our C and Rust implementations achieve significantly higher performance than Chia in plot generation and lookup efficiency across all system sizes.

Communication Hiding for Matrix-Free Finite Element Operators of a Complex PDE: Nonlinear Stokes Flow of Earth’s Mantle
Description: For large-scale matrix-free finite-element PDE solvers, parallel matrix-vector products typically comprise the dominant computational cost. Synchronization steps for input and output, when degrees of freedom (DOFs) lying on inter-process boundaries are communicated, become the dominant serial portion of the program and the main scalability bottleneck. The cost of communication can be mitigated, however, if it is overlapped with local computations that do not require the values of DOFs on process boundaries. In this research, we study the nonlinear Stokes solver of the mantle convection code Rhea, comparing several methods for overlapping communication with computation during matrix-vector products, including a new dynamic method that automatically adjusts to measured imbalances in communication waiting times. We observe significant improvements in waiting times, and in overall computation times, for the matrix-vector products in Rhea.

Increasing the Efficiency of Neutral Atoms by Reducing Qubit Waste from Measurement-Related Ejections
Description: Quantum computing is an emerging field that has had an impact on various domains. This poster focuses on the advantageous neutral atom technology for quantum computing. In neutral atom systems, a significant challenge arises in the measurement of quantum output: the need to physically eject atoms leads to substantial time wasted reloading atom arrays. To address this issue, we introduce a novel technique that leverages the probabilistic nature of quantum programs to reduce qubit ejections and atom array reloads.

The P3 Explorer: Exploring the Performance, Portability, and Productivity Wilderness
Description: This poster documents the development of a web-based tool designed to organise and present visual representations of performance, portability, and productivity (P3) data from previously published scientific studies. The P3 Explorer operates as both an open repository of scientific data and a data dashboard, providing visual heuristic analyses of performance portability and developer productivity created using the Intel P3 Analysis library. The aim of the project is to create a community-led database of P3 studies to better inform application developers of alternative approaches to developing new applications targeting high performance on diverse hardware, with consideration of developer productivity.

GNN-RL: An Intelligent HPC Resource Scheduler
Description: Efficient resource allocation in high-performance computing (HPC) environments is crucial for optimizing utilization, minimizing makespan, and enhancing throughput. We propose GNN-RL, a novel intelligent scheduler that leverages a hybrid Graph Neural Network and Reinforcement Learning model, learning from historical workload data to implement optimal scheduling policies. Experimental results show that GNN-RL significantly outperforms conventional methods. Compared to the First-Come-First-Served (FCFS) baseline, GNN-RL achieves a 2.1-fold increase in resource utilization (84.25% vs. 39.84%), a 114-fold improvement in throughput (40,061.86 vs. 351.69 jobs/s), and a 114-fold reduction in makespan (4.50s vs. 513.11s). GNN-RL also surpasses EASY Backfilling, with 1.3 times higher resource utilization and 2 times better throughput and makespan. The fairness index remains consistent, indicating that GNN-RL maintains fairness while improving other metrics. Our findings suggest GNN-RL is a significant advancement in intelligent HPC resource management, enabling more efficient and responsive computing environments.

PcMINER: Mining Performance-Related Commits at Scale
Description: Performance inefficiencies in software can severely impact application quality and resource utilization. Addressing these issues often requires significant developer effort, yet the lack of large-scale, open-source performance datasets hinders the development of effective mitigation strategies. To fill this gap, we present PcMINER, a tool that mines performance-inefficiency-related commits from GitHub at scale. PcMINER uses PcERT-KD, a transformer model that classifies these commits with accuracy comparable to 7B-parameter LLMs but with reduced computational cost, making it well suited to CPU cluster deployment. Mining GitHub repositories with a 50-node CPU cluster, PcMINER has generated a dataset of 162K performance-related commits in C++ and 103.8K in Python. This dataset promises to enhance data-driven approaches to detecting performance inefficiencies.
At the poster session, the author will present the problem, motivation, methodology, and results, with additional details available through a QR code.

Profiling Communication Overhead in 3D Parallel Pretrain of Large Language Models
Description: Training large language models (LLMs) efficiently requires addressing the communication overhead introduced by parallelism strategies such as tensor, pipeline, and data parallelism. This work profiles the communication patterns in LLM pretraining on the Polaris supercomputer, highlighting the impact of tensor parallelism, which suffers significant overhead as parallelism scales. To mitigate this, we apply hZCCL, a homomorphic compression technique that reduces communication costs by eliminating decompression-operation-compression cycles. Our results show that hZCCL accelerates training, achieving up to 6.77× speedup in multi-threaded mode while maintaining data accuracy. These improvements allow more efficient scaling of LLM pretraining across distributed nodes.

An Accurate and Scalable Multidimensional Quantum Solver for Partial Differential Equations
Description: Quantum computing is an innovative technology that can solve certain problems faster than classical computing. One of its promising applications is solving partial differential equations (PDEs). However, current PDE solvers based on variational quantum eigensolver (VQE) techniques suffer from low accuracy, long execution times, and poor scalability on noisy intermediate-scale quantum (NISQ) devices, especially for multidimensional PDEs.
We introduce a highly accurate and scalable quantum algorithm for solving multidimensional PDEs and present two variants. The first leverages classical-to-quantum (C2Q) encoding, the finite difference method (FDM), and numerical instantiation, while the second employs C2Q, FDM, and column-by-column decomposition (CCD). We evaluate our algorithm on a multidimensional Poisson equation. Our results demonstrate higher accuracy, better scalability, and faster execution times compared to VQE-based solvers on noise-free and noisy quantum simulators from IBM. We have also investigated the proposed algorithm on hardware emulators, employing various noise mitigation techniques, with encouraging preliminary results.
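For reference, the classical finite-difference discretization that such solvers encode can be written in a few lines; this sketch solves the 1D Poisson problem -u'' = f with zero boundary values (illustrative only, not the C2Q pipeline):

    import numpy as np

    n = 8                                   # interior grid points
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1 - h, n)
    f = np.sin(np.pi * x)                   # exact solution: sin(pi x)/pi^2

    # Tridiagonal (-1, 2, -1)/h^2 matrix approximates -d^2/dx^2.
    A = (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1)) / h**2
    u = np.linalg.solve(A, f)
    print("max error:", np.abs(u - np.sin(np.pi * x) / np.pi**2).max())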

Generalizing ExaDigiT Datacenter Digital Twin Framework for Multiple Architectures
Description: ExaDigiT is an open-source framework for developing comprehensive digital twins (DTs) of liquid-cooled supercomputers. DTs merge telemetry, simulations, and AI/ML/RL to create a complete virtual representation of a system. This provides an effective tool for testing a variety of system optimizations, determining the impact and outcomes of hypothetical "what-if" scenarios, and creating virtual prototypes of future systems with performance and cost insights. Our framework consists of three primary modules: (1) a resource allocator and power simulator, (2) a thermofluidic cooling model, and (3) an augmented reality model of a supercomputing cluster and its cooling plant. ExaDigiT can predict the power and energy losses of synthetic and real workloads, simulate complex transient dynamics to provide accurate cooling predictions, and provide an interactive means of analyzing relevant data. ExaDigiT is released under the Apache and MIT licenses.

Edge-Enabled Real-Time Data Processing in Power-Efficient Weather Stations Using IBIS
Description: There is a growing need to acquire larger quantities of meteorological data to address climate change. In this work, we design an improved Automatic Weather Station (AWS) based on a prototype from the National Center for Atmospheric Research (NCAR). We integrate this weather station with IBIS, a platform for adaptable, multi-sensor data collection on edge devices. Our solution uses a Raspberry Pi 4 to aggregate sensor data from AWSs over long-range (LoRa) radio. We present a real-time data visualization platform using Grafana and InfluxDB, hosted on the Chameleon testbed. We show how the expanded peripherals allow the implementation of novel weather forecasting techniques, demonstrate the power efficiency of our solution by comparing the power consumption of our chosen microcontroller to that of the Raspberry Pi, and examine how our implementation can address challenges in big-data weather forecasting.

CoVA: Compiler for Versatile Architectures
Description: The rapid advancement in computing demands and the increasing complexity of modern applications (e.g., image processing, numerical computation, and machine learning) necessitate efficient heterogeneous computing solutions. Project CoVA proposes an MLIR-based compilation flow designed to bridge the gap between high-level algorithm development in Python and diverse hardware architecture development in C++, spanning CPUs, GPUs, FPGAs, and quantum computing units. By utilizing a high-level MLIR dialect specifically designed for CoVA, we decouple algorithms from backend hardware, facilitating more efficient algorithm development and significantly reducing development cycles. Moreover, our design enhances development efficiency on HPC platforms equipped with heterogeneous accelerators, enabling faster and more streamlined development processes.

Trusted Platform Provisioning for the OpenCHAMI Cluster Management Stack
Description: High performance computing (HPC) clusters have traditionally relied on proprietary provisioning and management infrastructure. This can be problematic, especially with regard to ongoing security and maintenance for vendored systems.
As an alternative, the Los Alamos National Laboratory (LANL) leads development of the Open Composable Heterogeneous Application Management Infrastructure (OpenCHAMI) stack, which provides a modular suite of size- and platform-independent cluster management tools. A major barrier to the full deployment of OpenCHAMI at LANL is its lack of authentication for access to sensitive data, such as private SSH keys or service tokens. To resolve this, we implement and integrate a node authentication system under which secret configuration data may be requested only by system processes or authorized users.
We present a containerized, microservice-based authentication system for post-boot compute node configuration, based on the Canonical cloud-init platform. The system is optimized to minimize its impact on cluster boot speed.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Parallelization of the Finite Element-Based Mesh Warping Algorithm Using Hybrid Parallel Programming
DescriptionWarping large volume meshes is computationally expensive and has applications in biomechanics, aerodynamics, image processing, and cardiology. Existing parallel implementations of mesh warping algorithms do not take advantage of the shared-memory and one-sided communication features available in MPI-3. In this poster, we describe our parallelization of the finite element-based mesh warping algorithm for tetrahedral meshes. Our implementation takes advantage of shared memory and one-sided communication, and it deforms a mesh by solving a linear system with multiple right-hand sides based on the solution of a Poisson boundary value problem. Our results demonstrate excellent efficiency and strong scalability on up to 32 cores on a single node. Furthermore, with 256 cores distributed uniformly across 64 nodes, we observe a 90% improvement over our largest single-node speedup, although speedups remain sublinear overall.
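The "one factorization, many right-hand sides" pattern at the heart of this approach can be illustrated with a serial SciPy sketch (the poster's actual implementation is MPI-3 based; the matrix below is a stand-in for the finite element system):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    n = 1000
    # 1-D Laplacian standing in for the finite element Poisson system.
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
    B = np.random.rand(n, 3)            # one right-hand side per coordinate (x, y, z)

    lu = splu(A)                        # factor once
    X = lu.solve(B)                     # solve all three RHS with one factorization
    print(np.abs(A @ X - B).max())      # residual check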
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

QFw: A Quantum Framework for Large-Scale HPC Ecosystems
DescriptionThis work extends the Quantum Framework (QFw) by integrating it with the Northwest Quantum Simulator (NWQ-Sim) and by introducing a lightweight Python library (qfw_backend) that allows multiple frontends (e.g., Qiskit) to interact with QFw. This extension enables QFw to flexibly decouple frontends from backends (e.g., NWQ-Sim). We demonstrate this capability by executing a Greenberger-Horne-Zeilinger (GHZ) circuit using Qiskit and PennyLane with different backends. QFw also makes scaling to multiple nodes easy; we showcase this by running GHZ scaling tests up to 32 qubits on different numbers of nodes on Frontier. To demonstrate the use of QFw for real-world problems, we solve a metamaterial optimization problem using a Quantum Approximate Optimization Algorithm (QAOA). We observe that QFw over NWQ-Sim marginally improves on Qiskit Aer's accuracy. These additions prepare QFw to run hybrid applications in a hybrid resource environment, since it treats actual quantum hardware and simulators alike.
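For readers unfamiliar with the benchmark circuit, a GHZ state is straightforward to express in plain Qiskit; under QFw the same frontend circuit would be dispatched to a backend such as NWQ-Sim (a sketch, not the poster's code):

    from qiskit import QuantumCircuit

    def ghz(n_qubits: int) -> QuantumCircuit:
        qc = QuantumCircuit(n_qubits)
        qc.h(0)                        # put the first qubit in superposition
        for q in range(1, n_qubits):
            qc.cx(0, q)                # entangle every other qubit with it
        qc.measure_all()
        return qc

    print(ghz(4))                      # the scaling tests grow n_qubits up to 32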
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Efficient Approaches to Analyzing Large Dynamic Networks
DescriptionDynamic graphs, characterized by their evolving topologies over time, necessitate continuous updates to their associated graph properties, including shortest paths, vertex coloring, and strongly connected components. Traditional static graph algorithms, which re-compute properties following each modification, typically falter in efficiency under such conditions. In this paper, we introduce a suite of methodologies implemented within our software platform, CANDY, designed to efficiently analyze dynamic graphs. We propose a generic framework that supports the parallel updating of graph properties across large networks subject to various types of changes. Our results demonstrate the enhanced performance of these update algorithms in managing large dynamic networks, highlighting significant improvements over conventional approaches.
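A toy sketch of the underlying idea, not CANDY itself: after an edge insertion, a dynamic shortest-path update re-relaxes only the affected vertices rather than recomputing the property from scratch.

    import heapq

    def insert_edge_and_update(adj, dist, u, v, w):
        adj.setdefault(u, []).append((v, w))
        if dist.get(u, float("inf")) + w >= dist.get(v, float("inf")):
            return                        # insertion changes no distances
        dist[v] = dist[u] + w
        heap = [(dist[v], v)]             # propagate only from the affected vertex
        while heap:
            d, x = heapq.heappop(heap)
            if d > dist.get(x, float("inf")):
                continue
            for y, wxy in adj.get(x, []):
                if d + wxy < dist.get(y, float("inf")):
                    dist[y] = d + wxy
                    heapq.heappush(heap, (dist[y], y))

    adj = {0: [(1, 4)], 1: [(2, 3)]}
    dist = {0: 0, 1: 4, 2: 7}             # shortest paths from vertex 0
    insert_edge_and_update(adj, dist, 0, 2, 2)
    print(dist)                           # {0: 0, 1: 4, 2: 2}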
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

JUmPER: Performance Data Monitoring, Instrumentation and Visualization for Jupyter Notebooks
DescriptionComputational performance, e.g., CPU or GPU utilization, is crucial for analyzing machine learning (ML) applications and deploying them resource-efficiently. However, the ML community often lacks accessible tools for holistic performance engineering, especially during exploratory programming of the kind Jupyter supports. Therefore, we present JUmPER, a Jupyter kernel that supports coarse-grained performance monitoring and fine-grained analysis of user code in Jupyter.
JUmPER collects system metrics and stores them alongside executed user code. Additionally, code instrumentation can be enabled to collect performance events using Score-P. Built-in Jupyter magic commands provide visualizations of the monitored performance data directly in Jupyter. In addition, JUmPER preserves the exploratory programming experience by seamlessly integrating with Jupyter and reducing kernel runtime overhead through in-memory (pipe) communication and parallel marshalling of Python's interpreter state for the Score-P execution.
JUmPER thus provides a low-hurdle infrastructure for performance engineering in Jupyter and supports resource-efficient ML applications.
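A minimal sketch of the coarse-grained monitoring idea, assuming the psutil package (JUmPER itself additionally stores samples alongside the executed code and supports Score-P instrumentation):

    import threading
    import time

    import psutil

    samples = []

    def sample(stop, interval=0.5):
        while not stop.is_set():
            samples.append((time.time(),
                            psutil.cpu_percent(interval=None),
                            psutil.virtual_memory().percent))
            time.sleep(interval)

    stop = threading.Event()
    t = threading.Thread(target=sample, args=(stop,), daemon=True)
    t.start()
    sum(i * i for i in range(10_000_000))     # stand-in for a user's notebook cell
    stop.set()
    t.join()
    print(len(samples), "samples; last:", samples[-1])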
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Memory Disaggregation in Serverless Computing
DescriptionIn recent years, the slowing of advancements in memory technology and applications’ increasing demand for memory have resulted in high performance computation becoming bottlenecked by availability of memory. One existing solution, far memory, involves swapping pages to a remote machine rather than a local disk. Function-as-a-service (FaaS) platforms have also become more prevalent, allowing the remote execution of workloads. We first explore the viability of integrating one FaaS tool, Globus Compute, with a remote swap system for far memory, FastSwap. Then, we investigate the performance of the combined system on various workloads to determine which ones can incorporate remote memory without excessive overhead cost. We find that for certain workloads, including breadth-first search and minimum spanning tree, it is possible to use up to 30% remote memory without significant slowdowns. In the poster session, we will present our approach, findings, limitations, and potential generalizations.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Evolving a Multi-Population Evolutionary-QAOA on Distributed QPUs
DescriptionOur research combines an Evolutionary Algorithm with a Quantum Approximate Optimization Algorithm (QAOA), updating the ansatz parameters in place of traditional gradient-based methods, and benchmarks the approach on the Max-Cut problem. We demonstrate that our Evolutionary-QAOA pairing performs on par with or better than a COBYLA-based QAOA in terms of solution accuracy and variance for 3-regular graphs (d = 3) with 4 to 26 nodes, using Conditional Value at Risk for fitness function evaluations. Furthermore, we take our algorithm one step further and present a novel multi-population algorithm distributed on two QPUs, which evolves independent, isolated populations in parallel while classically communicating elite individuals. Experiments were conducted on both simulators and quantum hardware, with investigations of relative accuracy and variance.
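A conceptual numpy sketch of the gradient-free loop described above, with a cheap random stand-in for circuit execution: a population of QAOA angle vectors evolves by elitism and mutation, scored by Conditional Value at Risk (CVaR).

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_cut_values(params, shots=128):
        # Placeholder for executing the QAOA circuit and measuring cut values
        # (lower is better here); a real run would query the QPU or simulator.
        return rng.normal(loc=-np.cos(params).sum(), scale=1.0, size=shots)

    def cvar(values, alpha=0.25):
        k = max(1, int(alpha * len(values)))
        return np.sort(values)[:k].mean()      # mean of the best tail of shots

    pop = rng.uniform(0, np.pi, size=(16, 4))  # 16 individuals, 4 QAOA angles
    for gen in range(50):
        fitness = np.array([cvar(sample_cut_values(p)) for p in pop])
        elite = pop[np.argsort(fitness)[:4]]   # keep the best quarter
        children = elite[rng.integers(0, 4, 12)] + rng.normal(0, 0.1, (12, 4))
        pop = np.vstack([elite, children])     # elites migrate unchanged
    print("best CVaR fitness:", cvar(sample_cut_values(pop[0])))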
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

SWARM: Scientific Workflow Applications on Resilient Metasystem
DescriptionCurrent (centralized) resource management strategies typically require a global view of distributed HPC systems, relying on a cluster-wide resource manager for scheduling, with static, expert-tuned rules. This centralized decision-making approach suffers from resilience, efficiency and scalability issues. In this work, we describe our initial progress in the SWARM project that takes a novel decentralized multi-agent approach leveraging Swarm Intelligence (SI) and consensus strategies for enhanced robustness, resilience, and fault tolerance. We present our foundational SWARM system model to improve network overlays, enhance job selection using multi-agent consensus algorithms, and design SI-inspired scheduling approaches.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Benchmarking and Modeling of Producer-Consumer Data Movement Performance in Scientific Workflows
DescriptionIn this poster, we present the Analytics4X (A4X) framework, a workflow framework that enables systematic studies, evaluation, and modeling of HPC/HTC workflow data movement in a flexible and controlled environment. We apply A4X to an in situ molecular dynamics workflow to assess the performance trade-offs of two data management solutions: DYAD and Lustre. Through this assessment, we illustrate the importance of selecting a data solution that optimizes for both data movement and producer-consumer synchronization.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Cluster-Based Methodology for Characterizing the Performance of Portable Applications
DescriptionThis work focuses on performance portability and proposes a methodological approach to assessing and explaining how different kernels behave across various hardware architectures using the RAJA Performance Suite (RAJAPerf). Our methodology leverages metrics from the Intel top-down pipeline and clustering techniques to sort the kernels based on performance characteristics. We assess the methodology on 54 RAJAPerf computational kernels on Intel Xeon and NVIDIA V100 platforms. Our results confirm the effectiveness of our methodology in automatically characterizing performance differentials and speedups, particularly in memory-bound kernels.
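The clustering step can be sketched with scikit-learn under stated assumptions (metric values fabricated here; columns standing in for top-down-style metrics):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    # Rows are kernels; columns stand in for top-down metrics,
    # e.g. memory-bound fraction, core-bound fraction, GPU speedup.
    metrics = np.vstack([rng.normal((0.8, 0.1, 2.0), 0.05, (20, 3)),
                         rng.normal((0.2, 0.7, 8.0), 0.05, (20, 3))])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
        StandardScaler().fit_transform(metrics))
    print(labels)        # memory-bound vs compute-bound kernel groups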
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

DART-X: Software Infrastructure for Prototyping In-Memory Data Transfer Between Ensemble Data Assimilation and Coupled Earth Systems Models
DescriptionData assimilation (DA) integrates real-world data into coupled climate models, enhancing prediction accuracy and capturing Earth system complexity. NCAR’s Data Assimilation Research Testbed (DART) is an ensemble DA tool for climate predictions with NCAR’s Community Earth System Model (CESM). Traditionally, DART has modified "restart" files written to disk to influence models, involving significant I/O and stop/restart processes, which are computationally expensive, even on large supercomputers.
This work explores using the National Unified Operational Prediction Capability (NUOPC) layer to enable direct in-memory data transfer between DART and CESM. We describe the development of a NUOPC cap for DART, focusing on strategies for full software integration and minimal disruption to existing functionalities. The infrastructure addresses software incompatibilities and includes decisions on tools, frameworks, and workflow optimizations. This approach aims to enhance efficiency and scalability in data assimilation, offering the first prototype for in-memory data transfer between DART and CESM.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

HPC Fastpass: Visualizing Descriptive and Predictive HPC Queue Time Data
DescriptionWith large HPC systems, users will often jockey for better queue times to get quicker results. Unfortunately, getting accurate estimations of queue times requires understanding complex and abundant data collected from myriad HPC system loggers. To aid with this, researchers are exploring machine learning to shortcut the analysis of these factors and give discrete predictions. Unfortunately, these models are imperfect, expressing varying degrees of accuracy. This imperfection must be conveyed to users in the form of uncertainty quantification. Thus, to provide users with a better understanding of queue wait times on NREL's Eagle HPC system, we developed a visualization that simplifies this complex data and aids decision making. This visualization summarizes uncertainty information associated with a user's specific queue time prediction and places it into the larger context of historical data, encoding job submission variables that users can change to show the impact of their choices on queue wait time.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

MIGnificient: Fast, Isolated, and GPU-Enabled Serverless Functions
DescriptionEmerging applications in machine learning and personalized medicine introduce new challenges and requirements for secure computing. While the exclusive allocation of resources to a single tenant provides the necessary isolation, it comes at the cost of hardware underutilization. While solutions like containers allow for secure sharing of CPUs, new techniques are needed to efficiently co-locate applications on GPUs. We propose a new approach that merges the elasticity of Function-as-a-Service (FaaS) with the physical GPU partitioning of NVIDIA MIG. In MIGnificient, we provide spatial isolation through concurrent execution on different device partitions, preventing side-channel attacks and performance interference. We employ local API remoting that controls kernel scheduling and memory transfers, enabling compute-communication overlap and improved resource management in the virtualized API. MIGnificient overcomes the limitations of state-of-the-art solutions that rely on slower network-based API remoting and insecure NVIDIA MPS, creating a unifying model for optimized serverless GPU functions.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

iSeeMore: Design of a 256-Node RPi Cluster to Visualize LLM Computation Through Light and Movement for Mass Audiences
DescriptioniSeeMore is a kinetic cluster of 256 Raspberry Pi (RPi) computers that visually realizes supercomputing concepts (parallelism, data flow, algorithms) through servo-driven movement and LED lighting. In this poster, we describe the design of iSeeMore, the first large-scale cluster to combine movement and light in the service of educating audiences about the parallel algorithms and systems that underpin the everyday technologies we use today. We discuss core design decisions, software features for synchronizing LED hats with computation and movement, our approach to visually demonstrating parallel AI/ML concepts (e.g., LLMs), and the plan to showcase iSeeMore to large audiences.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

5G in Practice: Measuring Emerging Wireless Technology in Rural Iowa for Edge Devices in Distributed Computation Workloads
DescriptionEdge computing, the notion of moving computational tasks from servers to the data-generating network edge, is an increasingly popular model for data processing. 5G wireless technologies offer an opportunity to enable complex distributed edge computing workflows by minimizing the overhead incurred in transmitting data to peer devices. In this work, we demonstrate the use and performance of edge devices in distributed computation workloads using Hadoop MapReduce on a cluster of six 5G-connected Raspberry Pis. Specifically, we first determine the network capabilities (i.e., latency and throughput) across millimeter wave (mmWave) 5G links and then analyze the scalability and performance of our cluster. Our experiment uses 5G radios at the Agricultural and Rural (ARA) Wireless Living Lab, spanning over six miles in diameter.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Trackable Agent-Based Evolution Models at Wafer Scale
DescriptionEmerging ML/AI hardware accelerators, like the 850,000 processor Cerebras Wafer-Scale Engine (WSE), hold great promise to scale up the capabilities of evolutionary computation. However, challenges remain in maintaining visibility into underlying evolutionary processes while efficiently utilizing these platforms' large processor counts. Here, we focus on the problem of extracting phylogenetic history. We present a tracking-enabled asynchronous island-based genetic algorithm (GA) framework for WSE hardware. Emulated and on-hardware GA benchmarks with a simple tracking-enabled agent model clock upwards of 1 million generations per minute for population sizes reaching 16 million. We validate phylogenetic reconstructions from these trials and demonstrate their suitability for inference of underlying evolutionary conditions. In particular, we demonstrate extraction of clear phylometric signals that differentiate adaptive dynamics. Kernel code implementing the island-model GA supports drop-in customization to support any fixed-length genome content and fitness criteria, benefiting further explorations within the evolutionary biology and evolutionary computation communities.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

New Semi-Implicit Electrostatic Particle-In-Cell Method to Extend Scope of the Exascale WarpX Code
DescriptionTime-explicit fully-kinetic simulation of magnetically confined fusion plasmas is out of reach of existing supercomputers due to the multi-scale nature of the system. Physics approximations and time-implicit methods are typically used to simulate such plasmas. A new energy-conserving, semi-implicit Poisson solver has been developed and added to the open-source particle-in-cell code WarpX. The new solver enables electrostatic simulations to be performed using orders of magnitude less computational resources than previously possible, without reducing accuracy. Accuracy and computational speedup of the model are demonstrated by simulation of plasma expansion into vacuum and spoke-mode formation during a Penning discharge (relevant to the plasma processing industry). The reduction in time-to-solution allows researchers to tackle problems that were previously computationally infeasible. Performance on GPUs and issues of scaling to fusion plasma conditions are discussed.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

KVSort: Drastically Improving LLM Inference Performance via KV Cache Compression
DescriptionLarge language model (LLM) deployment necessitates high inference throughput due to the increasing demand for text generation. To accelerate inference, the prefill mechanism avoids repeated computation by introducing a KV cache (in HBM). However, the KV cache grows with the input and generated text length, causing insufficient GPU memory and slow KV fetching. To address these issues, existing approaches compress the KV cache using prune-based mechanisms that keep only the important KV vectors in the cache. However, their compression ratio is limited by the need to preserve inference accuracy in the accuracy-compression tradeoff. To improve the compression ratio, we introduce KVSort, a novel framework that applies error-bounded lossy compression to sorted KV vectors. The evaluation shows that KVSort achieves up to 52x compression ratio and 6.8x end-to-end inference performance improvement, compared to a state-of-the-art approach that achieves 20x compression ratio and 5.5x end-to-end inference throughput.
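An illustrative numpy sketch of the idea as described, not the KVSort implementation: sorting makes neighboring KV values similar, so an error-bounded quantizer over the sorted stream yields many repeated codes that entropy-code well.

    import numpy as np

    rng = np.random.default_rng(0)
    kv = rng.normal(size=(256, 64)).astype(np.float32)   # toy KV vectors

    order = np.argsort(np.linalg.norm(kv, axis=1))       # sort vectors by norm
    sorted_kv = kv[order]

    eb = 1e-2                                            # absolute error bound
    codes = np.round(sorted_kv / (2 * eb)).astype(np.int32)
    recon = codes * (2 * eb)

    assert np.max(np.abs(recon - sorted_kv)) <= eb + 1e-6
    print("unique codes:", np.unique(codes).size, "of", codes.size)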
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Web-Based Simulator of Superscalar RISC-V Processors
DescriptionUnlock the power of superscalar processor design with our cutting-edge RISC-V simulator! Tailored for IT students, researchers, and HPC professionals, this web-based tool brings complex architectures to life with an intuitive, customizable interface. Explore processor components, tweak configurations, and benchmark code snippets—all from your browser.
The simulator offers seamless support for C and assembly programs, built-in performance metrics, and full GCC compiler integration for various optimization levels. Whether you're learning or innovating, this tool enables you to experiment with different architectural setups, analyze results, and export configurations for sharing.
Designed to deepen your understanding of processor design and HW-SW co-design, the simulator supports both interactive exploration and batch processing via command-line. Perfect for those aiming to optimize RISC-V processors and HPC codes, it’s more than just a learning tool—it’s a powerful platform for research and development. Get ready to elevate your skills and performance optimization with this advanced simulator!
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Quantum Volume Benchmarking Simulators on HPC Systems
DescriptionClassical simulators of quantum computers are essential for small-scale comprehension and testing of quantum computing algorithms. Quantum Volume (QV) is a well-established benchmark for comparing Quantum Processing Units (QPUs) in the NISQ era. However, there is no QV benchmark covering a large variety of current quantum simulators. This poster compares quantum computing simulators running on a single CPU- or GPU-based HPC node using the Quantum Volume benchmark. As simulators do not suffer from noise, the metric used in the comparison is the time required to simulate a set Quantum Volume. In the poster session, we will provide further differentiating information about each simulator and more detail on the proof of HOP convergence, a key component of QV testing with noiseless simulators.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Simplifying HPC Resource Selection: A Tool for Optimizing Execution Time and Cost on Azure
DescriptionAzure Cloud offers a wide range of resources for running HPC workloads, requiring users to configure their deployment by selecting VM types, number of VMs, and processes per VM. Suboptimal decisions may lead to longer execution times or additional costs for the user. We are developing an open-source tool to assist users in making these decisions by considering application input parameters, as they influence resource consumption. The tool automates the time-consuming process of setting up the cloud environment, executing the benchmarking runs, handling output, and providing users with resource selection recommendations as high-level insights on run times and costs across different VM types and number of VMs. In this work, we present initial results and insights on reducing the number of cloud executions needed to provide such guidance, leveraging data analytics and optimization techniques with two well-known HPC applications: OpenFOAM and LAMMPS.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Exploring DAOS as a Burst Buffer for a 100 Gbps DAQ Real-Time Streaming System
DescriptionWe present an experimental evaluation of a burst buffer for a real-time DAQ streaming system designed to transmit instrument data to remote data centers. The system is based on EJ-FAT, a load-balancing system capable of handling N x 100 Gbps streams and distributing data from event sources to processing nodes. We explore applying DAOS as a burst buffer serving several purposes: improving resiliency and elasticity and adding new functions to the processing pipeline. In the evaluation, a sender transmits events over a 100 Gbps network to a receiver integrated with DAOS, which stores the reassembled events using DAOS APIs. We evaluate the system for possible bottlenecks and provide an end-to-end evaluation with a burst buffer using DAOS storage abstractions. We show that a receiver node can support 38.1 Gbps. This demonstrates the viability of our approach and allows us to extend this work to investigate scale-out properties and new streaming optimizations.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Persistent and Partitioned MPI for Stencil Communication
DescriptionMany parallel applications rely on iterative stencil operations, whose performance is dominated by communication costs at large scales. Several MPI optimizations, such as persistent and partitioned communication, reduce overheads and improve communication efficiency through amortized setup costs and reduced synchronization of threaded sends. This paper presents the performance of stencil communication in the Comb benchmarking suite when using non-blocking, persistent, and partitioned communication routines. The impact of each optimization is analyzed at various scales. Further, the paper presents an analysis of the impact of process count, thread count, and message size on partitioned communication routines. Measured timings show that persistent MPI communication can provide a speedup of up to 37% over the baseline MPI communication, and partitioned MPI communication can provide a speedup of up to 68%.
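A hedged mpi4py sketch of the persistent-communication pattern measured here: the request is set up once and then started and completed each iteration, amortizing setup costs (run with two ranks, e.g. under mpirun -n 2).

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    peer = 1 - rank                       # assumes exactly two ranks

    send = np.zeros(1024, dtype=np.float64)
    recv = np.empty_like(send)
    sreq = comm.Send_init(send, dest=peer, tag=0)     # one-time setup
    rreq = comm.Recv_init(recv, source=peer, tag=0)

    for it in range(100):                 # reuse the same requests each iteration
        send[:] = it + rank
        rreq.Start()
        sreq.Start()
        sreq.Wait()
        rreq.Wait()

    if rank == 0:
        print("final received value:", recv[0])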
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

An Adaptive Kernel Execution for Dynamic Applications on GPUs Using CUDA Graphs
DescriptionWe propose a novel approach for executing dynamic applications on GPUs. Unlike traditional approaches that use a single kernel, our method allows the GPU to autonomously allocate computational resources at runtime. We decompose a kernel into multiple fragment kernels and dynamically launch an optimal number of them during execution. The input data is partitioned into smaller segments, and each fragment kernel processes one of the partitioned segments. This method is implemented using conditional nodes in CUDA graphs to determine the number of fragment kernels to launch based on the input size. We compared the proposed method with the traditional kernel execution method on a Breadth-First Search (BFS) application, a representative dynamic application. Results show comparable performance while reducing utilization of compute resources by up to 19.9%, with opportunities for further performance improvement by optimizing the parameters of our method.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Mind Your Manners: Detoxifying Language Models via Attention Head Intervention
DescriptionTransformer-based Large Language Models have advanced natural language processing with their ability to generate fluent text. However, these models exhibit and amplify toxicity and bias learned from training data, posing new ethical challenges. We build upon the Attention Lens framework to allow for scalable decoding of attention-mechanism information. We then use this decoded information to implement a pipeline that generates and removes toxic memories from pre-trained language models in a way that is both human-interpretable and effective while retaining model performance.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Bringing It HOME: Analyzing Contention Hotspots Across the Memory Hierarchy with Low Overhead
DescriptionThe increasing demand for computing in scientific research has given rise to memory contention and performance bottlenecks. Existing solutions often carry high overheads or lack the necessary detail for effective contention mitigation. To tackle these challenges, we are developing a powerful tool, HOME (Hierarchy-Oriented Memory Evaluation), which can efficiently identify contention by capturing detailed load-store traces and passing them to configurable memory hierarchy models.
HOME helps programmers identify the code regions that create contention at various memory hierarchy levels. This enables developers to redesign applications and optimize memory hierarchy designs efficiently. Our preliminary assessments indicate that HOME can save time by up to 50x compared to the state-of-the-art, with an average error rate of 6.51%. We also provide solutions for mitigating the impact of sample drops on contention analysis, a common issue in trace-based analysis.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Active Learning for Metamaterial Optimization on HPC and QC Integrated Systems
DescriptionActive learning algorithms, integrating machine learning, quantum computing and optics simulation in an iterative loop, offer a promising approach to optimizing metamaterials. However, these algorithms can face difficulties in optimizing highly complex structures due to computational limitations. High-performance computing (HPC) and quantum computing (QC) integrated systems can address these issues by enabling parallel computing. In this study, we develop an active learning algorithm working on HPC-QC integrated systems. We evaluate the performance of optimization processes within active learning (i.e., training a machine learning model, problem-solving with quantum computing, and evaluating optical properties through wave-optics simulation) for highly complex metamaterial cases. Our results showcase that utilizing multiple cores on the integrated system can significantly reduce computational time, thereby enhancing the efficiency of optimization processes. Therefore, we expect that leveraging HPC-QC integrated systems helps effectively tackle large-scale optimization challenges in general.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

An Error-Bounded Lossy Compression Method with Bit-Adaptive Quantization for Particle Data
DescriptionWe present error-bounded lossy compression tailored for particle datasets from diverse scientific applications in cosmology, fluid dynamics, and fusion energy sciences. As today's high-performance computing capabilities advance, these datasets often reach trillions of points, posing significant analysis and storage challenges. While error-bounded lossy compression makes it possible to represent floating-point values with strict pointwise accuracy guarantees, the lack of correlations in particle data's storage ordering often limits the compression ratio. Inspired by quantization-encoding schemes in SZ lossy compressors, we dynamically determine the number of bits to encode particles of the dataset to increase the compression ratio. Specifically, we utilize a k-d tree to partition particles into subregions and generate "bit boxes" centered at particles for each subregion to encode their positions. These bit boxes ensure error control while reducing the bit count used for compression. We evaluate our method against state-of-the-art compressors on cosmology, fluid dynamics, and fusion plasma datasets.
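A toy version of the bit-reduction idea under stated assumptions (a uniform grid stands in for the k-d tree partition): storing positions as small integer offsets from per-region anchors needs far fewer bits than global coordinates while respecting the error bound.

    import numpy as np

    rng = np.random.default_rng(0)
    pos = rng.uniform(0.0, 1000.0, size=(100_000, 3))   # toy particle positions
    eb = 1e-3                                           # pointwise error bound

    region = np.floor(pos / 10.0)         # uniform grid standing in for the k-d tree
    anchor = region * 10.0                # per-region reference points
    codes = np.round((pos - anchor) / (2 * eb)).astype(np.int32)

    recon = anchor + codes * (2 * eb)
    print("max error:", np.abs(recon - pos).max())   # stays within eb
    print("max code :", codes.max())                 # ~5000, fits in 13 bits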
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Creating Code LLMs for HPC: It’s LLMs All the Way Down
DescriptionLarge language models (LLMs) are increasingly used by software developers, researchers, and students to assist them in coding tasks. While newer LLMs have been improving their abilities on serial coding tasks, they consistently perform worse on parallelism and HPC-related coding tasks. Bridging this gap and creating HPC-capable code LLMs could drastically improve the quality and quantity of code that research software developers can write. The current poor performance of LLMs on HPC-related problems can be partially attributed to the lack of significant HPC data in their training, which is what we address in this poster. We present HPC Coder v2, a new LLM created by fine-tuning a previous code LLM on synthetic HPC data. We demonstrate that it is one of the most capable open-source LLMs for generating parallel code to date.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Towards Scalable Quantum Simulation on Wafer-Scale Engines
DescriptionQuantum computing promises many benefits to the computing world, given its efficiency over classical computers in certain problem spaces. However, the high noise and low logical qubit counts of contemporary noisy intermediate-scale quantum (NISQ) devices make the development and execution of quantum algorithms difficult. As such, developers and researchers use classical simulation to prototype and validate their algorithms, often utilizing specialized classical hardware such as GPUs and FPGAs for their parallelization capabilities. We propose an optimized and scalable method for quantum simulation using the complex general matrix-vector product (GEMV) operation on the Cerebras wafer-scale engine (WSE) architecture. We experimentally determine the scalability of our method for differing qubit counts. Finally, we demonstrate viability and scalability by simulating practical quantum image processing circuits using our method.
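A small numpy illustration of the mapping the method exploits: applying a gate to an n-qubit statevector is a matrix-vector product, the operation cast as GEMV work on the WSE. Here, a Hadamard on qubit 0 of a 3-qubit register.

    import numpy as np

    n = 3
    state = np.zeros(2**n, dtype=complex)
    state[0] = 1.0                                   # |000>

    H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
    I = np.eye(2, dtype=complex)

    U = np.kron(H, np.kron(I, I))   # Hadamard on the most significant qubit
    state = U @ state               # one GEMV per gate application
    print(np.round(state, 3))       # amplitude 1/sqrt(2) on |000> and |100>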
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

I/O Characterization of Heterogeneous Workflows
DescriptionWorkflows consist of individual applications such as scientific simulations and data analytics. These applications constitute different stages of the workflow, each with heterogeneous characteristics such as run times and system requirements. The heterogeneity of these workflow stages dictates the need to characterize them efficiently in terms of I/O, providing insights that can lead to informed decisions for their optimization. In this work, we analyze the run times of the Montage, 1000 Genome, and MuMMI workflows and categorize their stages as I/O-bound or non-I/O-bound. For the I/O-bound stages, we perform a detailed analysis of their bandwidth and resource requirements. Our findings indicate that Montage's mBgModel could benefit from dynamic resource scheduling, while Genome's individuals_merge could benefit from aggregating its PFS requests and from isolated storage solutions such as node-local storage. These optimizations could help serve the bandwidth requirements of this workflow stage.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

RAPIDS: Reduced API Data-Transfer Specifications
DescriptionPerformance of all-purpose communication libraries like MPI is fundamentally limited by the all-in-one approach these libraries use. RAPIDS (Reduced API Data-transfer Specifications) divides the functionality of these libraries into separate, more focused APIs, enabling library and application developers to avoid costly overhead of functionality they don’t use. This approach is highly adaptive and will evolve alongside modern GPUs, DPUs, and other accelerators.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

A Zero-Copy Storage with Metadata-Driven File Management Using Persistent Memory
DescriptionPersistent Memory (PM) is a promising next-generation storage device, combining features of both volatile memory (like DRAM) and non-volatile memory (like SSDs). Many studies use PM to optimize training to advance deep learning technology. However, these studies have not addressed the issue of multiple copies of training data during deep learning, leading to reduced training efficiency. In this study, we first analyze the characteristics of PM and mainstream file systems. We then explore PM's byte addressability to manage metadata and data efficiently. This approach minimizes multiple I/O operations of tasks involving repeated read-write data accesses, such as machine learning datasets, enabling zero-copy data handling and significant speedups of read-and-write operations.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Exploring Software-Defined Networking for Routing in Dragonfly Topology
DescriptionWhile efficient routing in Dragonfly networks presents significant challenges, the advent of Software-defined Networking (SDN) offers new opportunities for routing optimization by providing a global network view. This research proposes an SDN-based adaptive routing approach for Dragonfly interconnects. By leveraging global traffic information from SDN, our approach identifies and avoids persistent congestion points that may occur with conventional UGAL routing, leading to improved resource utilization and enhanced performance. This study addresses a critical gap in the development of efficient adaptive routing solutions using SDN technology.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Cluster Management with Containerization on Switches
DescriptionThis research explores the utilization of underused computational resources in network switches from Arista and Mellanox by deploying containers for auxiliary tasks in computational cluster networks. We tested five scenarios: 1) running cloud-init services for efficient boot processes, 2) using Telegraf for network monitoring, 3) deploying a caching proxy to reduce latency, 4) setting up an IPv6 DHCP/DNS provider for VLANs, and 5) implementing client detection with Magellan for network topology mapping. Containers were deployed using Podman and Docker on SONiC-based switches and tested both physically and virtually. Results demonstrated the feasibility and benefits of this approach, which optimizes network performance and reduces server load, offering a cost-effective enhancement to HPC clusters. Future work can expand this research to include additional network management and security tasks directly on the switches.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Benchmarking Quantum-Inspired Optimization Platforms and Tools on an HPC Cluster
DescriptionThis work compares performance of various QUBO software tools on different platforms available on the Sol supercomputer at Arizona State University. CPU, GPU and the NEC vector engine (VE) card provide various means to implement these computations on simulated qubits while employing various solvers, including simulated quantum annealing. As current quantum hardware cannot reach the scale of many real-world problems, these simulations give a sense of the potential future technology.
Although this particular work is complete, there remains potential to expand on it by researching other types of NP-Hard optimization problems and implementing other solvers. We will present plots illustrating the results of the performance and accuracy benchmarking.
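For reference, the QUBO formulation being benchmarked can be stated in a few lines of numpy, here with a toy simulated-annealing bit-flip solver standing in for the platforms compared on Sol:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    Q = rng.normal(size=(n, n))
    Q = (Q + Q.T) / 2                         # symmetric QUBO matrix

    x = rng.integers(0, 2, n)                 # random binary start
    energy = x @ Q @ x                        # objective: minimize x^T Q x
    for step in range(20_000):
        T = max(0.01, 2.0 * (1 - step / 20_000))     # linear cooling schedule
        i = rng.integers(n)
        y = x.copy()
        y[i] ^= 1                             # flip one bit
        e_new = y @ Q @ y
        if e_new < energy or rng.random() < np.exp((energy - e_new) / T):
            x, energy = y, e_new
    print("best energy found:", energy)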
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Development of TEZip in PyTorch: Integrating New Prediction Models into an Existing Compression Framework
DescriptionIn high performance computing, researchers often work with extremely large time-series datasets. Compression techniques allow them to shrink their data, enabling quicker and less storage-intensive transfers. TEZip is a compression model that utilizes PredNet (a video prediction model) to predict each frame of a time-series dataset, subtracting these predictions from the actual frames and performing further encoding operations. However, TEZip is currently built on TensorFlow and supports only the PredNet model, which trains and predicts slowly. In this work, we rebuilt TEZip to accommodate PyTorch models, adding functionality for PredNet as well as ConvLSTM, a simpler time-series prediction model. We found that our PyTorch version (specifically with the ConvLSTM model) yields faster compression and decompression times. This work is significant in extending the capabilities of TEZip and suggests that simple prediction models are worth exploring in the realm of prediction-based compression.
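A toy numpy sketch of prediction-based compression, with the trivial previous-frame predictor standing in for PredNet/ConvLSTM: store residuals against predictions and reconstruct losslessly.

    import numpy as np

    rng = np.random.default_rng(0)
    base = rng.integers(0, 200, size=(32, 32), dtype=np.int16)
    # Ten correlated "frames": a fixed scene plus small per-frame noise.
    frames = np.stack([base + rng.integers(-3, 4, (32, 32)).astype(np.int16)
                       for _ in range(10)])

    residuals = np.diff(frames, axis=0)        # frame[t] - frame[t-1]
    print("value spread:", frames.std(), "residual spread:", residuals.std())

    # Lossless reconstruction from frame 0 plus accumulated residuals.
    recon = np.concatenate([frames[:1], frames[:1] + np.cumsum(residuals, axis=0)])
    assert np.array_equal(recon, frames)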
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Improvement of Bridges-2 Resource Utilization Through User Optimization
DescriptionThis poster presents our two-phase solution for improving GPU utilization in NSF-funded ACCESS high-performance computing (HPC) clusters, with a pilot implementation on the Pittsburgh Supercomputing Center’s Bridges-2. Our approach addresses the limitations of Open XDMoD, which lacks per-job GPU usage monitoring and experiences delays in data availability. In phase one, we develop a data ingestion layer to collect GPU indices and resource usage data, utilizing existing software tools for efficient data aggregation and analysis. Analyzing 5,717 completed GPU jobs revealed issues such as workflow configuration errors, framework misconfigurations, and low GPU utilization. In phase two, we create a user-facing platform with modern web tools. This platform will automatically detect inefficiencies, notify users via email, and provide actionable insights to optimize resource management. By addressing these issues and integrating real-time data presentation, we aim to enhance overall system utilization, reduce GPU job wait times, and enable more efficient use of existing resources.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Profiling and Bottleneck Identification for Large Language Model Optimizations
DescriptionLarge language models (LLMs) have shown they can perform scientific tasks. They are capable of assisting researchers in data interpretation, instrument operation, knowledge synthesis, and hypothesis generation. However, LLMs must first be trained on a large dataset of scientific tasks and data. Training these models requires a substantial amount of time, energy, and computational resources, as the process of altering a model’s parameters through each iteration is expensive. Researchers have developed optimizations that can speed up the process of training LLMs with new data. In our research, we aim to profile LLMs with optimizations during the steps of fine-tuning to identify bottlenecks or improvements in runtime. Some of the optimizations we utilized include Low-Rank Adaptation (LoRA), BitFit, and Adapter. From our visual diagrams and runtime charts, we can gain a better understanding of their performance and profile breakdown during training and fine-tuning.
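As a worked illustration of one profiled optimization: LoRA freezes the pretrained weight and trains only a low-rank update, W + (alpha/r)·BA, shrinking trainable parameters from d² to 2dr (a numpy sketch, not the profiling setup):

    import numpy as np

    d, r, alpha = 1024, 8, 16
    rng = np.random.default_rng(0)

    W = rng.normal(size=(d, d))            # frozen pretrained weight
    A = rng.normal(size=(r, d)) * 0.01     # trainable low-rank factor
    B = np.zeros((d, r))                   # zero-init: no shift before training

    def forward(x):
        return x @ (W + (alpha / r) * (B @ A)).T

    print("full params:", d * d, "LoRA params:", 2 * d * r)   # 1048576 vs 16384
    print(forward(rng.normal(size=(2, d))).shape)             # (2, 1024)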
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

A Survey-Based Evaluation of the Efficacy of a Girls Who Code Club at the University of Southern Indiana
DescriptionPrograms like Girls Who Code (GWC) are pivotal in working to inspire and equip young women with the skills and confidence needed to pursue careers in computing. Understanding the impact of such initiatives is particularly important for addressing the decline in interest among girls aged 13 to 17, a critical period for career decision-making. By evaluating the effectiveness of employing a GWC club at our university, this research aims to uncover strategies that can successfully attract and retain women in computer science (CS) in our region. The goal is to not only reverse the trend of declining female participation but also to sustain their interest in the field of computing.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Improving SpGEMM Performance Through Reordering and Cluster-Wise Computation
DescriptionSparse Matrix-Matrix Multiplication (SpGEMM) is a key kernel in many scientific applications and graph workloads. SpGEMM is known to suffer from poor performance due to irregular memory access patterns. Gustavson's algorithm, a traditional approach for SpGEMM, involves row/column-wise operations, facing challenges with irregular accesses to the second matrix. Our research focuses on enhancing memory locality through matrix reordering and cluster-wise computation to address this issue.
In this study, we evaluate the effect of 10 different reordering algorithms on SpGEMM performance. Then, we introduce a novel method that employs cluster-wise SpGEMM, merging similar rows into clusters. Our findings show that matrix reordering can improve SpGEMM performance by up to 2.3×, and our cluster-wise approach can further enhance performance by up to 30%.
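A pure-Python sketch of Gustavson's row-wise algorithm (dict-of-rows in place of CSR) makes the locality problem concrete: rows of B are visited in the irregular order of A's column indices, which is exactly what reordering targets.

    def spgemm_gustavson(A, B):
        C = {}
        for i, row_a in A.items():
            acc = {}
            for k, a_ik in row_a.items():              # nonzeros of row i of A
                for j, b_kj in B.get(k, {}).items():   # scaled row k of B
                    acc[j] = acc.get(j, 0.0) + a_ik * b_kj
            C[i] = acc
        return C

    A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
    B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
    print(spgemm_gustavson(A, B))   # {0: {1: 16.0}, 1: {0: 15.0}}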
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Prompt Phrase Ordering Using Large Language Models in HPC: Evaluating Prompt Sensitivity
DescriptionLarge language models (LLMs) often require well-designed prompts for effective responses, but optimizing prompts is challenging due to prompt sensitivity, where small changes can cause significant performance variations. This study evaluates prompt performance across all permutations of independent phrases to investigate prompt sensitivity and robustness. We used two datasets: GSM8k, for mathematical reasoning, and a custom prompt for summarizing database metadata. Performance was assessed using the llama3-instruct-7B model on Ollama and parallelized in a high-performance computing environment. We compared phrase indices in the best and worst prompts and used Hamming distance to measure performance changes between phrase orderings. Results show that prompt phrase ordering significantly affects LLM performance, with Hamming distance indicating that changes can dramatically alter scores, often by chance. This supports existing findings on prompt sensitivity. Our study highlights the challenges in prompt optimization, indicating that modifying phrases in a successful prompt does not guarantee another successful prompt.
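A minimal sketch of the evaluation setup, with a stub scorer standing in for the llama3 runs: enumerate phrase orderings, score each, and compare orderings by Hamming distance over phrase positions.

    from itertools import permutations

    phrases = ("State the problem.", "Show your steps.", "Give only the answer.")

    def hamming(p, q):
        return sum(a != b for a, b in zip(p, q))

    def score(ordering):
        # Stub standing in for scoring a full llama3 run on benchmark items.
        return hash(ordering) % 100 / 100

    results = sorted((score(p), p) for p in permutations(phrases))
    worst, best = results[0], results[-1]
    print("best :", best)
    print("worst:", worst)
    print("Hamming distance between them:", hamming(best[1], worst[1]))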
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Proposal for a Parallel Automatic Tuning Using d-Spline According to the Operating State of the Computer System
DescriptionSoftware auto-tuning (AT) is a technique that parameterizes the factors affecting a program's performance as "performance parameters" and tunes them automatically. An AT tool searches for better parameter values by repeatedly running the target program, so for programs with long execution times, such as machine learning training, AT takes a very long time. To address this problem, we have tried to reduce this time by running instances of the target program in parallel. However, simple parallelization does not always take full advantage of the parallelism of the computer system.
In this study, we propose a system-resource-based search. This method increases the number of concurrently executing targets so that the system has no idle computing resources. The system-resource-based search is independent of the size of the search space and makes the best use of the computational resources available on the supercomputer.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Assessing Matrix Multiplication Performance with Fully Homomorphic Encryption
DescriptionAs data from various domains is increasingly shared and processed in the cloud, Homomorphic Encryption (HE) provides a crucial solution for ensuring privacy in the post-quantum era. In this work, we evaluate the performance and accuracy of HE matrix multiplication leveraging SEAL library kernels. Moreover, we compare it against EVA, which offers optimized HE parameters for conducting this operation. Our poster shows not only performance results: check out how working with more appropriate parameters reduces the execution time while keeping the accuracy of the result.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Generating Coupled Cluster Code for Modern Distributed Memory Tensor Software
Hide Details
DescriptionScientific groups are struggling to adapt their codes to rapidly evolving GPU-based HPC platforms. The domain of distributed coupled cluster (CC) calculations is no exception. Moreover, our applications to tiny QED effects require higher-order CC methods that involve thousands of tensor contractions, making automated treatment imperative.
The challenge is to allow efficient implementation by capturing key symmetries of the problem, while retaining the abstraction from the hardware. We present the tensor programming framework tenpi, which seeks to find this balance. It features a Python library user interface, global optimization of intermediates, a visualization module and Fortran code generator that bridges the DIRAC package for relativistic molecular calculations to tensor contraction libraries. tenpi brings higher-order CC functionality to the massively parallel module of DIRAC. The architecture and design decision schemes are accompanied by benchmarks and by first production calculations on Summit, Frontier and LUMI along with state-of-the-art tensor contraction software.
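The global optimization of intermediates that tenpi performs can be illustrated with NumPy's contraction-path machinery (a sketch using np.einsum, not tenpi's actual interface; shapes are illustrative):

    import numpy as np

    # A chain of tensor contractions loosely shaped like CC intermediates.
    T1 = np.random.rand(8, 16)
    T2 = np.random.rand(16, 16, 4)
    V  = np.random.rand(4, 8)

    # einsum_path searches for a contraction order that minimizes the
    # cost of intermediates, analogous to the global optimization of
    # intermediates performed by a CC code generator.
    path, info = np.einsum_path('ia,abk,ki->b', T1, T2, V, optimize='optimal')
    print(info)                      # chosen order and FLOP estimate
    result = np.einsum('ia,abk,ki->b', T1, T2, V, optimize=path)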
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Formal Approaches to Characterize Emerging Arithmetic Realizations
Hide Details
DescriptionAs HPC models grow increasingly complex, disparities in floating-point implementations across hardware platforms begin to pose significant challenges to reproducibility and reliability. This is especially so given that HPC employs hardware optimized for performance, which quite often deviates from the IEEE Standard. We leverage SMT solvers, particularly Z3, to develop a rigorous framework for analyzing and verifying the behavior of computer arithmetic implementations in emerging hardware realizations. Using bit-vectors to model IEEE non-standard behaviors, we are able to formally reason about intricate deviations in areas such as rounding rules, subnormal number handling, precision, and normalization. We demonstrate the framework's utility in two key applications: automating feature-targeted hardware testing for undocumented features, and uncovering the degree of conformance to deeper properties such as monotonicity within these non-standard arithmetics. Our work also directly benefits cutting-edge GPU implementations, addressing a timely issue underlying trust in scientific computation.
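A minimal sketch of this style of reasoning using Z3's Python API, checking one such deeper property (monotonicity of addition) for 16-bit floats; the property, precision, and rounding mode are chosen for illustration, not taken from the poster:

    from z3 import FPSort, FPs, fpLEQ, fpGT, fpAdd, RNE, Solver, unsat

    # Does x <= y imply x + c <= y + c under round-to-nearest-even?
    F16 = FPSort(5, 11)        # IEEE binary16: 5 exponent, 11 significand bits
    x, y, c = FPs('x y c', F16)

    s = Solver()
    s.add(fpLEQ(x, y))
    s.add(fpGT(fpAdd(RNE(), x, c), fpAdd(RNE(), y, c)))

    # unsat means no counterexample exists, i.e., the property holds.
    if s.check() == unsat:
        print("addition is monotonic for binary16 under RNE")
    else:
        print("counterexample:", s.model())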
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Integrating HPCToolkit with Tools for Automated Analysis
Hide Details
DescriptionHPCToolkit enables users to gather detailed information about application performance. Users can capture fine-grained measurement data, which may include instruction-level samples on CPUs and GPUs. The collected data can be huge, making manual inspection with GUI tools difficult and time-consuming. We explored the existing tools Hatchet and Thicket for programmatic analysis of performance data to automate this process. However, they were not designed to handle data as large as HPCToolkit's. HPCToolkit's calling context trees are difficult to interpret and visualize using these tools because of their overwhelming detail. Moreover, importing multiple trees into Thicket can be slow, as unifying large trees is costly. To reduce the size of large trees, we implemented heuristics that automatically detect and remove specific code regions. After creating smaller trees that we believe retain all the meaningful information about the program's behavior, we used Thicket to analyze multiple performance profiles measured by HPCToolkit.
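A toy sketch of such a pruning heuristic on a dictionary-based calling context tree (illustrative only; not the actual implementation or the Hatchet/Thicket API):

    def prune(node, total_time, threshold=0.01):
        # Drop subtrees contributing less than `threshold` of total time.
        # `node` is a toy calling-context-tree node:
        # {"name": str, "time": float, "children": [node, ...]}.
        kept = [c for c in node["children"] if c["time"] >= threshold * total_time]
        node["children"] = [prune(c, total_time, threshold) for c in kept]
        return node

    cct = {"name": "main", "time": 100.0, "children": [
        {"name": "solver", "time": 95.0, "children": []},
        {"name": "logging", "time": 0.5, "children": []},   # pruned
    ]}
    cct = prune(cct, cct["time"])
    print([c["name"] for c in cct["children"]])             # ['solver']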
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Performance of LAMMPS-SNAP in Different Runtime Environments
Hide Details
DescriptionLAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a widely used molecular dynamics simulator. It is used here to simulate the high-pressure BC8 phase of carbon using the Spectral Neighbor Analysis Potential (SNAP). This simulation employs the Kokkos C++ performance portability layer for its inter-atomic potential calculations in SNAP on GPUs. We evaluate LAMMPS’ performance across different programming environments and MPI implementations on two leadership-class supercomputers, Perlmutter at NERSC and Frontier at OLCF. Additionally, we analyze performance trends within containerized environments. Our systematic empirical study assesses various configurations on these systems to provide insights and recommendations for optimizing application performance. This study aims to guide users in selecting the most effective setup for their application.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Algorithmic Patterns from Computational Biology for Proxy Application Development and Co-Design
Hide Details
DescriptionHigh-performance computing hardware is co-developed with U.S. DOE codes, and proxy applications (apps) based on these codes are critical technologies for iterative innovation. Numerical modeling/simulation proxies have been the most impactful in co-design. To broaden the types of computation available for co-design, we are developing proxy apps based on MetaHipMer (mhm2), a DOE-developed, scalable, de novo metagenome assembler. MetaHipMer is implemented in C++ and offloads several routines to GPU. It has been used to assemble large (>50 terabase) metagenomes on exascale-class machines (e.g., Summit). Our first proxy focuses on the expensive k-mer analysis step. This and subsequent steps are memory-bound computations using CPU shared-memory distributed data structures (e.g., distributed hash tables). These data structures are often larger than the inputs, and operations on them account for most of the runtime. Our proxies will be implemented in Kokkos, a C++ performance portability programming model, for emerging architecture design/testing.
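A shared-memory toy version of the k-mer counting that the first proxy targets (the real code spreads this hash table across nodes and offloads work to GPUs; the reads and k here are illustrative):

    from collections import Counter

    def kmer_counts(reads, k=21):
        # Count every length-k substring across all reads.
        counts = Counter()
        for read in reads:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1
        return counts

    reads = ["ACGTACGTACGTACGTACGTAC", "CGTACGTACGTACGTACGTACG"]
    print(kmer_counts(reads, k=21).most_common(3))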
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Performance of N10 Benchmarks with Different BLAS Implementations
Hide Details
DescriptionThe NERSC-10 Benchmark Suite is a collection of tests designed to evaluate various aspects of system architecture and performance. This study uses two N10 benchmarks from the suite to evaluate the CPU compute nodes of two leadership-class supercomputers, Perlmutter at NERSC and Summit at OLCF. We compare different BLAS implementations, including vendor-provided options and open-source alternatives. Our analysis reveals performance differences between vendor-provided and open-source BLAS implementations. The results offer valuable insights for users, highlighting which BLAS implementations may be optimal for compute-bound applications on each system.
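The measurement idea can be sketched as follows: time DGEMM through whichever BLAS the Python stack was built against, then repeat with a different BLAS stack loaded (the matrix size is illustrative, and the N10 benchmarks themselves are not reproduced here):

    import time
    import numpy as np
    from scipy.linalg.blas import dgemm

    n = 2048
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    # On an HPC system, rerunning under different modules (e.g., a
    # vendor BLAS vs. OpenBLAS; module names are site-specific)
    # exposes the kind of gap the poster measures.
    t0 = time.perf_counter()
    c = dgemm(1.0, a, b)
    dt = time.perf_counter() - t0
    print(f"{2 * n**3 / dt / 1e9:.1f} GFLOP/s")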
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Eve: Less Memory, Same Might
Hide Details
DescriptionAdaptive optimizers, which adjust the learning rate for individual parameters, have become the standard for training deep neural networks. AdamW is a popular adaptive method that maintains two optimizer state values (momentum and variance) per parameter, doubling the model's memory usage during training. Many proposed memory-efficient optimizers claim to match AdamW's performance but lack its desirable qualities, such as robustness to learning rate changes. This quality is especially desirable when pre-training LLMs, where experimenting with different hyperparameters is infeasible. We propose Eve, a Memory Efficient AdaptiVe Moment Estimation algorithm that saves memory by reducing the variance term while also preserving AdamW's desirable properties across different training settings. We fine-tune Llama 2 70B on 64 GPUs and show memory savings of 20% compared to AdamW. We also compare our method to a recent well-received memory-efficient optimizer called Adam-mini and demonstrate better training stability across various learning rates.
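For context, a plain NumPy sketch of the AdamW baseline, making concrete the two state tensors (m, v) whose storage is at issue; the abstract does not give Eve's update rule, so only the baseline it modifies is shown:

    import numpy as np

    def adamw_step(p, g, m, v, lr=1e-3, b1=0.9, b2=0.999,
                   eps=1e-8, wd=0.01, t=1):
        # Standard AdamW: two state tensors (m, v) per parameter tensor,
        # which is where the 2x optimizer-memory overhead comes from.
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
        return p, m, v

    p = np.zeros(10); g = np.random.rand(10)
    m = np.zeros_like(p); v = np.zeros_like(p)
    p, m, v = adamw_step(p, g, m, v)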
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Lagrangian Particle-Tracking in GPU-Enabled Extreme Scale Turbulence Simulations
Hide Details
DescriptionMany practical turbulent flow phenomena are naturally studied using a Lagrangian approach that treats the fluid medium as a collection of infinitesimal fluid particles. We present a GPU-accelerated algorithm for tracking particles in direct numerical simulations of isotropic turbulence, scaling up to 32768^3 using the world's first exascale computer (Frontier). Cubic spline interpolation is used to compute the particle velocity as the particles wander among sub-domains held by different parallel processes. We use a programming model that minimizes host-device data transfer by leveraging memory parity between the CPU and GPU, reduces communication costs through a local decomposition for the particles, and uses OpenMP offloading on the GPU to accelerate the computation of cubic spline coefficients. The result is an algorithm shown to attain good weak scaling and strong scaling at problem sizes close to the capacity supported by the machine, at a cost nearly independent of the particle count.
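A 1D sketch of the interpolation step using SciPy (the production code evaluates 3D splines on GPUs, with particles migrating between sub-domains; the velocity field here is illustrative):

    import numpy as np
    from scipy.interpolate import CubicSpline

    # Fluid velocity is known on grid points; a particle's velocity is
    # evaluated off-grid with a cubic spline.
    x_grid = np.linspace(0.0, 2 * np.pi, 64)
    u_grid = np.sin(x_grid)                  # stand-in velocity field

    spline = CubicSpline(x_grid, u_grid)
    x_particles = np.random.uniform(0, 2 * np.pi, 5)
    print(spline(x_particles))               # interpolated particle velocities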
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Scalable Motif Counting on Large-Scale Dynamic Graphs
Hide Details
DescriptionMotifs, small subgraphs on k vertices such as triangles and cliques, are well studied in static networks. They are used to characterize different biological networks and to align networks. Counting motifs reveals insights into the topological structure of a network, such as an MPI event graph. However, for large networks and motifs, computing these frequencies is computationally expensive. Recent advances in counting motifs of size k or less show promise but have difficulty scaling. Moreover, methods for counting the frequency of all motifs of size k or less on dynamic networks are still lacking.
We present a scalable method to compute the global edge-based frequencies of motifs of four vertices or fewer on a fully dynamic network. Instead of recomputing the counts from scratch, we update the frequencies based only on the changed edges. Our results show that our method is scalable and outperforms the benchmark by a factor of 10.
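The update-from-changed-edges idea, shown as a minimal sketch for the simplest motif (triangles; the poster covers motifs up to four vertices):

    from collections import defaultdict

    adj = defaultdict(set)
    triangles = 0

    def add_edge(u, v):
        # Update the motif count from the changed edge only: every
        # common neighbor of u and v closes one new triangle.
        global triangles
        triangles += len(adj[u] & adj[v])
        adj[u].add(v); adj[v].add(u)

    def remove_edge(u, v):
        global triangles
        adj[u].discard(v); adj[v].discard(u)
        triangles -= len(adj[u] & adj[v])

    for e in [(0, 1), (1, 2), (0, 2), (2, 3)]:
        add_edge(*e)
    print(triangles)    # 1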
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Empowering Scientific Datasets with Large Language Models
Hide Details
DescriptionThe growing volume and complexity of scientific data pose significant challenges in data management, organization, and analysis. Our objective is to enhance the utilization of historical scientific datasets across various disciplines. To address this, we propose integrating large language models (LLMs) with databases to enable natural language queries, streamlined data retrieval, and analysis. Leveraging LangChain, our approach harnesses the capabilities of LLMs and complements them with data visualization and interpretation tools. Initial results using Llama 3.1 70B demonstrate an 88% success rate in searching and summarizing structured text and numerical data, showcasing the potential for LLM-powered tools to accelerate scientific discovery and innovation.
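The query pattern can be sketched as follows, with a stub llm() standing in for the Llama 3.1 70B call and a toy SQLite schema (the actual system uses LangChain and real scientific databases; all names here are illustrative):

    import sqlite3

    def llm(prompt):
        # Stand-in for the model call: a real deployment would send
        # `prompt` to the LLM and return its completion.
        return "SELECT name, value FROM measurements WHERE value > 0.5;"

    def ask(question, conn):
        schema = "measurements(name TEXT, value REAL)"
        sql = llm(f"Schema: {schema}\nQuestion: {question}\nSQL:")
        return conn.execute(sql).fetchall()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements(name TEXT, value REAL)")
    conn.executemany("INSERT INTO measurements VALUES (?, ?)",
                     [("a", 0.2), ("b", 0.9)])
    print(ask("Which measurements exceed 0.5?", conn))    # [('b', 0.9)]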
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

DisCostiC: Simulating MPI Applications Without Executing Code
Hide Details
DescriptionWe present the cross-architecture parallel simulation framework DisCostiC (Distributed Cost in Clusters). It predicts the performance of real or hypothetical, massively parallel MPI(+X) programs on current and future supercomputer systems. The novelty of DisCostiC is that it employs analytical, first-principle performance models at full scale, including cores, chips, nodes, and networks, while being aware of bottlenecks such as memory bandwidth. DisCostiC uses application skeletons in a Domain-Specific Embedded Language (DSEL), which encodes inter-process dependencies and any number of system and code properties, enabling flexible design space exploration. As a consequence of the model-based design, DisCostiC requires much less time and resources than traditional simulators because the application code is never actually run. This is in contrast to state-of-the-art solutions, which are based on trace data and/or simulated code execution and may thus need considerable resources. The resulting traces can be visualized using Chromium Browser, ITAC, or Vampir.
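A minimal sketch of a first-principles cost model in this spirit (not DisCostiC's DSEL; all machine parameters are illustrative): predict time for a message exchange plus a memory-bound compute phase.

    def predicted_time(n_bytes, flops, bytes_moved,
                       latency=2e-6, bandwidth=25e9,   # network: s, B/s
                       mem_bw=200e9, peak=1e12):       # node: B/s, FLOP/s
        # Latency-bandwidth model for communication.
        t_comm = latency + n_bytes / bandwidth
        # Roofline-style compute estimate: bound by memory or FLOPs.
        t_comp = max(flops / peak, bytes_moved / mem_bw)
        return t_comm + t_comp

    print(predicted_time(n_bytes=8 * 512**2, flops=5e9, bytes_moved=4e9))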
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Identifying Regions of Non-Determinism in HPC Simulations Through Event Graph Alignment
Hide Details
DescriptionHigh performance computing (HPC) applications using MPI (Message Passing Interface) often face non-determinism (ND) due to asynchronous MPI calls, making ND source identification challenging. Modeling execution as an event graph, where MPI calls are nodes and communication is edges, can be useful. Focusing on Message ND, which involves variability in MPI communication order across runs, we detect potential ND sources by comparing edge sets between event graphs. Accurate comparison requires aligning event graph nodes, but traditional methods like NetAlign, graphlet degree vectors, and Graph Auto-Encoders struggle due to the regularity of event graphs. We propose a meta graph heuristic utilizing structural constraints and a message passing scheme for sparse directed acyclic graphs, achieving up to 70% improvement in alignment accuracy over conventional techniques.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Fault Tolerance in Krylov Subspace Methods
Hide Details
DescriptionToday's complex HPC systems are incredibly powerful yet equally likely to experience failures. The scientific applications on these HPC systems are mostly iterative in nature. Iterative solvers have some inherent fault tolerance, but they are still susceptible to errors. One subset of these iterative methods is the Krylov subspace methods. There has been limited research on the fault tolerance of these methods against soft errors. The Preconditioned Conjugate Gradient (PCG) method is known to be self-correcting in nature, but little is known about the other Krylov subspace methods. Our goal is to study the error propagation caused by the Sparse Matrix-Vector Multiplication (SpMV) operation in the Lanczos, Bi-Conjugate Gradient (BiCG), and PCG methods. Using the results of these experiments and knowledge from previous work, we will generalize our findings to the other Krylov subspace methods.
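A minimal sketch of the kind of fault-injection experiment involved, flipping one bit in an SpMV result (the matrix, bit position, and target row are illustrative):

    import numpy as np

    def spmv_with_bitflip(A, x, bit=52, row=0):
        # Inject a single bit flip into one entry of the SpMV result to
        # mimic a silent data corruption; a solver study would then run
        # Krylov iterations on the corrupted vector.
        y = A @ x
        raw = np.frombuffer(y.tobytes(), dtype=np.uint64).copy()
        raw[row] ^= np.uint64(1) << np.uint64(bit)
        return np.frombuffer(raw.tobytes(), dtype=np.float64)

    A = np.diag(np.arange(1.0, 6.0))
    x = np.ones(5)
    print(A @ x)
    print(spmv_with_bitflip(A, x, bit=52, row=2))   # corrupted entry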
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Prototype Development and Testing of a Smart Buoy System for Coastal and Marine Ecosystems Using IBIS
Hide Details
DescriptionEffective environmental monitoring traditionally required physical presence at research sites. IoT technologies now enable remote data collection, revolutionizing this field. This work presents a prototype of smart buoys designed for coastal and marine ecosystems. Located in the IBIS testbed, these buoys are equipped with various sensors and Single Board Computers (SBCs) that not only collect real-time data but also offer significant growth potential. As the number of deployed buoys increases, they can integrate with supercomputing resources for advanced data processing and analysis. Initial tests demonstrate their ability to monitor environmental parameters accurately, enhancing weather forecasting, storm tracking, and maritime safety. The results underscore the potential of IoT-based smart buoys to advance remote monitoring, improve decision-making, and drive innovation in marine research and conservation.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

AI-Based Scalable Analytics for Improving Performance and Resilience of HPC Systems
Hide Details
DescriptionAs high-performance computing (HPC) advances to exascale levels, its role in scientific fields such as medicine, climate research, finance, and scientific computing becomes increasingly critical. However, these large-scale systems are susceptible to performance variations caused by anomalies, including network contention, hardware malfunctions, and shared resource conflicts. These anomalies can lead to increased energy consumption, scheduling inefficiencies, and reduced application performance. Therefore, accurately and promptly diagnosing these performance anomalies is essential for maintaining the efficiency and reliability of HPC systems. Machine learning offers a powerful approach to automating the detection of such anomalies by learning patterns from the vast amounts of complex telemetry data generated by these systems. Our research focuses on increasing the efficiency and resilience of HPC systems through automated telemetry analytics, and this poster presentation will summarize our efforts and findings in this domain.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Large Genomic Language Models: Towards Their Hyperparameter Optimization
Hide Details
DescriptionThis study explores hyperparameter optimization for encoder-only genomics large language models, balancing machine learning performance, hardware resource usage and power consumption. Multiple search techniques and objective functions were implemented, and from the results we obtained two types of models: a Compact and an Optimal model. Comprehensive tests were carried out to rank these models based on their model performance and resource usage.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX

Poseidon: A Source-to-Source Translator for Holistic HPC Optimization of Ocean Models on Regular Grids
Hide Details
DescriptionOcean simulation models often underperform on modern high-performance computing (HPC) architectures, necessitating costly and time-consuming code rewrites.
We introduce Poseidon, an HPC-oriented source-to-source translator for Fortran-based fluid dynamics solvers used in ocean and weather models with regular grid structures. Poseidon aims to recover high-level information and semantics lost during the process of converting numerics to source code.
We demonstrate Poseidon's approach using a research code implementing the 2D fast barotropic solver of full 3D ocean simulation models, which involves over 20 stencil-like kernels. Kernel fusion-based code optimization can already lead to a high combinatorial complexity.
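As an illustration of what kernel fusion changes (a toy 1D example in Python, not Poseidon's generated Fortran): two stencil sweeps are merged into one loop, eliminating the intermediate array and one pass over memory.

    def unfused(u):
        n = len(u)
        a = [0.0] * n
        for i in range(n):               # kernel 1: smooth
            a[i] = 0.5 * (u[i - 1] + u[(i + 1) % n])
        b = [0.0] * n
        for i in range(n):               # kernel 2: residual
            b[i] = a[i] - u[i]
        return b

    def fused(u):
        n = len(u)
        b = [0.0] * n
        for i in range(n):               # one loop, no intermediate array
            b[i] = 0.5 * (u[i - 1] + u[(i + 1) % n]) - u[i]
        return b

    u = [float(i % 7) for i in range(16)]
    assert unfused(u) == fused(u)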
Preliminary results include various performance studies with and without data flow graph-based modifications based on an exhaustive search for kernel fusion. Measurements show that Poseidon can generate optimized Fortran code.
In future work, Poseidon's automatic code rewriting should help to port existing code to GPUs, hide process communication latency, and apply automatic differentiation.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TimeTuesday, 19 November 2024 12pm - 5pm EST
LocationB302-B305
TP
XO/EX
Algorithmic and Optimization Techniques for Graph Applications in Heterogeneous Systems at Scale
Hide Details
DescriptionAs heterogeneity becomes commonplace in HPC systems, algorithmic and optimization techniques are needed to address the challenges that come with it, especially for irregular applications. This includes workload balancing, scheduling, latency tolerance, and memory utilization and contention, among others. This showcase covers three works addressing key questions in running complex irregular graph applications on heterogeneous systems: programmability, performance portability, memory efficiency, load balancing, and scalability.
The first work explores the efficacy of utilizing commercial high-level synthesis tools to accelerate two different graph sampling methods on FPGAs. We achieve up to a 40x speedup compared to the baseline OpenCL kernel, and identify key areas for toolchain improvements, such as memory subsystems and latency tolerance.
The second work focuses on improving breadth-first probabilistic traversals (BPTs), as they dominate runtime in some applications. By identifying and exploiting redundancies in edge accesses, we achieve an average of 75x and 135x speedups when deployed on two different frameworks. We also demonstrate strong scaling up to 4,096 nodes on OLCF Frontier enabled by CPU-GPU heterogeneous workload balancing.
The third work, currently in progress, explores the use of lossy compression to enable training of graph neural networks. We have promising preliminary results, showing compression ratios between 6x and 20x with minimal accuracy loss on both GCN and GAT. We identify future directions and use cases for this method with an emphasis on systems integration, such as larger batch sizes in mini-batch training, compressing feature-vector caches, and adaptive compression methods for heterogeneous and dynamic GNNs.
Efficient, Scalable, Robust Neuromorphic High Performance Computing
Hide Details
DescriptionThe rapid advancement in Artificial Neural Networks (ANNs) has paved the way for Spiking Neural Networks (SNNs), which offer significant advantages in energy efficiency and computational speed, especially on neuromorphic hardware. My research focuses on the development of Efficient, Robust, and Scalable Heterogeneous Recurrent Spiking Neural Networks (HRSNNs) for high-performance computing, addressing key challenges in traditional digital systems, such as high energy consumption due to ADC/DAC conversions and vulnerability to process variations, temperature, and aging.
HRSNNs leverage the diversity in neuronal dynamics and Spike-Timing-Dependent Plasticity (STDP) to improve memory capacity, learn complex patterns, and enhance network performance. By incorporating unsupervised learning models and biologically plausible pruning techniques, we maintain network stability and computational efficiency. A notable contribution of this work is the introduction of Lyapunov Noise Pruning (LNP), which leverages temporal overparameterization to achieve significant reductions in network complexity without compromising accuracy.
Our approach also explores DNN-SNN hybrid models, which combine the strengths of deep neural networks and spiking networks for tasks such as object detection, demonstrating competitive accuracy with lower power consumption. Additionally, we propose a Processing-in-Memory (PIM) hardware platform for on-chip acceleration, further enhancing the scalability of our models.
This research represents a step towards scalable, energy-efficient, and robust SNNs, enabling their deployment in real-time, on-device learning, and inference, crucial for future AI applications in resource-constrained environments.
Going Beyond the Chicken and Egg Situation with Modern MPI Features
Hide Details
DescriptionMPI has become the de facto standard for distributed memory computing since its inception in 1994. While the MPI standard has evolved to include new technologies like RDMA, many applications still rely on the original set of MPI operations.
This thesis investigates the current usage of MPI. We note that developers underutilize modern MPI features, as their implementations are often not optimized. On the other hand, because many users rely on the "old" MPI features, MPI implementation developers have no incentive to optimize the new features. As a consequence, there is no incentive for MPI users to learn the new features, creating a vicious cycle.
To break this cycle, this thesis explores three main approaches:
1) Facilitating correctness checking tool support,
2) Modernizing MPI codes with compiler-based approaches, and
3) Exploiting compiler knowledge to further optimize the implementation of modern MPI features.
In order to facilitate the development and improvement of tools aiding with MPI development, this thesis introduces the correctness benchmark MPI-BugBench as a standardized benchmark to evaluate the real-world applicability of such tools. Further, we show that compiler-based automatic modernization methods can encourage early adoption of new MPI features with minimal programmer effort (for example, partitioned operations).
Lastly, compiler knowledge can be utilized to further optimize the performance of MPI implementations (for example, persistent operations). The use of compiler knowledge, in particular, enables modernization of existing MPI codes without requiring application developers to rewrite them.
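As an illustration of one such modern feature, persistent communication is set up once and restarted every iteration (an mpi4py sketch assuming two ranks; run with mpiexec -n 2; this is not the thesis's tooling):

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    peer = 1 - rank

    sendbuf = np.full(4, rank, dtype='d')
    recvbuf = np.empty(4, dtype='d')

    # Setup cost is paid once, instead of per iteration as with
    # plain MPI_Send/MPI_Recv.
    reqs = [comm.Send_init(sendbuf, dest=peer, tag=0),
            comm.Recv_init(recvbuf, source=peer, tag=0)]

    for _ in range(10):
        MPI.Prequest.Startall(reqs)
        MPI.Request.Waitall(reqs)

    if rank == 0:
        print(recvbuf)    # filled with 1.0 from the peer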
Effects of Lossy Compression Data on Machine Learning Models
Hide Details
DescriptionMachine learning is a fundamental tool incorporated in fields across academia and industry. Due to the large amounts of data needed for training machine learning models, compression is utilized to reduce the data footprint, playing a critical role in storage. Machine learning uses algorithms and models to learn patterns in data, allowing AI to make decisions without explicit programming. Compression, in turn, uses encoding and decoding techniques to reduce file size; it can be lossy or lossless: lossy discards some data, while lossless preserves it.
This dissertation explores the accuracy and scalability of machine learning when working with lossy, distorted data. The performance metrics studied examine how accurately the models perform inference. Challenges at the intersection of machine learning and lossy compression include data storage, data transfer bandwidth, and processing. Across these issues, machine learning is examined in different domains, and this work investigates how meaningful patterns can still be extracted from the distorted data.
The primary focus is on neural network models' ability to handle lossy compressed data and on ways to mitigate loss due to distortion, addressing machine learning across various domains, including object detection, semantic segmentation, and image classification, to find the balance between compression ratio and data quality.
Scalable Planning Platform for Orchestration of Autonomous Systems Across Edge-Cloud Continuum
Hide Details
DescriptionEdge accelerators, such as NVIDIA Jetson, are enabling rapid inference of deep neural network (DNN) models and computer vision algorithms through low-end graphics processing unit (GPU) modules integrated with ARM-based processors. Their compact form factor allows integration with mobile platforms, such as unmanned aerial vehicles (UAVs) with onboard cameras, facilitating real-time execution of diverse scientific workflows, from wildfire monitoring to disaster management. The limited compute resources of mobile edge accelerators necessitate collaboration with remote servers in the cloud for processing compute-intensive workloads. These remote servers can include high-performance computers, serverless cloud platforms offering Functions-as-a-Service (FaaS), or private GPU servers.
In my PhD dissertation, the work proposes and implements a scalable platform designed to support multiple mobile devices (UAVs) with edge accelerators, collaborating with remote servers to provide real-time performance for a wide range of spatio-temporal autonomous applications. The platform incorporates deadline-driven scheduling heuristics, strategies for preemptively dropping tasks based on their earliest deadlines, migration of tasks from edge to cloud, work stealing from cloud back to edge, and adaptation to network variability, all while ensuring quality of service (QoS). Outputs from the servers can be used by other mobile devices or the planning platform itself to orchestrate the next set of tasks in the workflow. Evaluations against baseline algorithms and multiple workloads demonstrate that the proposed heuristics achieve an optimal balance between task completion and accrued utility.
Toward Performance & Portability & Productivity in Parallel Programming
Hide Details
DescriptionAchieving *performance*, *portability*, and *productivity* for data-parallel computations (e.g., MatMul and convolutions) has emerged as a major research challenge. The complex hardware design of contemporary parallel architectures, including GPUs and CPUs, requires advanced program optimizations to fully exploit their performance potential. Furthermore, due to the diverse hardware landscape, it has proven challenging to achieve (performance) portability: different architectures require different kinds of optimizations, thereby posing challenging, often even contradictory, requirements on code optimization. Also, the complexity of achieving performance and portability must be hidden behind a user-productive programming interface to make programming modern architectures manageable.
This thesis introduces a novel approach to code *generation* and *optimization* for data-parallel computations targeting modern parallel architectures. The ultimate goal of our approach is to simultaneously achieve *performance*, *portability*, and *productivity*, in one combined approach, which is identified as a major research challenge.
The first part of this thesis introduces the algebraic formalism of Multi-Dimensional Homomorphisms (MDH) — a novel approach to generating code that can be fully automatically optimized (auto-tuned) for a particular target architecture and characteristics of the input and output data (such as size and memory layout); our code generation approach is hidden behind a productive user interface that expresses a wide range of data-parallel computations.
The second part of this thesis introduces the Auto-Tuning Framework (ATF) for automatically optimizing parameterized program code (as generated by our MDH approach). In contrast to existing auto-tuners, ATF supports so-called constrained tuning parameters which are ubiquitous in modern parallel programming.
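A toy illustration of constrained tuning parameters (not ATF's API): candidate tile sizes are valid only if they divide the problem size and fit a memory budget, both of which are illustrative numbers here.

    from itertools import product

    N, SMEM = 1024, 48 * 1024      # problem size, shared-memory budget (bytes)

    def valid(tx, ty):
        # Constraints couple the parameters: divisibility and a
        # capacity limit (8 bytes per double-precision element).
        return N % tx == 0 and N % ty == 0 and 8 * tx * ty <= SMEM

    candidates = [(tx, ty)
                  for tx, ty in product([8, 16, 32, 64, 128], repeat=2)
                  if valid(tx, ty)]
    print(len(candidates), "valid configurations, e.g.", candidates[:3])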
Efficient Large Dynamic Graph Analysis on Emerging Storage Technology
Hide Details
DescriptionGraph-structured data analysis is extensively used in various real-world applications, including biology, social media, and recommendation systems. With the increasing prevalence of real-time data, many graphs become dynamic and evolve over time. Thus, dynamic graph processing systems become a necessary tool to store these real-time updates and continuously run analytic algorithms that provide insights into the data. However, these systems require special design to efficiently support both tasks. As a result, demand in this research direction has grown in recent years, with numerous low-level data structures and high-level systems addressing different aspects of dynamic graph processing.
With the high demand for data-intensive systems and the growing volume of data, many emerging storage hardware technologies have been added to the storage hierarchy, including persistent memory. Due to its promising features such as low latency, high density, and byte-addressable accessibility, persistent memory has gained the attention of researchers and developers of high-performance data-intensive applications. As such, it is not surprising that we expect to see persistent memory usage in dynamic graph processing systems due to the need for high performance and capacity. Therefore, our research aims to explore efficient ways of designing and implementing dynamic graph processing systems on persistent memory.
Enhancing HPC I/O Performance: Leveraging Runtime and Offline I/O Optimization Frameworks
Hide Details
DescriptionThe existing parallel I/O stack is complex and difficult to tune due to the interdependencies among multiple factors that impact the performance of data movement between storage and compute systems. When performance is slower than expected, end-users, developers, and system administrators rely on I/O profiling and tracing information to pinpoint the root causes of inefficiencies. Despite having numerous tools that collect I/O metrics on production systems, it is not obvious where the I/O bottlenecks are (unless one is an I/O expert), their root causes, and what to do to solve them. Hence, there is a gap between the currently available metrics, the issues they represent, and the application of optimizations that would mitigate performance slowdowns. Streamlining such analysis, investigation, and recommendations could close this gap without requiring a specialist to intervene in every case.
This dissertation explores how this translation gap can be closed by introducing two innovative frameworks that leverage both offline and online analysis and tuning methodologies. The offline framework, named Drishti I/O, provides interactive visualizations that detail an application's I/O behavior. It pinpoints the root causes of I/O bottlenecks and offers actionable recommendations to enhance performance. The runtime framework extends the capabilities of the Recorder I/O tracing tool by incorporating a dynamic I/O prediction and optimization system. This system leverages context-free grammar to optimize I/O behavior in real time during application execution. Together, these frameworks offer a comprehensive approach to improving I/O performance through detailed analysis and real-time optimizations.
Q-NFSO: Exploring Quantum Applications, Noise Management, Fault Injection, Resource Scheduling and Optimization in the NISQ Era
Hide Details
DescriptionQuantum computing has achieved significant milestones in recent years, underscoring its potential benefits for NP-hard applications both currently and in the future. Despite these advancements, contemporary quantum computers are hindered by noise and a limited number of qubits. These limitations pose significant challenges for quantum applications, noise management, resource scheduling, and optimization. This research addresses the gap between quantum algorithms and hardware characteristics by examining quantum applications from high-level circuit definitions to noise model generation, fault injection, resource management, and optimization.
This study highlights the quantum advantage over classical computing, explores early quantum neural networks, evaluates quantum metrics, and constructs a noise model based on quantum hardware using fault injection techniques to investigate the vulnerabilities of quantum algorithms, operations, and qubits. Our research considers the uncertainty factors of qubits, including random faults and single and double fault injections with circuit cutting and its limitations. The final stage evaluates the performance of quantum jobs submitted to backends under controlled noise, uncontrolled errors, and established metrics.
The proposed Quantum NFSO (Noise, Fault, and Scheduling Optimization) model presents a comprehensive approach to scheduling and optimization, accounting for noise, random errors, resource management, job scheduling, and circuit optimization. While quantum computers hold immense potential, it is essential to calibrate expectations appropriately. In the NISQ (Noisy Intermediate-Scale Quantum) era, scientists and researchers must align their perspectives with the technology's limitations. Understanding these constraints is crucial for advancing future computational technologies, including quantum computing.
Data Layout Optimizations for Tensor Applications
Hide Details
DescriptionThe performance of tensor applications is often bottlenecked by data movement across the memory subsystem. This dissertation contributes domain-specific programming frameworks (compilers and runtime systems) that optimize data movement in tensor applications. We develop novel execution reordering and data reorganization techniques, achieving performance portability along with improved programmability.
We present BrickDL, a compiler framework that performs "merged execution" of fused deep learning operators as graph-level optimization. We employ fine-grained data blocking with "bricks" — a data layout of small, fixed-size blocks of contiguously packed data that enhance on-chip data locality on GPUs. BrickDL demonstrates up to 18% improved performance and 16% reduced DRAM data movement compared to existing deep learning frameworks for prominent models on NVIDIA and AMD GPUs.
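The brick layout can be sketched in NumPy as a reorganization into contiguous fixed-size blocks (a toy 2D version with 4x4 bricks; not BrickDL's implementation):

    import numpy as np

    # A 2D field is reorganized into 4x4 blocks stored contiguously, so
    # a computation touching one block reads one contiguous chunk.
    B = 4
    field = np.arange(16 * 16, dtype=np.float32).reshape(16, 16)

    bricks = (field.reshape(16 // B, B, 16 // B, B)
                   .transpose(0, 2, 1, 3)    # brick grid, then intra-brick
                   .copy())                  # make each brick contiguous
    print(bricks.shape)                      # (4, 4, 4, 4)
    print(bricks[0, 0])                      # one contiguous 4x4 brick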
The sequence of layers in neural networks is analogous to the nested hierarchy of grids in the Geometric Multigrid (GMG) iterative solver. The series of stencil calculations in the GMG V-cycle results in its memory-bound performance. We hence extend the optimizations in BrickDL to BrickGMG, a framework for restructuring computations and exploiting inter-operator reuse in the V-cycle. BrickGMG provides performance portability across NVIDIA, AMD, Intel GPUs, achieving 55% speedup over HPGMG and 73% of Roofline performance on average.
We develop MLTT, a compiler optimization pipeline in LLVM MLIR for arbitrary tensor transpositions, which are the primary performance bottleneck in tensor contractions for transforming data layouts. MLTT is portable across various CPU vector instruction sets. We integrate MLTT with COMET, an MLIR-based compiler, and present speedups of >40% for memory-bound tensor contractions.
Designing Efficient Data Reduction Approaches for Multi-Resolution Simulations on HPC Systems
Hide Details
DescriptionAs supercomputers advance towards exascale capabilities, computational intensity increases significantly, and the volume of data requiring storage and transmission experiences exponential growth. Multi-resolution methods, such as Adaptive Mesh Refinement (AMR), have emerged as an effective solution to address these challenges. Concurrently, error-bounded lossy compression is recognized as one of the most efficient approaches to tackle the latter issue. Despite their respective advantages, few attempts have been made to investigate how the multi-resolution method and error-bounded lossy compression can function together.
To address this gap, this dissertation introduces a series of optimizations for data reduction solutions in multi-resolution simulations:
(1) This dissertation first enhances the offline compression quality of multi-resolution data for different state-of-the-art scientific compressors by adaptively preprocessing the data and optimizing the compressor.
(2) This dissertation then presents a novel in-situ lossy compression framework, utilizing HDF5 and enhanced SZ2, specifically tailored for real-world AMR applications. This framework can reduce I/O costs and improve compression quality.
(3) Furthermore, to extend the usability of multi-resolution techniques, this dissertation introduces a workflow for multi-resolution data compression, applicable to both uniform and AMR simulations. Initially, the workflow employs a Region of Interest (ROI) extraction approach to enable multi-resolution methods for uniform data. Subsequently, to bridge the gap between multi-resolution techniques and lossy compressors, we optimize three distinct compressors, ensuring their optimal performance on multi-resolution data. Lastly, we incorporate an advanced uncertainty visualization method into our workflow to help users understand the potential impacts of lossy compression.
FFT-Based Spherical Harmonics and Radial Transforms on GPU
Hide Details
DescriptionModern high-performance computing clusters are switching to GPUs, rather than CPUs, as the source of their computational power. GPUs are tailored for data-parallel algorithms in which multiple cores perform the same operations on different memory locations. However, making CPU code run within GPU constraints is often a non-trivial task. Firstly, not all algorithms are easy to parallelize. Secondly, there is no single way to program GPUs from different manufacturers, all of whom promote their own solutions. To solve this issue, a runtime GPU code generation and optimization platform, PfSolve, has been developed during this PhD. Originally based on VkFFT (a Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal Fast Fourier Transform library), PfSolve has been generalized and restructured.
QuiCC is a code under development in our research group, designed to solve the equations of magnetohydrodynamics in a full sphere and other geometries. It uses a fully spectral approach, with the Jones-Worland (JW) polynomials as a radial basis and spherical harmonics (SH) as a spherical basis. The main goal of this dissertation is a GPU implementation of the FFT-based algorithm for their evaluation, which is more accurate and requires less memory than the regular quadrature approach. One of its main building blocks is an efficient tridiagonal GPU solver, developed with a warp-level programming approach.
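For reference, the recurrence such a solver evaluates is the classic Thomas algorithm, shown below as a sequential NumPy sketch; the dissertation's warp-level GPU version would instead use a parallel scheme such as cyclic reduction, which is not reproduced here.

    import numpy as np

    def thomas(a, b, c, d):
        # Solve a tridiagonal system: sub-diagonal a, diagonal b, super-diagonal c.
        n = len(d)
        cp, dp = np.empty(n), np.empty(n)
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):  # forward elimination
            m = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / m if i < n - 1 else 0.0
            dp[i] = (d[i] - a[i] * dp[i - 1]) / m
        x = np.empty(n)
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):  # back substitution
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x

    # 1-D Poisson-like system; the exact solution is [1, 1, 1].
    print(thomas(np.array([0.0, -1.0, -1.0]), np.array([2.0, 2.0, 2.0]),
                 np.array([-1.0, -1.0, 0.0]), np.array([1.0, 0.0, 1.0])))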
This work also presents additional algorithms redesigned within the platform, such as a finite-difference solver and double-double emulation of FP128 arithmetic on GPUs.
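The double-double idea can be illustrated with error-free transformations: each extended-precision value is a pair of doubles (hi, lo). The sketch below shows Knuth's two-sum and a simplified pairwise addition; a production GPU kernel, as in the dissertation, would add further renormalization steps not shown here.

    def two_sum(a, b):
        # Error-free transformation: a + b == s + err exactly in IEEE doubles.
        s = a + b
        v = s - a
        err = (a - (s - v)) + (b - v)
        return s, err

    def dd_add(x, y):
        # Add two double-double numbers (hi, lo); simplified renormalization.
        s, e = two_sum(x[0], y[0])
        e += x[1] + y[1]
        return two_sum(s, e)

    # 1 + 2**-80 is unrepresentable in one double but survives as a pair.
    print(dd_add((1.0, 0.0), (2.0 ** -80, 0.0)))  # (1.0, ~8.27e-25)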
High-Performance Computing Resilience Analysis Using Large Language Models
Hide Details
DescriptionThis doctoral showcase highlights three pivotal works conducted during my PhD that collectively advance the field of high-performance computing (HPC) resilience analysis using large language models (LLMs).
The first work introduces HAPPA, a modular platform for HPC Application Resilience Analysis. HAPPA integrates LLMs to understand long code sequences, employing novel code representation techniques to predict resilience accurately. On the DARE dataset, HAPPA demonstrates superior predictive accuracy over existing models, achieving a mean squared error (MSE) of 0.078 in Silent Data Corruption (SDC) prediction and significantly outperforming the PARIS model.
Building on this foundation, the second work investigates the resilience of loops in HPC programs through a semantic approach. By analyzing the computational patterns known as the 13 dwarfs of parallelism, this study quantifies the SDC rates for each pattern. Utilizing LLMs with prompt engineering, the research identifies loop semantics, providing insights into which loops are more error-prone and enhancing the development of resilient HPC applications.
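The prompt-engineering step might look roughly as follows; the exact prompts and model client used in this work are not public, so query_llm and the prompt wording below are hypothetical stand-ins.

    DWARFS = ["dense linear algebra", "sparse linear algebra", "spectral methods",
              "n-body methods", "structured grids", "unstructured grids",
              "Monte Carlo"]  # first seven of the 13 dwarfs, truncated here

    def query_llm(prompt):
        # Hypothetical stand-in for an actual LLM client call.
        raise NotImplementedError("plug in a real LLM client here")

    def classify_loop(loop_source):
        # Ask the model which computational pattern a loop implements.
        prompt = ("Classify the following HPC loop as one of these patterns: "
                  + ", ".join(DWARFS) + ". Answer with the pattern name only.\n\n"
                  + loop_source)
        return query_llm(prompt).strip().lower()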
Expanding the scope further, the third work evaluates the capabilities of LLMs in comprehending the syntax and semantics of Intermediate Representation (IR) code. The study conducts a comprehensive analysis using models such as GPT-4o, GPT-3.5, and CodeLlama. By performing tasks such as decompiling IR code, generating control-flow graphs (CFGs), and simulating IR code execution, the research provides insights into the effectiveness of LLMs in handling low-level code analysis and their potential applications in program analysis.
These studies collectively demonstrate the potential of LLMs in enhancing the resilience of HPC applications through innovative analysis techniques and predictive modeling.
Supporting End Users in Implementing Quantum Computing Applications
Hide Details
DescriptionQuantum computing applications require expert knowledge to perform complex steps: (1) selecting a suitable quantum algorithm, (2) generating the quantum circuit, (3) compiling/executing the quantum circuit, and (4) decoding the results. This creates high entry barriers for end users with limited expertise who need solutions for domain-specific problems. This poster highlights methods developed to assist end users, resulting in multiple open-source tools in the Munich Quantum Toolkit (MQT) on GitHub.
The poster focuses on three main tasks:
1) End-User Workflow (MQT ProblemSolver): Providing a workflow from classical input to a quantum solution and back to classical format (a generic version of this path is sketched after this list).
2) Quantum Device Selection and Compilation (MQT Predictor): Selecting and efficiently compiling the most suitable quantum device.
3) Benchmark Suite (MQT Bench): Offering representative test cases in a benchmark suite of quantum applications.
These tools simplify quantum computing, making it accessible to non-experts.
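For orientation, the four steps of that workflow look roughly as follows in plain Qiskit (assuming the qiskit and qiskit-aer packages are installed); this is a generic sketch, not the MQT tools' API.

    from qiskit import QuantumCircuit, transpile
    from qiskit_aer import AerSimulator

    # Steps (1)-(2): select an algorithm and generate its circuit (3-qubit GHZ).
    qc = QuantumCircuit(3)
    qc.h(0)
    qc.cx(0, 1)
    qc.cx(0, 2)
    qc.measure_all()

    # Step (3): compile for a target backend and execute.
    backend = AerSimulator()
    counts = backend.run(transpile(qc, backend), shots=1024).result().get_counts()

    # Step (4): decode the measurement counts into a classical answer.
    print(counts)  # roughly half '000' and half '111'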
Accelerating Communications in High-Performance Scientific Workflows
Hide Details
DescriptionAdvances in networks, accelerators, and cloud services encourage programmers to reconsider where to compute; for example, fast networks can make it cost-effective to compute on remote accelerators despite the added latency. Workflow and cloud-hosted serverless computing frameworks can manage multi-step computations spanning federated collections of cloud, high-performance computing, and edge systems, but passing data among computational steps remains a challenge when applications are compositions of distinct software components with differing communication mechanisms and patterns.
This work introduces a new programming paradigm that decouples data flow from control flow by extending the pass-by-reference model to distributed applications. ProxyStore, developed here, implements this paradigm through object proxies that act as wide-area object references with just-in-time resolution. The proxy model enables producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. This decoupling enables the dynamic selection of different data movement methods depending on what data are moved, where they are moved, and when, a longstanding challenge in distributed applications.
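The core mechanism can be sketched in a few lines of plain Python; this is a conceptual illustration of just-in-time resolution, not ProxyStore's actual implementation.

    class Proxy:
        # Defers fetching the target until first use, mirroring
        # pass-by-reference across process or site boundaries.
        def __init__(self, factory):
            self._factory = factory  # e.g. a closure that reads an object store
            self._target = None

        def _resolve(self):
            if self._target is None:
                self._target = self._factory()  # just-in-time fetch
            return self._target

        def __getattr__(self, name):
            # Called only for attributes not found on the proxy itself.
            return getattr(self._resolve(), name)

    p = Proxy(lambda: "payload fetched lazily")
    print(p.upper())  # resolution happens here, on first attribute access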
The efficacy of the proxy paradigm is further demonstrated through four high-level proxy-based programming patterns (distributed futures, streaming, ownership, and stateful actors) applied to real-world computational science applications, making the power of the paradigm accessible to more complex and dynamic distributed program structures. ProxyStore is evaluated through standardized benchmark suites, introduced here, and meaningful science applications spanning bioinformatics, federated learning, and molecular design, in which substantial improvements in runtime, throughput, and memory usage are demonstrated.
Accelerating HPC Workflow Results and Performance Reproducibility Analytics
Hide Details
DescriptionModern high-performance computing (HPC) workflows produce massive datasets, often exceeding 100 TB per day, driven by instruments collecting data at gigabytes per second. These workflows, executed on advanced HPC systems with heterogeneous storage devices, high-performance microprocessors, accelerators, and interconnects, are increasingly complex and often involve non-deterministic computations. In this context, thousands of processes share computing resources, using synchronization for consistency. The intricate interactions among processes and the presence of non-deterministic operations make it difficult to explore workflow behaviors, ensure reproducibility, optimize performance, and reason about what happens when processes compete for resources. Existing reproducibility analysis frameworks are not well suited to identifying the sources and locations of non-determinism and performance variation, as they often focus on final workflow results and general statistics about workflow performance.
We address these challenges by introducing scalable techniques that accelerate the comparison of intermediate workflow results using variation-tolerant hashing of floating-point datasets, thus improving result reproducibility. We also capture workflow performance profiles and benchmark various queries to analyze workflow performance reproducibility, and we identify opportunities to optimize the loading and indexing of performance data to ensure minimal initialization and querying overhead. Using the collected performance data, we propose a cache-aware staggering technique that leverages workflow I/O profiles to reduce bottlenecks and resource contention, particularly in workflows that share the same input data. Our evaluations across molecular dynamics, cosmology, and deep learning workflows demonstrate significant speedups in intermediate-result reproducibility analyses compared to state-of-the-art baselines, and show our ability to propose workflow execution strategies that maximize cache reuse and minimize execution makespan.
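The variation-tolerant hashing idea can be sketched simply: quantize floating-point values into tolerance-wide bins before hashing, so that runs whose outputs differ only by small non-deterministic noise hash identically. This is an illustrative stand-in, not the scheme developed in the dissertation.

    import hashlib
    import numpy as np

    def tolerant_hash(data, tol):
        # Quantize into tol-wide bins, then hash the integer codes; values that
        # sit almost exactly on a bin boundary can still flip, which real
        # variation-tolerant schemes must handle more carefully.
        codes = np.floor(data / tol).astype(np.int64)
        return hashlib.sha256(codes.tobytes()).hexdigest()

    rng = np.random.default_rng(0)
    run_a = rng.random(1000)
    run_b = run_a + 1e-12  # tiny run-to-run perturbation, well below tol
    print(tolerant_hash(run_a, 1e-3) == tolerant_hash(run_b, 1e-3))  # True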