Session
14th Workshop on Fault-Tolerance for HPC at eXtreme Scale (FTXS 2024)
Session Chairs
Description: Increases in the number, variety, and complexity of components required to compose next-generation extreme-scale systems mean that systems will experience significant increases in aggregate fault rates, fault diversity, and fault complexity. Additionally, the growing importance of AI/ML workloads, increasing system heterogeneity, and the emergence of novel computing paradigms (neuromorphic, quantum) introduce fault tolerance issues that the research community has only begun to address. Given the continued need for research on fault tolerance in extreme-scale systems, the 14th Workshop on Fault-Tolerance for HPC at eXtreme Scale (FTXS 2024) will give researchers in fault tolerance, resilience, and reliability from academic, government, and industrial institutions an opportunity to share, discuss, and evaluate innovative research ideas. Building on the success of the previous editions of the FTXS workshop, we will assemble quality publications and a featured speaker to facilitate a lively and thought-provoking group discussion.
Event Type: Workshop
Time: Friday, 22 November 2024, 8:30am - 12pm EST
Location: B306
Distributed Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Presentations
8:30am - 8:32am EST | FTXS 2024: Opening Remarks
8:32am - 9:30am EST | Error-Resilient Machine Learning for HPC: Challenges and Opportunities
9:30am - 10:00am EST | Achieving High-Performance Fault-Tolerant Routing in HyperX Interconnection Networks
10:00am - 10:30am EST | FTXS - Morning Break
10:30am - 11:00am EST | From Failure to Insight: Analyzing Disk Breakdowns in Large-Scale HPC Environments
11:00am - 11:30am EST | Octopus: Experiences with a Hybrid Event-Driven Architecture for Distributed Scientific Computing
11:30am - 11:59am EST | Checkpointing Strategies for a Fixed-Length Execution
11:59am - 12:00pm EST | FTXS 2024: Closing Remarks