Presentation

Silent Errors to Scientific Applications: Impacts of PFS Metadata Corruptions
Description: High-performance computing (HPC) applications, such as Nyx, QMCPACK, and Montage, depend on parallel file systems (PFS) like Lustre, BeeGFS, and PVFS for reliable and efficient data management and access. However, a PFS can fail due to hardware faults, software bugs, or power outages. These failures generally fall into two categories: fail-stop failures, which render the PFS unmountable or inaccessible, and partial failures, which compromise specific PFS components while leaving the system functional, potentially causing unnoticed damage or silent errors. Many studies have analyzed data corruption caused by both fail-stop and partial failures. However, they overlook potentially more complicated corruption in special data areas, particularly the metadata area of parallel file systems, which is the focus of this study.