Presentation

Distributed, Resilient and In-Memory Storage of Key-Value Data for HPC
Description

Checkpointing, i.e., regularly saving relevant data in a resilient store, is a common approach to protect programs against hardware failures on clusters. While existing checkpointing libraries, such as VeloC, focus on iterative applications, Asynchronous Many-Task (AMT) programs pose specific requirements that affect the design of this store.

Many AMT runtimes use independent worker processes that balance their load via work stealing. The workers naturally write separate checkpoints autonomously at their respective task boundaries. To keep them in sync, many small write operations are performed at unpredictable time intervals. Reads, on the other hand, are rare. Recovery can be localized, but then involves complicated protocols and transactions.

This talk will elaborate on the specific features that AMT checkpointing and recovery require from a resilient store. We'll discuss existing storage solutions, none of which seems sufficient yet. We'll argue that a distributed, in-memory key-value store may be most appropriate.
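
To make the access pattern concrete, here is a minimal single-process sketch of the idea, not the design presented in the talk. It simulates a replicated, in-memory key-value store that workers write to at each task boundary. All names (`ReplicatedStore`, `run_worker`, the `worker<id>` key scheme) and the choice of two replicas on neighboring nodes are illustrative assumptions.

```python
import hashlib


class ReplicatedStore:
    """Toy in-memory key-value store spread over several simulated nodes.

    Hypothetical illustration only: real stores add networking,
    consistency protocols, and re-replication after failures.
    """

    def __init__(self, num_nodes=4, replicas=2):
        if not 1 <= replicas <= num_nodes:
            raise ValueError("replicas must be between 1 and num_nodes")
        self.nodes = [dict() for _ in range(num_nodes)]
        self.alive = [True] * num_nodes
        self.replicas = replicas

    def _owners(self, key):
        # A stable hash picks the first owner; the following nodes hold copies.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Write to every live owner; fail if no owner is reachable.
        written = 0
        for node in self._owners(key):
            if self.alive[node]:
                self.nodes[node][key] = value
                written += 1
        if written == 0:
            raise RuntimeError(f"no live replica for key {key!r}")

    def get(self, key):
        # Read from the first live owner that still holds the key.
        for node in self._owners(key):
            if self.alive[node] and key in self.nodes[node]:
                return self.nodes[node][key]
        raise KeyError(key)

    def fail(self, node):
        # Simulate a node crash: its in-memory contents are lost.
        self.alive[node] = False
        self.nodes[node].clear()


def run_worker(store, worker_id, tasks):
    """Process tasks and checkpoint (tasks done, partial result) after each."""
    partial = 0
    for index, task in enumerate(tasks):
        partial += task  # stand-in for executing the task
        # Task boundary: one small write per finished task.
        store.put(f"worker{worker_id}", (index + 1, partial))
    return partial


store = ReplicatedStore(num_nodes=4, replicas=2)
run_worker(store, 0, [1, 2, 3])
run_worker(store, 1, [10, 20])

# Crash the primary owner of worker 0's checkpoint, then recover from the copy.
store.fail(store._owners("worker0")[0])
tasks_done, partial = store.get("worker0")
```

In this sketch each worker overwrites a single small entry after every task, which mirrors the write-heavy, read-rare pattern described above. The only read happens during recovery, when the surviving replica returns the last checkpoint of the affected worker.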