ReStore checkpointing library

19 October 2022

Fault-tolerant distributed applications need a mechanism to recover data that is lost when a process fails. On modern cluster systems, it is typically impractical to request replacement resources after such a failure, so applications have to continue working with the remaining resources. This requires redistributing the workload and having the surviving processes reload the lost data. ReStore is a header-only C++ library for MPI programs that enables recovery of lost data after one or more process failures. Because it keeps all required data in memory, using an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. Because the application programmer specifies which data each process loads, ReStore also supports shrinking recovery, in which the surviving processes take over the lost work, as an alternative to recovery using spare compute nodes.
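To make the idea of replicated in-memory storage and shrinking recovery more concrete, here is a minimal single-process sketch. It does not use ReStore's API or MPI, and the placement rule is an assumption chosen for illustration, not necessarily the distribution ReStore uses. All names (holdersOf, survivingHolder, planShrinkingRecovery, Transfer) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// ASSUMED placement rule (illustrative only, not ReStore's actual scheme):
// copy k of block b is held by rank (b + k * stride) mod numRanks,
// where stride = numRanks / replicas. Copy 0 is the block's primary owner.
std::vector<int> holdersOf(std::size_t block, int numRanks, int replicas) {
    assert(replicas >= 1 && replicas <= numRanks);
    const std::size_t n = static_cast<std::size_t>(numRanks);
    const std::size_t stride = n / static_cast<std::size_t>(replicas);
    std::vector<int> holders;
    for (int k = 0; k < replicas; ++k) {
        holders.push_back(
            static_cast<int>((block + static_cast<std::size_t>(k) * stride) % n));
    }
    return holders;
}

// Returns a surviving rank that holds a copy of the block,
// or std::nullopt if every copy was on a failed rank.
std::optional<int> survivingHolder(std::size_t block, int numRanks, int replicas,
                                   const std::vector<bool>& alive) {
    for (int rank : holdersOf(block, numRanks, replicas)) {
        if (alive[static_cast<std::size_t>(rank)]) return rank;
    }
    return std::nullopt;
}

// One data movement in the recovery plan: `target` loads `block` from `source`.
struct Transfer {
    std::size_t block;
    int source;
    int target;
};

// Shrinking recovery: blocks whose primary owner failed are reassigned
// round-robin to the surviving ranks; no spare nodes are involved.
std::vector<Transfer> planShrinkingRecovery(std::size_t numBlocks, int numRanks,
                                            int replicas,
                                            const std::vector<bool>& alive) {
    std::vector<int> survivors;
    for (int r = 0; r < numRanks; ++r) {
        if (alive[static_cast<std::size_t>(r)]) survivors.push_back(r);
    }
    std::vector<Transfer> plan;
    if (survivors.empty()) return plan;
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const int primary = holdersOf(b, numRanks, replicas).front();
        if (alive[static_cast<std::size_t>(primary)]) continue;  // nothing lost
        const std::optional<int> source = survivingHolder(b, numRanks, replicas, alive);
        if (!source) continue;  // all copies lost; this block cannot be recovered
        plan.push_back({b, *source, survivors[b % survivors.size()]});
    }
    return plan;
}
```

In this sketch, the caller decides which surviving rank takes over each lost block, which mirrors the point that the application chooses what data to load. In a real MPI program, each Transfer would correspond to a message from the source rank to the target rank.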

See more here:

ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms; Lukas Hübner, Demian Hespe, Peter Sanders, Alexandros Stamatakis; Preprint at arXiv:


