ReStore checkpointing library

24. January 2023

Fault-tolerant distributed applications require mechanisms to recover data lost via a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure. Therefore, applications have to continue working with the remaining resources. This requires redistributing the workload and that the non-failed processes reload the lost data. ReStore is a C++ header-only library for MPI programs that enables recovery of lost data after (a) process failure(s). By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As you as the application programmer can specify which data to load ReStore also supports shrinking recovery instead of recovery using spare compute nodes.

See more here:

L. Hübner, D. Hespe, P. Sanders and A. Stamatakis, “ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms,” 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), Dallas, TX, USA, 2022, pp. 24-35, doi: 10.1109/FTXS56515.2022.00008.

About HITS

The Heidelberg Institute for Theoretical Studies (HITS) was established in 2010 by the physicist and SAP co-founder Klaus Tschira (1940-2015) and the Klaus Tschira Foundation as a private, non-profit research institute. HITS conducts basic research in the natural sciences, mathematics and computer science, with a focus on the processing, structuring, and analyzing of large amounts of complex data and the development of computational methods and software. The research fields range from molecular biology to astrophysics. The shareholders of HITS are the HITS-Stiftung, which is a subsidiary of the Klaus Tschira Foundation, Heidelberg University and the Karlsruhe Institute of Technology (KIT). HITS also cooperates with other universities and research institutes and with industrial partners. The base funding of HITS is provided by the HITS Stiftung with funds received from the Klaus Tschira Foundation. The primary external funding agencies are the Federal Ministry of Education and Research (BMBF), the German Research Foundation (DFG), and the European Union.

Switch to the German homepage or stay on this page