Fault-tolerant distributed applications require mechanisms to recover data that is lost when a process fails. On modern cluster systems, it is typically impractical to request replacement resources after such a failure, so applications have to continue working with the remaining resources. This requires redistributing the workload and having the surviving processes reload the lost data. ReStore is a C++ header-only library for MPI programs that enables recovery of lost data after one or more process failures. By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. Because the application programmer specifies which data each process loads, ReStore also supports shrinking recovery, in which the surviving processes take over the workload, as an alternative to recovery onto spare compute nodes.
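The idea of replicated in-memory storage with shrinking recovery can be illustrated with a small, self-contained C++17 sketch. It does not use ReStore's API or MPI, and it does not reproduce ReStore's actual data distribution. The names Placement, Transfer and planShrinkingRecovery, the strided placement rule, and the round-robin reassignment are all assumptions made for illustration only. The sketch places each block's replicas on distinct ranks and, after a failure, determines for every block a surviving rank that still holds a copy and a surviving rank that takes over the work.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical placement rule (an assumption, not ReStore's scheme):
// replica k of block b is held by rank (b + k * stride) % numRanks,
// where stride = numRanks / replicas. For replicas <= numRanks this
// puts the copies of one block on distinct ranks.
struct Placement {
    std::size_t numRanks;
    std::size_t replicas;

    std::size_t holder(std::size_t block, std::size_t k) const {
        assert(replicas > 0 && replicas <= numRanks && k < replicas);
        const std::size_t stride = numRanks / replicas;
        return (block + k * stride) % numRanks;
    }
};

// One entry of the recovery plan for a single block.
struct Transfer {
    std::size_t block;
    std::size_t source; // surviving rank that still holds a replica
    std::size_t dest;   // surviving rank that works on the block from now on
};

// Plans a shrinking recovery: the workload is spread round-robin over the
// surviving ranks only. Returns std::nullopt if no rank survives or if
// every replica of some block was lost.
std::optional<std::vector<Transfer>> planShrinkingRecovery(
    const Placement& p, std::size_t numBlocks, const std::vector<bool>& alive) {
    assert(alive.size() == p.numRanks);

    std::vector<std::size_t> survivors;
    for (std::size_t r = 0; r < p.numRanks; ++r) {
        if (alive[r]) survivors.push_back(r);
    }
    if (survivors.empty()) return std::nullopt;

    std::vector<Transfer> plan;
    plan.reserve(numBlocks);
    for (std::size_t b = 0; b < numBlocks; ++b) {
        // Find the first replica of block b that lives on a surviving rank.
        std::optional<std::size_t> source;
        for (std::size_t k = 0; k < p.replicas && !source; ++k) {
            const std::size_t r = p.holder(b, k);
            if (alive[r]) source = r;
        }
        if (!source) return std::nullopt; // all replicas of b were lost
        plan.push_back({b, *source, survivors[b % survivors.size()]});
    }
    return plan;
}
```

In this toy model, 4 ranks with 2 replicas per block tolerate the failure of rank 1 alone, because every block has a second copy two ranks away. If ranks 1 and 3 fail together, both copies of block 1 are gone and the function reports that recovery is impossible. A real implementation would additionally have to detect failures and move the data between processes, which this sketch leaves out.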
See more here: https://github.com/ReStoreCpp/ReStore
L. Hübner, D. Hespe, P. Sanders and A. Stamatakis, “ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms,” 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), Dallas, TX, USA, 2022, pp. 24-35, doi: 10.1109/FTXS56515.2022.00008. https://ieeexplore.ieee.org/document/10024016
HITS, the Heidelberg Institute for Theoretical Studies, was established in 2010 by physicist and SAP co-founder Klaus Tschira (1940-2015) and the Klaus Tschira Foundation as a private, non-profit research institute. HITS conducts basic research in the natural, mathematical, and computer sciences. Major research directions include complex simulations across scales, making sense of data, and enabling science via computational research. Application areas range from molecular biology to astrophysics. An essential characteristic of the Institute is interdisciplinarity, implemented in numerous cross-group and cross-disciplinary projects. The base funding of HITS is provided by the Klaus Tschira Foundation.