Lightweight Checkpointing of Loop-Based Kernels Using Disaggregated Memory


Checkpoint/Restart (C/R) is usually used to provide fault tolerance in HPC systems, where the increasing number of hardware components and complexity of software stack increase overall fault rates. Meanwhile, the ever growing power consumption of HPC systems has fostered the exploration of using renewable energy to power the systems. Due to the variable nature of renewable energy sources, such as wind and solar, jobs running on HPC Systems need to be suspended or migrated to fit within the power availability. C/R, in such cases, is also indispensable. Typically, C/R techniques periodically save the status of running processes to checkpoint files stored on a parallel file system. However, this process is expensive; a single checkpoint can take the order of tens of minutes, rendering checkpointing less practical. We propose a lightweight checkpointing technique that uses disaggregated memory as a checkpoint target for loop-based kernels. The disaggregated memory is a concept similar to disaggregated storage; it provides a shared memory pool for compute nodes. With the xBGAS-enabled remote memory accessibility, we hope to checkpoint the necessary data objects on remote memory with minimum overhead and to be able to restart the application from any available compute nodes. In this seminar talk, we will summarize current Checkpoint/Restart techniques, present the concept of the disaggregated memory, and discuss about the proposed C/R, which we name it CRaaS (C/R as a Service). Finally, we will use a loop-based kernel to illustrate the design of CRaaS APIs.

Download slides here