2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Abstract

Checkpoint/Restart (C/R) is a widely used fault tolerance mechanism in converged cloud, edge, and HPC systems. However, users typically rely on their own experience to decide which variables to checkpoint, because no benchmark currently provides a reference. As a result, they may checkpoint redundant or even incorrect variables. To address this issue, we propose a benchmark suite covering 20 representative HPC applications. The suite provides the manually identified critical variables for checkpointing in these applications, together with a method for identifying such variables. The method analyzes data dependencies between variables to identify the critical variables analytically. We verify the correctness of the identified variables through an ablation study using FTI, a widely used C/R library. With this benchmark suite and data dependency analysis, HPC practitioners have a reference for identifying checkpointing variables and a better understanding of which kinds of variables to checkpoint.
