Benchmarking Variables for Checkpointing in HPC Applications

Xiang Fu; Xin Huang; Wubiao Xu; Weiping Zhang; Shiman Meng; Luanzheng Guo; Kento Sato

doi:10.1109/IPDPSW63119.2024.00090

2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Year: 2024, Pages: 406-413

Authors

Xiang Fu, Nanchang Hangkong University
Xin Huang, Nanchang Hangkong University
Wubiao Xu, Nanchang Hangkong University
Weiping Zhang, Nanchang Hangkong University
Shiman Meng, Nanchang Hangkong University
Luanzheng Guo, Pacific Northwest National Laboratory
Kento Sato, RIKEN R-CCS

Abstract

Checkpoint/Restart (C/R) is a widely used fault tolerance mechanism in converged systems of cloud, edge, and HPC. However, users often rely on their own experience to decide which variables to checkpoint, because no benchmark currently provides a reference. This can result in checkpointing redundant or even incorrect variables. To address this issue, we propose a benchmark suite covering 20 representative HPC applications that includes the manually identified critical variables for checkpointing, along with a method for identifying those critical variables. Our method analyzes data dependencies between variables to identify critical variables analytically. We verify the correctness of the identified variables through an ablation study with FTI, a widely used C/R library. With our benchmark suite and data dependency analysis, HPC practitioners now have a reference for identifying checkpointing variables and a better understanding of which kinds of variables to checkpoint.
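
The abstract does not spell out the analysis itself, so the following is only a minimal sketch of one way a data dependency view can flag candidate checkpoint variables, not the authors' method. It assumes a simplified model in which the main loop body is a straight-line list of assignments, and it treats a variable as a candidate when it is read in an iteration before being redefined in that iteration and is also written somewhere in the loop (a loop-carried dependency). The function name, the statement encoding, and the toy solver loop are all hypothetical.

```python
from typing import Iterable

# Assumed encoding: each statement is (variable_written, variables_read),
# listed in program order within one iteration of the main loop.
Statement = tuple[str, frozenset[str]]


def loop_carried_variables(body: Iterable[Statement]) -> set[str]:
    """Return variables whose value flows from one iteration to the next.

    A variable qualifies when it is read before any write to it in the
    current iteration, and it is written somewhere in the loop body.
    Variables that are only read (e.g. constants) are skipped here, on the
    assumption that they can be re-initialized after a restart.
    """
    body = list(body)
    written_anywhere = {target for target, _ in body}
    defined_so_far: set[str] = set()
    carried: set[str] = set()
    for target, reads in body:
        for var in reads:
            if var not in defined_so_far and var in written_anywhere:
                carried.add(var)
        defined_so_far.add(target)
    return carried


# Hypothetical loop body of an iterative solver:
#   tmp = f(u, dt); u = g(u, tmp); step = step + 1; err = h(tmp)
toy_body = [
    ("tmp", frozenset({"u", "dt"})),
    ("u", frozenset({"u", "tmp"})),
    ("step", frozenset({"step"})),
    ("err", frozenset({"tmp"})),
]

print(sorted(loop_carried_variables(toy_body)))  # prints ['step', 'u']
```

In this toy loop, `tmp` and `err` are recomputed in every iteration and `dt` never changes, so only `u` and `step` would need to be saved. Real applications involve pointers, function calls, and control flow, which this sketch ignores.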
