Towards Continuous Checkpointing for HPC Systems Using CXL

Ellis Giles; Peter Varman

doi:10.1109/HiPCW63042.2024.00026

Abstract

Throughout the modern computing era, the complexity of high-performance computing systems has increased exponentially, both in terms of the number of nodes contained within a system and the number of transistors. This growth in complexity raises the probability of failures that may affect a single node or a single element on a node. If these failures are not properly mitigated, a single failure may disrupt or halt the entire system. To address these potential failures, modern systems implement a form of checkpointing, whereby the system state is saved and, if a failure occurs, the system is restarted from the saved state. However, synchronously checkpointing an entire system before a failure can be costly, requiring nodes to wait and often causing a significant decrease in application performance. In this work, we introduce Continuous Checkpointing for HPC Systems, which creates an asynchronous continuous checkpoint and, if a failure occurs, reverts to a consistent global checkpoint. We utilize emerging Compute Express Link (CXL) memory to lazily post computational state to nearby nodes and allow users to trade space for time efficiently. Our methods show a significant increase in system utilization over traditional checkpointing, and our lightweight method performs close to an unsafe system with no checkpoints.
