2024 IEEE 31st International Conference on High Performance Computing, Data and Analytics Workshop (HiPCW)

Abstract

Throughout the modern computing era, the complexity of high-performance computing systems has increased exponentially, both in the number of nodes within a system and in the number of transistors. This growth in complexity raises the probability of failures that affect a single node or a single element on a node. If these failures are not properly mitigated, a single failure may disrupt or halt the entire system. To address these potential failures, modern systems implement a form of checkpointing, whereby the system state is saved periodically so that execution can restart from it if a failure occurs. However, synchronously checkpointing an entire system before a failure can be costly: nodes must wait, which often causes a significant decrease in application performance. In this work, we introduce Continuous Checkpointing for HPC Systems, which creates an asynchronous continuous checkpoint and, if a failure occurs, reverts to a consistent global checkpoint. We utilize emerging Compute Express Link (CXL) memory to lazily post computational state to nearby nodes, allowing users to trade space for time efficiently. Our methods show a significant increase in system utilization over traditional checkpointing, and our lightweight method performs close to an unsafe system with no checkpoints.