2017 International Conference on Networking, Architecture, and Storage (NAS)

Abstract

Coordinated checkpointing is a widely used checkpoint/restart (CPR) technique for fault tolerance in large-scale HPC systems. However, it concentrates a massive amount of I/O on the storage system, resulting in high checkpoint overhead and high energy consumption. This paper focuses on multi-level checkpointing, which uses several kinds of fast but less reliable storage to reduce how often checkpoints must be written to the parallel file system (PFS). The paper presents an energy model of multi-level checkpointing and proposes an iterative algorithm that minimizes energy consumption by optimizing the checkpoint interval of each level and selecting the best combination of checkpoint levels. The algorithm is shown to be fast and effective, converging in a relatively small number of iteration steps. The paper also shows that it is unnecessary to use all the available checkpoint levels in a multi-level CPR mechanism: selectively using only the appropriate checkpoint levels improves energy efficiency by 9 to 21%.
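The idea of choosing both the checkpoint intervals and the subset of levels can be illustrated with a small sketch. It is not the paper's energy model or its iterative algorithm, neither of which is given in the abstract. Instead it assumes a simple first-order (Young-style) model in which each selected level pays a checkpoint cost and a rework cost, and it enumerates level subsets by brute force. All names and parameters (`interval_and_power`, `best_levels`, the per-level tuples, `compute_power`) are hypothetical, failure rates are assumed strictly positive, and the most reliable level is assumed to be always selected.

```python
import math
from itertools import combinations


def interval_and_power(ckpt_time, ckpt_power, fail_rate, compute_power):
    """Minimize ckpt_power*ckpt_time/tau + compute_power*fail_rate*tau/2 over tau.

    Assumed first-order model: the first term is the average power spent
    writing checkpoints, the second the average power spent redoing lost work.
    Returns the minimizing interval (seconds) and the resulting overhead (watts).
    """
    tau = math.sqrt(2.0 * ckpt_power * ckpt_time / (compute_power * fail_rate))
    overhead = math.sqrt(2.0 * ckpt_power * ckpt_time * compute_power * fail_rate)
    return tau, overhead


def best_levels(levels, compute_power):
    """Pick the subset of checkpoint levels with the lowest modeled overhead.

    levels: list of (ckpt_time_s, ckpt_power_w, fail_rate_per_s), ordered from
    the fastest, least reliable level to the most reliable one (e.g. the PFS).
    A failure of class j is assumed recoverable from any selected level >= j.
    Returns (total_overhead_w, chosen_level_indices, {level: interval_s}).
    """
    top = len(levels) - 1  # the most reliable level is always kept
    best = None
    for k in range(top + 1):
        for lower in combinations(range(top), k):
            chosen = list(lower) + [top]
            total, intervals, prev = 0.0, {}, -1
            for s in chosen:
                # A selected level absorbs the failure classes that the
                # skipped levels just below it would otherwise have handled.
                rate = sum(levels[j][2] for j in range(prev + 1, s + 1))
                tau, overhead = interval_and_power(
                    levels[s][0], levels[s][1], rate, compute_power
                )
                intervals[s] = tau
                total += overhead
                prev = s
            if best is None or total < best[0]:
                best = (total, chosen, intervals)
    return best
```

Under this toy model, a lower level is worth keeping only when it is much cheaper than the level above it and handles a meaningful share of failures. For example, `best_levels([(10, 100, 1e-6), (12, 100, 1e-4)], 100)` selects only the upper level, while `best_levels([(1, 100, 1e-3), (100, 100, 1e-5)], 100)` selects both.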
