2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Abstract

With the rapid scale-out of supercomputers comes a correspondingly higher failure frequency. Fault-tolerant methods have evolved to adapt to high failure rates, but the behavior of MPI, the most widely used scalable programming middleware, is insufficient when confronting such failures. We present FA-MPI (Fault-Aware MPI), a set of extensions to the MPI standard designed to enable applications to implement a wide range of fault-tolerant methods. FA-MPI introduces transactional concepts to the MPI programming model for the first time to address failure detection, isolation, mitigation, and recovery via application-driven policies. To reach the maximum achievable performance of these scalable machines, overlapping communication and I/O with computation through non-blocking operations (while reducing jitter) is a design theme of growing importance. Therefore, we emphasize fault-tolerant, non-blocking communication operations combined with a set of nestable, lightweight, transactional TryBlock API extensions architected to exploit system and application hierarchy for both failure detection and recovery. The goal is to enable applications to run to completion with higher probability than otherwise. Scaling up and out and fault-free overhead are key concerns that can be managed by tuning transaction granularity; we provide a simulation of FA-MPI in a 3D stencil program to illustrate this. Supported failure models include but are not limited to process failures, a key difference from other proposed fault-tolerant extensions to MPI. Restriction to non-blocking operations is a current limitation compared with other proposed approaches insofar as legacy applications are concerned, but FA-MPI aligns well with forward-looking applications targeting Exascale. Tools to evolve legacy MPI programs to this fault-aware paradigm will soon bridge that portability gap.
