ART: Robustness of Meshes and Tori for Parallel and Distributed Computation

Chi-Hsiang Yeh; Behrooz Parhami

doi:10.1109/ICPP.2002.1040903

Abstract

In this paper, we formulate the array robustness theorems (ARTs) for efficient computation and communication on faulty arrays. No hardware redundancy is required and no assumption is made about the availability of a complete submesh or subtorus. Based on ARTs, a very wide variety of problems, including sorting, FFT, total exchange, permutation, and some matrix operations, can be solved with a slowdown factor of 1 + o(1). The number of faults tolerated by ARTs ranges from o(\min(n^{1-\frac{1}{d}}, \frac{n}{d}, \frac{n}{h})) for n-ary d-cubes with worst-case faults to as large as o(N) for most N-node 2-D meshes or tori with random faults, where h is the number of data items per processor. The resultant running times are the best results reported thus far for solving many problems on faulty arrays. Based on ARTs and several other components such as robust libraries, the priority emulation discipline, and X1Y1 routing, we introduce the robust adaptation interface layer (RAIL) as a middleware between ordinary algorithms/programs (that are originally developed for fault-free arrays) and the faulty network/hardware. In effect, RAIL provides a virtual fault-free network to higher layers, while ordinary algorithms/programs are transformed through RAIL into corresponding robust algorithms/programs that can run on faulty networks.
