High Performance Computing and Communication & IEEE International Conference on Embedded Software and Systems, IEEE International Conference on

Abstract

The Grid is a heterogeneous and dynamic environment that enables distributed computation, which also makes it prone to failures. Some related work uses replication to overcome failures in sets of independent tasks and in workflow applications, but it does not consider possible resource limitations when scheduling the replicas. In this paper, we focus on task replication techniques for workflow applications, aiming not only to tolerate failures during an execution but also to speed up the computation, without requiring the user to implement application-level checkpointing, which can be difficult depending on the application. We also study what to do when there are not enough resources to replicate all running tasks: we assign replication priorities based on the graph of the workflow application, giving higher priority to tasks with a higher output degree. We have implemented the proposed policy in the GRID superscalar system and have run the fastDNAml application as an experiment to show that these objectives are met. Finally, we have identified and studied a problem that may arise from the use of replication in workflow applications: the replication wait time.
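The prioritization idea can be illustrated with a minimal sketch. It is not the GRID superscalar implementation; the function name `pick_tasks_to_replicate`, the `successors` mapping, and the `free_slots` parameter are hypothetical names introduced here, and the sketch assumes that each replica occupies exactly one free resource.

```python
from typing import Dict, List


def pick_tasks_to_replicate(
    successors: Dict[str, List[str]],
    running: List[str],
    free_slots: int,
) -> List[str]:
    """Choose which running tasks to replicate when resources are limited.

    successors maps each task to the tasks that consume its output, so
    len(successors[task]) is the task's output degree in the workflow graph.
    Assumption (not from the paper): one replica uses one free slot.
    """
    if free_slots <= 0:
        return []
    # Rank by output degree, highest first. Python's sort is stable, so
    # tasks with equal degree keep their original order in `running`.
    ranked = sorted(
        running,
        key=lambda task: len(successors.get(task, [])),
        reverse=True,
    )
    return ranked[:free_slots]


# Hypothetical workflow: "a" feeds three tasks, "b" and "d" feed one each.
workflow = {"a": ["c", "d", "e"], "b": ["e"], "c": [], "d": ["f"]}
chosen = pick_tasks_to_replicate(workflow, ["b", "a", "d"], free_slots=2)
```

With two free slots, the task with the most dependents is replicated first, since its failure or delay would block the largest number of successor tasks.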
