Abstract
Scientific Workflow Management Systems (S-WFMS), such as Kepler, have proven to be important tools in scientific problem solving. Interestingly, S-WFMS fault tolerance and failure recovery are still an open topic. They often involve classic fault-tolerance mechanisms, such as alternative versions and rollback with re-runs, as well as reliance on the fault-tolerance capabilities provided by subcomponents and lower layers such as schedulers, Grid and cloud resources, or the underlying operating systems. When failures occur at the underlying layers, a workflow system sees them as failed steps in the process, but frequently without additional detail. This limits the ability of an S-WFMS to recover from failures. We describe a lightweight end-to-end S-WFMS fault-tolerance framework, developed to handle failure patterns that occur in some real-life scientific workflows. Capabilities and limitations of the framework are discussed and assessed using simulations. The results show that the solution considerably increases workflow reliability and execution-time stability.