Abstract
The MapReduce computing paradigm has gained tremendous popularity for large-scale data-intensive applications in recent years. Hadoop, an open-source implementation of MapReduce, can be set up easily and rapidly on commodity hardware to form a massive computing cluster. In such a cluster, task and node failures are not anomalies, and they can substantially degrade Hadoop's performance. Although Hadoop restarts failed tasks automatically and compensates for slow tasks through speculative execution, many researchers have identified shortcomings in Hadoop's fault tolerance. In this work, we try to address these shortcomings by designing a simple checkpointing mechanism for Map tasks and by using a revised criterion for identifying slow tasks. Specifically, our checkpointing mechanism saves the partial output produced by the Mappers, and our criterion for identifying slow tasks accounts for tasks whose progress rates vary over time. Although preliminary simulations show only a marginal performance improvement over native Hadoop and the LATE scheduler, we believe that our approaches have the potential to offer greater performance gains on real workloads.
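For context, the sketch below illustrates the standard LATE-style slow-task estimate that our revised criterion builds on: a task's progress rate is its progress score divided by its running time, and its remaining time is the remaining work divided by that rate. This is a minimal illustration only; the class and method names are assumptions for exposition, and the revision that handles variable progress rates is not shown here.

    // Minimal sketch of the LATE-style slow-task estimate (illustrative names only).
    public final class SlowTaskEstimator {

        // Progress rate = fraction of work completed per second of run time.
        static double progressRate(double progressScore, double elapsedSeconds) {
            return progressScore / elapsedSeconds;
        }

        // Estimated time left = remaining work divided by the observed progress rate.
        static double estimatedTimeLeft(double progressScore, double elapsedSeconds) {
            return (1.0 - progressScore) / progressRate(progressScore, elapsedSeconds);
        }

        public static void main(String[] args) {
            // Example: a task that is 40% done after 120 s has rate ~0.0033/s and ~180 s left.
            System.out.printf("rate=%.5f per s, time left=%.1f s%n",
                    progressRate(0.4, 120), estimatedTimeLeft(0.4, 120));
        }
    }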