Abstract
Many applications in industry and research analyze large, continuously evolving datasets. Big data analytics platforms such as MapReduce focus on distributed batch processing, so a query must be re-executed every time its input data evolve. In this paper, we present IncReStore, a system that incrementally computes queries on fast-growing datasets by materializing query outputs and maintaining them. IncReStore runs in two modes: (1) Opportunistic IncReStore generates compensating queries on the fly, during query execution, so that previously materialized query outputs can be reused even though the underlying data may have evolved since they were materialized; and (2) Active IncReStore automatically generates MapReduce jobs that update the materialized query outputs whenever the datasets they depend on evolve. We have implemented IncReStore as an extension to Pig and Hadoop. Our experimental evaluation of IncReStore using the TPC-H benchmark shows significant speedups.