Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase

Artem Chebotko; John Abraham; Pearl Brazier; Anthony Piazza; Andrey Kashlev; Shiyong Lu

doi:10.1109/SERVICES.2013.32

Abstract

Provenance, which records the history of an in-silico experiment, has been identified as an important requirement for scientific workflows to support scientific discovery reproducibility, result interpretation, and problem diagnosis. Large provenance datasets are composed of many smaller provenance graphs, each of which corresponds to a single workflow execution. In this work, we explore and address the challenge of efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs in an Apache HBase database. Specifically, we propose: (i) novel storage and indexing techniques for RDF data in HBase that are better suited to provenance datasets than to generic RDF graphs, and (ii) novel SPARQL query evaluation algorithms that rely solely on indices to compute expensive join operations, operate on numeric values that represent triple positions rather than on the triples themselves, and eliminate the need for intermediate data transfers over a network. An empirical evaluation of our algorithms using the provenance datasets and queries of the University of Texas Provenance Benchmark confirms that our approach is efficient and scalable.
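
To make the idea of working with triple positions rather than actual triples more concrete, the following is a minimal in-memory sketch, not the paper's HBase schema or algorithms. It assumes one small provenance graph stored as a list of triples, with one index per triple component mapping each term to the set of positions where it occurs. The triples, the function names `build_indices` and `match_positions`, and the index layout are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical provenance triples for a single workflow execution (illustrative only).
TRIPLES = [
    ("run1", "used", "data1"),
    ("run1", "wasControlledBy", "alice"),
    ("data2", "wasGeneratedBy", "run1"),
    ("run2", "used", "data2"),
]


def build_indices(triples):
    """Build subject, predicate, and object indices: term -> set of triple positions."""
    s_idx, p_idx, o_idx = defaultdict(set), defaultdict(set), defaultdict(set)
    for pos, (s, p, o) in enumerate(triples):
        s_idx[s].add(pos)
        p_idx[p].add(pos)
        o_idx[o].add(pos)
    return s_idx, p_idx, o_idx


def match_positions(indices, s=None, p=None, o=None):
    """Return positions of triples matching a pattern; None marks an unbound term.

    Only the indices are consulted: the triples themselves are never read here.
    """
    s_idx, p_idx, o_idx = indices
    sets = [
        idx.get(term, set())
        for idx, term in ((s_idx, s), (p_idx, p), (o_idx, o))
        if term is not None
    ]
    if not sets:
        raise ValueError("at least one term must be bound in this sketch")
    # Combining constraints is an intersection over small sets of integers.
    return set.intersection(*sets)


indices = build_indices(TRIPLES)
positions = match_positions(indices, p="used", o="data2")
# Triples are materialized only once, for the final answer.
results = [TRIPLES[pos] for pos in sorted(positions)]
```

In this sketch the pattern `(?x, used, data2)` is resolved entirely by intersecting sets of integers, and the triple list is touched only to produce the final result. How the paper lays such indices out in HBase and evaluates full SPARQL joins is not specified in the abstract and is not modeled here.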
