Abstract
The Seer Suite digital library search engine framework is used to build tools such as CiteSeerx. It includes a complex metadata extraction system that extracts elements such as author names, titles, citations, and citation contexts, which are crucial bibliometric data and are needed to build a citation graph. The workload faced by the extractor is dynamic, and this variability makes CiteSeerx attractive for hosting in a cloud computing environment. However, because of its application binary dependencies and its reliance on specialized infrastructure, the current extractor has several limitations. These limitations motivated the design and implementation of the metadata extraction system proposed in this study. The proposed system uses a message-oriented middleware architecture with a publish/subscribe pattern to provide a scalable, flexible system that can be deployed across a range of cloud infrastructures. To demonstrate its broad applicability, we evaluate a reference implementation across different deployment scenarios and with regard to its scalability.