Abstract
The link structure of the World Wide Web (WWW) has attracted considerable research interest in recent years, and several papers have been published on the structural analysis of hyperlinked environments such as the WWW. The WWW can be modeled as a graph, and valuable information can be derived by analyzing the links between web pages, primarily for the purpose of building better search engines. Many novel methods have been presented for discovering communities and authoritative web pages on the WWW. Citation analysis is a well-studied branch of information science. It concerns the analysis of citations among articles and research papers in a scholarly field in order to derive useful information, and it has primarily been used to quantify and judge the impact of a paper or a journal. The work presented in this paper lies at the intersection of these two fields: structural analysis of the WWW and citation analysis. We present a method for classifying documents that contain references (such as articles and patents) into a class or topic based on their link structure, references, and citations. The method first analyzes the link structure of a corpus to identify authoritative papers, which a domain expert then labels manually after going through the respective documents. Next, papers related to the authoritative papers are identified using citation analysis. The authoritative papers, their class labels, and their related papers constitute a model. Papers whose class label is to be determined are then classified based on this model.
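The pipeline outlined above can be illustrated with a minimal sketch. The abstract does not specify the algorithms involved, so the choices here are assumptions made only for illustration: authority is scored with a HITS-style power iteration, "related" papers are taken to be those that directly cite or are cited by an authoritative paper, and a new paper is assigned the class whose paper set overlaps most with its reference list. All function names and the toy corpus are hypothetical.

```python
def authority_scores(citations, iterations=50):
    """HITS-style authority scores (an assumed stand-in for the link analysis).

    `citations` maps each paper id to the list of paper ids it cites.
    """
    nodes = set(citations)
    for refs in citations.values():
        nodes.update(refs)
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # A paper's authority is the sum of the hub scores of papers citing it.
        auth = {n: 0.0 for n in nodes}
        for paper, refs in citations.items():
            for ref in refs:
                auth[ref] += hub[paper]
        # A paper's hub score is the sum of the authority scores it cites.
        hub = {n: sum(auth[r] for r in citations.get(n, ())) for n in nodes}
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return auth


def build_model(citations, labels):
    """Map each class label to its authoritative papers plus related papers.

    `labels` holds the expert-assigned class of each authoritative paper.
    Assumption: related papers are direct citers and direct references.
    """
    model = {}
    for authority, label in labels.items():
        related = {authority, *citations.get(authority, ())}
        related.update(p for p, refs in citations.items() if authority in refs)
        model.setdefault(label, set()).update(related)
    return model


def classify(references, model):
    """Return the label whose paper set overlaps most with `references`."""
    refs = set(references)
    overlap = {label: len(refs & papers) for label, papers in model.items()}
    best = max(overlap, key=overlap.get, default=None)
    return best if best is not None and overlap[best] > 0 else None


# Toy corpus (hypothetical): two small citation clusters around "A" and "B".
corpus = {
    "p1": ["A"], "p2": ["A"], "p3": ["A", "p1"],
    "q1": ["B"], "q2": ["B"], "q3": ["B", "q1"],
}
scores = authority_scores(corpus)
top_two = sorted(scores, key=scores.get, reverse=True)[:2]  # "A" and "B"
# In the described method, a domain expert assigns these labels manually.
expert_labels = {"A": "topic-a", "B": "topic-b"}
model = build_model(corpus, expert_labels)
predicted = classify(["A", "p2", "q1"], model)  # "topic-a"
```

In this sketch the manual labeling step appears only as the `expert_labels` dictionary. A paper whose references overlap with no class is left unlabeled (`None`) rather than forced into a class.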