Abstract
Near-duplicate web pages are pages that differ only slightly in content. They arise from exact replicas of an original site, mirrored sites, versioned sites, multiple representations of the same physical object, and plagiarized documents. Identifying similar or near-duplicate pages in a large collection is a significant problem with widespread applications. Here we propose a Term Document Weight (TDW) matrix based algorithm for finding near duplicates of an input web page in a huge repository. The algorithm has four phases: preprocessing, feature weighting, filtering, and verification. In the first phase, the system receives an input web page and a similarity threshold and preprocesses the page. In the second phase, feature weights are calculated using Analytic Combination Criteria (ACC). In the third phase, prefix filtering and positional filtering are applied to reduce the number of candidate records. In the verification phase, the system computes the similarity of the remaining candidates using the Minimum Weight Overlapping (MWO) method and returns an optimal set of near-duplicate web pages.
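The filter-then-verify structure of the last two phases can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from this abstract: pages are reduced to sets of lowercase whitespace-separated tokens, plain unweighted Jaccard similarity stands in for the ACC weights and the MWO measure (neither is defined here), tokens are ordered by ascending document frequency, and positional filtering is omitted. The names `tokenize`, `prefix`, and `find_near_duplicates` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Stand-in preprocessing: lowercase and split on whitespace."""
    return set(text.lower().split())


def jaccard(a, b):
    """Unweighted Jaccard similarity (an assumption; not the MWO measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix(tokens, doc_freq, threshold):
    """Return the prefix of a token set under a global rare-first ordering.

    For Jaccard threshold t, two sets can reach similarity >= t only if
    their prefixes of length |x| - ceil(t * |x|) + 1 share a token.
    """
    ranked = sorted(tokens, key=lambda tok: (doc_freq[tok], tok))
    keep = len(ranked) - math.ceil(threshold * len(ranked)) + 1
    return set(ranked[:keep])


def find_near_duplicates(query, repository, threshold):
    """Filter candidates by prefix overlap, then verify by exact similarity."""
    docs = {doc_id: tokenize(text) for doc_id, text in repository.items()}
    q = tokenize(query)

    # Document frequency over the repository plus the query gives one
    # consistent token ordering for every record.
    doc_freq = Counter(tok for toks in [*docs.values(), q] for tok in toks)
    q_prefix = prefix(q, doc_freq, threshold)

    results = []
    for doc_id, toks in docs.items():
        # Filtering: skip records whose prefix shares no token with the query.
        if not (q_prefix & prefix(toks, doc_freq, threshold)):
            continue
        # Verification: compute the actual similarity of the survivor.
        sim = jaccard(q, toks)
        if sim >= threshold:
            results.append((doc_id, sim))
    return sorted(results, key=lambda item: -item[1])


if __name__ == "__main__":
    repository = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumped over the lazy dog",
        "c": "completely unrelated text about cooking pasta",
    }
    query = "the quick brown fox jumps over the lazy dog"
    for doc_id, sim in find_near_duplicates(query, repository, 0.7):
        print(doc_id, round(sim, 3))
```

The filtering step is only a pruning device: it never changes which pages pass verification, it just avoids computing the full similarity for records that cannot reach the threshold.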