Abstract
Partial or total duplication of document content is common in large digital libraries. In this paper, we present a copy detection system that automates the detection of duplication in digital documents. The system is sentence-based and makes three contributions: it proposes an intuitive definition of similarity between documents; it produces the distribution of overlap between documents that share content; and it remains accurate despite large variations in document size. We report the results of several experiments that illustrate the behavior and functionality of the system.