Document near-duplicate detection

Data processing: database and file management or data structures – Database and file access – Preparing data for information retrieval

Reexamination Certificate

Details

C707S738000, C707S748000

Reexamination Certificate

active

07962491

ABSTRACT:
A near-duplicate component includes a fingerprint creation component and a similarity detection component. The fingerprint creation component receives a document of arbitrary size and generates a compact “fingerprint” that describes the contents of the document. The similarity detection component compares multiple fingerprints based on the Hamming distance between the fingerprints. When the Hamming distance is below a threshold, the documents can be said to be near-duplicates of one another.
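
The two components described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the fingerprint is computed, so the simhash-style construction below is only an assumed stand-in, and the names `fingerprint`, `hamming_distance`, `is_near_duplicate`, the 64-bit width, and the threshold of 4 are all hypothetical choices made for illustration.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width; the abstract only says "compact"


def fingerprint(text: str, bits: int = FINGERPRINT_BITS) -> int:
    """Reduce a document of arbitrary size to a fixed-width integer.

    Assumption: a simhash-style bit vote over whitespace-separated tokens.
    This is an illustrative technique, not the method claimed in the patent.
    Works for bits <= 256 because each token hash comes from SHA-256.
    """
    votes = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            votes[i] += 1 if (token_hash >> i) & 1 else -1
    result = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            result |= 1 << i
    return result


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, threshold: int = 4) -> bool:
    """Treat two fingerprints as near-duplicates when their Hamming
    distance is below the (assumed) threshold."""
    return hamming_distance(a, b) < threshold
```

In this sketch, two documents would be compared by calling `is_near_duplicate(fingerprint(doc_a), fingerprint(doc_b))`. A suitable threshold depends on the fingerprint width and on how the fingerprint is built, so the default here should be read as a placeholder.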

REFERENCES:
patent: 4290105 (1981-09-01), Cichelli et al.
patent: 5745900 (1998-04-01), Burrows
patent: 5845005 (1998-12-01), Setlak et al.
patent: 6230155 (2001-05-01), Broder et al.
patent: 6286006 (2001-09-01), Bharat et al.
patent: 6351755 (2002-02-01), Najork et al.
patent: 6393438 (2002-05-01), Kathrow et al.
patent: 6658423 (2003-12-01), Pugh et al.
patent: 2002/0133499 (2002-09-01), Ward et al.
patent: 2003/0105716 (2003-06-01), Sutton et al.
patent: 2003/0208761 (2003-11-01), Wasserman et al.
patent: 2005/0210043 (2005-09-01), Manasse
patent: 2006/0036684 (2006-02-01), Schwerk
Broder, Andrei et al., “Clustering the Web”, http://research.compaq.com/src/articles/199707/cluster.html; Systems Research Center, Palo Alto, California, Feb. 4, 2004 (Print Date), 5 pages.
Narayanan Shivakumar et al.; SCAM: A Copy Detection Mechanism for Digital Documents; Proceedings of the 2nd International Conference on Theory and Practice of Digital Libraries; 1995; 9 pages.
Saul Schleimer; Winnowing: Local Algorithms for Document Fingerprinting; Proceedings of SIGMOD 2003; Jun. 9-12, 2003; 10 pages.
Moses S. Charikar; Similarity Estimation Techniques from Rounding Algorithms; STOC 2002; May 19-21, 2002; 9 pages.
Gautam Pant et al.; Crawling the Web; 2003; 25 pages.
Michael O. Rabin; Fingerprinting by Random Polynomials; Technical Report TR-15-81; Harvard University, 1981; 14 pages.
Office Action from U.S. Appl. No. 11/094,791, dated Jul. 23, 2007; 25 pages.
Arvind Jain et al.; U.S. Appl. No. 11/094,791, filed Mar. 31, 2005; “Near Duplicate Document Detection for Web Crawling”; 46 pages.
Co-pending U.S. Appl. No. 10/808,326, filed Mar. 25, 2004 entitled “Document Near-Duplicate Detection” by Shioupyn Shen, 33 pages.
Arvind Arasu et al.; Extracting Structured Data from Web Pages; Proceedings of the ACM SIGMOD 2003, 2003; 30 pages.
Brenda S. Baker; A Theory of Parameterized Pattern Matching: Algorithms and Applications (Extended Abstract); 25th ACM STOC 1993; 1993; pp. 71-80.
Brenda S. Baker; On Finding Duplication and Near-Duplication in Large Software Systems; Proceedings of the 2nd Working Conference on Reverse Engineering; 1995; 10 pages.
Andrei Z. Broder; On the Resemblance and Containment of Documents; Proceedings of the Compression and Complexity of Sequences; 1997; pp. 1-9.
Krishna Bharat et al.; Mirror, Mirror on the Web: A Study of Host Pairs of Replicated Content; Proceedings of the 8th International Conference on World Wide Web (WWW 1999); 17 pages.
Krishna Bharat et al.; A Comparison of Techniques to Find Mirrored Hosts on the WWW; Journal of the American Society for Information Science; 2000; 11 pages.
Krishna Bharat; The Connectivity Server: Fast Access to Linkage Information on the Web; Proceedings of the 7th International Conference on the World Wide Web; 1998; 13 pages.
Andrei Z. Broder et al.; Min-Wise Independent Permutations; Proceedings of STOC; 1998; pp. 630-659.
Sergey Brin et al.; Detection Mechanisms for Digital Documents; Proceedings of the ACM SIGMOD Annual Conference, 1995; 12 pages.
Andrei Z. Broder et al.; Syntactic Clustering of the Web; Proceedings of WWW6, 1997; 13 pages.
James W. Cooper et al.; Detecting Similar Documents Using Salient Terms; Proceedings of the CIKM 2002; Nov. 2002; pp. 245-251.
Edith Cohen et al.; Finding Interesting Associations without Support Pruning; Proceedings of the 16th ICDE; 2000; 12 pages.
Abdur Chowdhury et al.; Collection Statistics for Fast Duplicate Document Detection; ACM Transactions on Information Systems; vol. 20, No. 2; Apr. 2002; pp. 171-191.
Zhiyuan Chen et al.; Selectivity Estimation for Boolean Queries; Proceedings of PODS 2000; 2000; 10 pages.
Jack G. Conrad et al.; Constructing a Text Corpus for Inexact Duplicate Detection; SIGIR 2004; Jul. 2004; pp. 582-583.
Scott Deerwester et al.; Indexing by Latent Semantic Analysis; Journal of the American Society for Information Science; 1990; 34 pages.
Jeffrey Dean et al.; Finding Related Pages in the World Wide Web; Proceedings of the Eighth International World Wide Web Conference; 1999; pp. 1-15.
Aristides Gionis et al.; Efficient and Tunable Similar Set Retrieval; Proceedings of SIGMOD 2001; 2001; 12 pages.
Taher H. Haveliwala et al.; Scalable Techniques for Clustering the Web; Proceedings of the 3rd International Workshop on the Web and Databases (WebDB 2000); 2000; 6 pages.
Taher H. Haveliwala et al.; Evaluating Strategies for Similarity Search on the Web; Proceedings of the 11th International World Wide Web Conference; May 2002; 11 pages.
Khaled M. Hammouda et al.; Efficient Phrase-Based Document Indexing for Web Document Clustering; IEEE Transactions on Knowledge and Data Engineering; vol. 16, No. 10, Oct. 2004; pp. 1279-1296.
Timothy C. Hoad et al.; Methods for Identifying Versioned and Plagiarised Documents; Journal of the American Society for Information Science and Technology; 2003; pp. 1-18.
Sachindra Joshi et al.; A Bag of Paths Model for Measuring Structural Similarity in Web Documents; Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD 2003); Aug. 2003; pp. 577-582.
Jon M. Kleinberg; Authoritative Sources in a Hyperlinked Environment; Journal of the ACM; 1999; 34 pages.
Aleksander Kolcz et al.; Improved Robustness of Signature-Based Near-Replica Detection via Lexicon Randomization; SIGKDD 2004; Aug. 2004; 6 pages.
Ravi Kumar et al.; Trawling the Web for Emerging Cyber-Communities; Computer Networks: The International Journal of Computer and Telecommunications Networks; 1999; 21 pages.
Udi Manber; Finding Similar Files in a Large File System; Proceedings of the 1994 USENIX Conference; 1994; 11 pages.
Athicha Muthitacharoen et al.; A Low-Bandwidth Network File System; Proceedings of the 18th ACM Symposium on Operating System Principles (SOSP 2001); 2001; 14 pages.
Sean Quinlan et al.; Venti: A new Approach to Archival Storage; First USENIX Conference on File and Storage Technologies; 2002; 13 pages.
