Document near-duplicate detection


Details

Date: 2011-06-14
Examiner: Vo, Tim T. (Department: 2168)
Classification: Data processing: database and file management or data structures – Database and file access – Preparing data for information retrieval

Classification codes: C707S738000, C707S748000
Type: Reexamination Certificate
Status: active
Patent number: 07962491

ABSTRACT:
A near-duplicate component includes a fingerprint creation component and a similarity detection component. The fingerprint creation component receives a document of arbitrary size and generates a compact “fingerprint” that describes the contents of the document. The similarity detection component compares multiple fingerprints based on the Hamming distance between the fingerprints. When the Hamming distance is below a threshold, the documents can be said to be near-duplicates of one another.
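
The two components described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the fingerprint is computed, so this sketch substitutes a SimHash-style scheme (in the spirit of the Charikar reference listed below) purely as a stand-in. The 64-bit width, the whitespace tokenizer, the SHA-256 token hash, the default threshold of 4, and all function names are assumptions made for illustration, not details taken from the patent.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width; the abstract only says "compact"


def fingerprint(text: str, bits: int = FINGERPRINT_BITS) -> int:
    """Map a document of arbitrary size to a fixed-width integer fingerprint.

    SimHash-style stand-in: every token votes +1 or -1 on each bit position,
    and the sign of the total vote decides the fingerprint bit.
    """
    votes = [0] * bits
    for token in text.lower().split():  # assumed tokenizer: whitespace split
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (token_hash >> i) & 1 else -1
    result = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            result |= 1 << i
    return result


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fp_a: int, fp_b: int, threshold: int = 4) -> bool:
    """Report near-duplicates when the Hamming distance is below the threshold."""
    return hamming_distance(fp_a, fp_b) < threshold


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brown fox jumped over the lazy dog"
    fp_a, fp_b = fingerprint(doc_a), fingerprint(doc_b)
    print(hamming_distance(fp_a, fp_b), is_near_duplicate(fp_a, fp_b))
```

Because each comparison reduces to an XOR and a bit count on two fixed-width integers, its cost does not depend on document length. A suitable threshold depends on the fingerprint width and on the corpus, so the value used here should be treated as a placeholder.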

REFERENCES:
patent: 4290105 (1981-09-01), Cichelli et al.
patent: 5745900 (1998-04-01), Burrows
patent: 5845005 (1998-12-01), Setlak et al.
patent: 6230155 (2001-05-01), Broder et al.
patent: 6286006 (2001-09-01), Bharat et al.
patent: 6351755 (2002-02-01), Najork et al.
patent: 6393438 (2002-05-01), Kathrow et al.
patent: 6658423 (2003-12-01), Pugh et al.
patent: 2002/0133499 (2002-09-01), Ward et al.
patent: 2003/0105716 (2003-06-01), Sutton et al.
patent: 2003/0208761 (2003-11-01), Wasserman et al.
patent: 2005/0210043 (2005-09-01), Manasse
patent: 2006/0036684 (2006-02-01), Schwerk
Broder, Andrei et al., “Clustering the Web”, http://research.compaq.com/src/articles/199707/cluster.html; Systems Research Center, Palo Alto, California, Feb. 4, 2004 (Print Date), 5 pages.
Narayanan Shivakumar et al.; SCAM: A Copy Detection Mechanism for Digital Documents; Proceedings of the 2nd International Conference on Theory and Practice of Digital Libraries; 1995; 9 pages.
Saul Schleimer; Winnowing: Local Algorithms for Document Fingerprinting; Proceedings of SIGMOD 2003; Jun. 9-12, 2003; 10 pages.
Moses S. Charikar; Similarity Estimation Techniques from Rounding Algorithms; STOC 2002; May 19-21, 2002; 9 pages.
Gautam Pant et al.; Crawling the Web; 2003; 25 pages.
Michael O. Rabin; Fingerprinting by Random Polynomials; Technical Report TR-15-81; Harvard University, 1981; 14 pages.
Office Action from U.S. Appl. No. 11/094,791, dated Jul. 23, 2007; 25 pages.
Arvind Jain et al.; U.S. Appl. No. 11/094,791, filed Mar. 31, 2005; “Near Duplicate Document Detection for Web Crawling”; 46 pages.
Co-pending U.S. Appl. No. 10/808,326, filed Mar. 25, 2004 entitled “Document Near-Duplicate Detection” by Shioupyn Shen, 33 pages.
Arvind Arasu et al.; Extracting Structured Data from Web Pages; Proceedings of the ACM SIGMOD 2003, 2003; 30 pages.
Brenda S. Baker; A Theory of Parameterized Pattern Matching: Algorithms and Applications (Extended Abstract); 25th ACM STOC 1993; 1993; pp. 71-80.
Brenda S. Baker; On Finding Duplication and Near-Duplication in Large Software Systems; Proceedings of the 2nd Working Conference on Reverse Engineering; 1995; 10 pages.
Andrei Z. Broder; On the Resemblance and Containment of Documents; Proceedings of the Compression and Complexity of Sequences; 1997; pp. 1-9.
Krishna Bharat et al.; Mirror, Mirror on the Web: A Study of Host Pairs of Replicated Content; Proceedings of the 8th International Conference on World Wide Web (WWW 1999); 17 pages.
Krishna Bharat et al.; A Comparison of Techniques to Find Mirrored Hosts on the WWW; Journal of the American Society for Information Science; 2000; 11 pages.
Krishna Bharat; The Connectivity Server: Fast Access to Linkage Information on the Web; Proceedings of the 7thInternational Conference on the World Wide Web; 1998; 13 pages.
Andrei Z. Broder et al.; Min-Wise Independent Permutations; Proceedings of STOC; 1998; pp. 630-659.
Sergey Brin et al.; Detection Mechanisms for Digital Documents; Proceedings of the ACM SIGMOD Annual Conference, 1995; 12 pages.
Andrei Z. Broder et al.; Syntactic Clustering of the Web; Proceedings of WWW6, 1997; 13 pages.
James W. Cooper et al.; Detecting Similar Documents Using Salient Terms; Proceedings of the CIKM 2002; Nov. 2002; pp. 245-251.
Edith Cohen et al.; Finding Interesting Associations without Support Pruning; Proceedings of the 16th ICDE; 2000; 12 pages.
Abdur Chowdhury et al.; Collection Statistics for Fast Duplicate Document Detection; ACM Transactions on Information Systems; vol. 20, No. 2; Apr. 2002; pp. 171-191.
Zhiyuan Chen et al.; Selectivity Estimation for Boolean Queries; Proceedings of PODS 2000; 2000; 10 pages.
Jack G. Conrad et al.; Constructing a Text Corpus for Inexact Duplicate Detection; SIGIR 2004; Jul. 2004; pp. 582-583.
Scott Deerwester et al.; Indexing by Latent Semantic Analysis; Journal of the American Society for Information Science; 1990; 34 pages.
Jeffrey Dean et al.; Finding Related Pages in the World Wide Web; Proceedings of the Eighth International World Wide Web Conference; 1999; pp. 1-15.
Aristides Gionis et al.; Efficient and Tunable Similar Set Retrieval; Proceedings of SIGMOD 2001; 2001; 12 pages.
Taher H. Haveliwala et al.; Scalable Techniques for Clustering the Web; Proceedings of the 3rd International Workshop on the Web and Databases (WebDB 2000); 2000; 6 pages.
Taher H. Haveliwala et al.; Evaluating Strategies for Similarity Search on the Web; Proceedings of the 11thInternational World Wide Web Conference; May 2002; 11 pages.
Khaled M. Hammouda et al.; Efficient Phrase-Based Document Indexing for Web Document Clustering; IEEE Transactions on Knowledge and Data Engineering; vol. 16, No. 10, Oct. 2004; pp. 1279-1296.
Timothy C. Hoad et al.; Methods for Identifying Versioned and Plagiarised Documents; Journal of the American Society for Information Science and Technology; 2003; pp. 1-18.
Sachindra Joshi et al.; A Bag of Paths Model for Measuring Structural Similarity in Web Documents; Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD 2003); Aug. 2003; pp. 577-582.
Jon M. Kleinberg; Authoritative Sources in a Hyperlinked Environment; Journal of the ACM; 1999; 34 pages.
Aleksander Kołcz et al.; Improved Robustness of Signature-Based Near-Replica Detection via Lexicon Randomization; SIGKDD 2004; Aug. 2004; 6 pages.
Ravi Kumar et al.; Trawling the Web for Emerging Cyber-Communities; Computer Networks: The International Journal of Computer and Telecommunications Networks; 1999; 21 pages.
Udi Manber; Finding Similar Files in a Large File System; Proceedings of the 1994 USENIX Conference; 1994; 11 pages.
Athicha Muthitacharoen et al.; A Low-Bandwidth Network File System; Proceedings of the 18th ACM Symposium on Operating System Principles (SOSP 2001); 2001; 14 pages.
Sean Quinlan et al.; Venti: A new Approach to Archival Storage; First USENIX Conference on File and Storage Technologies; 2002; 13 pages.

Affiliated with

Shen Shioupyn

Inventor

Also associated with

Google Inc.

Corporate Assignee

Harrity & Harrity LLP

Law Firm

Smith Garrett

Examiner

Vo Tim T.

Examiner
