Document similarity detection

Data processing: database and file management or data structures – Database and file access – Preparing data for information retrieval

Reexamination Certificate

Details

active

07734627

ABSTRACT:
A similarity detector detects similar or near-duplicate occurrences of a document. The similarity detector determines similarity of documents by characterizing each document as a cluster made up of a set of term entries, such as pairs of terms. A pair of terms, for example, indicates that the first term of the pair occurs before the second term of the pair in the underlying document. Another document that has at least a threshold level of term entries in common with a cluster is considered similar to the document characterized by that cluster.
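
The idea in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the tokenizer, the `window` limit on how far apart two paired terms may be, the fraction-based overlap measure, and the default `threshold` are all assumptions made for illustration, and the names `term_pairs` and `is_similar` are hypothetical.

```python
import re


def term_pairs(text, window=5):
    """Return the set of ordered (earlier, later) term pairs found in text.

    A pair (a, b) means term a occurs before term b in the document.
    The regex tokenizer and the `window` limit are assumptions of this
    sketch; the abstract does not specify either.
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    pairs = set()
    for i, first in enumerate(terms):
        # Pair each term with up to `window` terms that follow it.
        for second in terms[i + 1 : i + 1 + window]:
            if first != second:
                pairs.add((first, second))
    return pairs


def is_similar(cluster, text, threshold=0.7, window=5):
    """Return True if text shares at least `threshold` of the cluster's pairs.

    Measuring overlap as a fraction of the cluster is an assumption;
    the abstract only says "a threshold level of term entries in common".
    """
    if not cluster:
        return False
    shared = len(cluster & term_pairs(text, window))
    return shared / len(cluster) >= threshold


# The cluster characterizes one document; other documents are tested against it.
cluster = term_pairs("the quick brown fox jumps over the lazy dog")
near_duplicate = "the quick brown fox leaps over the lazy dog"
unrelated = "stock prices fell sharply in early trading today"

print(is_similar(cluster, near_duplicate))
print(is_similar(cluster, unrelated))
```

Because the pairs are ordered, `("quick", "fox")` belongs to the cluster while `("fox", "quick")` does not, so a document that uses the same words in a different order shares fewer entries than a true near duplicate.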

REFERENCES:
patent: 4358824 (1982-11-01), Glickman et al.
patent: 4691341 (1987-09-01), Knoble et al.
patent: 4823306 (1989-04-01), Barbic et al.
patent: 4839853 (1989-06-01), Deerwester et al.
patent: 5297039 (1994-03-01), Kanaegami et al.
patent: 5321833 (1994-06-01), Chang et al.
patent: 5418951 (1995-05-01), Damashek
patent: 5442546 (1995-08-01), Kaji et al.
patent: 5442778 (1995-08-01), Pedersen et al.
patent: 5619709 (1997-04-01), Caid et al.
patent: 5640553 (1997-06-01), Schultz
patent: 5652898 (1997-07-01), Kaji
patent: 5675819 (1997-10-01), Schuetze
patent: 5805771 (1998-09-01), Muthusamy et al.
patent: 5867811 (1999-02-01), O'Donoghue
patent: 5909677 (1999-06-01), Broder et al.
patent: 5913185 (1999-06-01), Martino et al.
patent: 5913208 (1999-06-01), Brown et al.
patent: 5926812 (1999-07-01), Hilsenrath et al.
patent: 5963940 (1999-10-01), Liddy et al.
patent: 6098033 (2000-08-01), Richardson et al.
patent: 6112021 (2000-08-01), Brand
patent: 6119124 (2000-09-01), Broder et al.
patent: 6161130 (2000-12-01), Horvitz et al.
patent: 6169999 (2001-01-01), Kanno
patent: 6192360 (2001-02-01), Dumais et al.
patent: 6621930 (2003-09-01), Smadja
patent: 6687696 (2004-02-01), Hofmann et al.
patent: 6990628 (2006-01-01), Palmer et al.
patent: 7188106 (2007-03-01), Dwork et al.
“A Bayesian Approach to Filtering Junk E-Mail” by M. Sahami et al. Published 1998. Accessed Jun. 18, 2006. Available from: http://research.microsoft.com/users/horvitz/junkfilter.htm.
“Inverted index—Wikipedia” by Wikipedia. Accessed Jun. 18, 2006. Available from: http://en.wikipedia.org/wiki/Inverted_index.
“Inverted index” by P.E. Black in “Dictionary of Algorithms and Data Structures” by U.S. National Institute of Standards and Technology. Dec. 17, 2004. Accessed Jun. 18, 2006. Available from: http://www.nist.gov/dads/HTML/invertedIndex.html.
“Levenshtein Distance, in Three Flavors” by Michael Gilleland. Available online at http://www.merriampark.com/Id.htm. Accessed Jan. 8, 2006.
Paul Graham. “A Plan for Spam” Published Aug. 2002. Accessed Sep. 28, 2007. Available online at http://www.paulgraham.com/spam.html.
Paul Graham. “Better Bayesian Filtering” Published Jan. 2003. Accessed Sep. 28, 2007. Available online at http://www.paulgraham.com/better.html.
H. Drucker, W. Donghui, V.N. Vapnik, “Support vector machines for spam categorization,” Neural Networks, IEEE Transactions on, vol. 10, No. 5, pp. 1048-1054, Sep. 1999.
Baker, L. D. and McCallum, A. K. 1998. Distributional clustering of words for text classification. In Proceedings of the 21st Annual international ACM SIGIR Conference on Research and Development in information Retrieval (Melbourne, Australia, Aug. 24-28, 1998). SIGIR '98. ACM Press, NY, NY, 96-103. DOI= http://doi.acm.org/10.1145/290941.290970.
Cohen, W. W. 1996a. Learning rules that classify e-mail. In Papers from the AAAI Spring Symposium on Machine Learning in Information Access, 18-25. http://citeseer.ist.psu.edu/cohen96learning.html.
Rennie, Jason. “ifile: An Application of Machine Learning to E-Mail Filtering.” CMU, Dec. 1998. http://citeseer.ist.psu.edu/article/rennie98ifile.html.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407, http://citeseer.ist.psu.edu/deerwester90indexing.html.
Cliff. “Ask Slashdot: Seeking Prior Art on Markov-Based SPAM Filters?” Published Nov. 28, 2002. Accessed Sep. 28, 2007. Available online at: http://ask.slashdot.org/article.pl?sid=02/11/27/0841216.
Cutting, D. and Pedersen, J. 1990. Optimization for dynamic inverted index maintenance. In Proceedings of the 13th Annual international ACM SIGIR Conference on Research and Development in information Retrieval (Brussels, Belgium, Sep. 5-7, 1990). J. Vidick, Ed. SIGIR '90. ACM, New York, NY, 405-411.
Zhai, C. 1997. Fast statistical parsing of noun phrases for document indexing. In Proceedings of the Fifth Conference on Applied Natural Language Processing (Washington, DC, Mar. 31-Apr. 3, 1997). Applied Natural Language Conferences. Association for Computational Linguistics, Morristown, NJ, 312-319. DOI= http://dx.doi.org/10.3115/974557.9746.
A. Chowdhury et al.: “Collection Statistics for Fast Duplicate Document Detection,” pp. 1-30, Apr. 2002.
Co-pending U.S. Appl. No. 10/425,819, filed Apr. 30, 2003 entitled “Systems and Methods for Predicting Lists,” 36 page specification, 16 sheets of drawings.
Co-pending Application entitled “Detecting Duplicate and Near-Duplicate Files,” filed Jan. 24, 2001; William Pugh et al.; 61 page specification; 18 sheets of drawings.
Lecture Notes, CS276A, “Text Information Retrieval, Mining, and Exploitation,” Nov. 19, 2002, http://www.stanford.edu/class/cs276a/handouts/lecture_13-gin1.pdf.
Andrei Z. Broder et al.: “Syntactic Clustering of the Web,” Proc. 6th International World Wide Web Conference; Apr. 1997 and SRC Technical Note 1997-015; Jul. 25, 1997; pp. 1-14.
Andrei Z. Broder: “On the resemblance and containment of documents,” Proc. of Compression and Complexity of Sequences 1997; IEEE Computer Society; pp. 1-9.
Sergey Brin et al.: “Copy Detection Mechanisms for Digital Documents,” Proc. of ACM SIGMOD Annual Conference; San Jose, CA 1995; pp. 1-21.
Andrei Z. Broder: “Some applications of Rabin's fingerprinting method,” Sequences II: Methods in Communications, Security, and Computer Science; (Springer-Verlag, 1993); pp. 1-10.
Min Fang et al.: “Computing Iceberg Queries Efficiently,” Proc. 24th International Conference on Very Large Databases; (1998); pp. 1-25.
