Systems and methods for identifying similar documents

Data processing: database and file management or data structures – Database and file access – Record, file, and data search and comparisons

Reexamination Certificate


Details

C707S705000, C707S802000

active

07958136

ABSTRACT:
The present invention provides systems and methods for identifying similar documents. In an embodiment, the present invention identifies similar documents by (1) receiving document text for a current document that includes at least one word; (2) calculating a prominence score and a descriptiveness score for each word and each pair of consecutive words; (3) calculating a comparison metric for the current document; (4) finding at least one potential document, where document text for each potential document includes at least one of the words; and (5) analyzing each potential document to identify at least one similar document.
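The five steps in the abstract can be sketched in Python as follows. This is only an illustrative sketch, not the patented method: the abstract does not define the prominence score, the descriptiveness score, or the comparison metric, so the formulas below are assumptions (prominence as relative term frequency, descriptiveness as a smoothed inverse document frequency, and the comparison metric as the resulting weight vector compared by cosine similarity). The function names `terms`, `term_weights`, `cosine`, and `find_similar`, and the `threshold` parameter, are hypothetical.

```python
import math
import re
from collections import Counter


def terms(text):
    """Return the words and the pairs of consecutive words in a document."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


def term_weights(text, corpus_terms):
    """Weight each term by prominence * descriptiveness (both assumed formulas)."""
    counts = Counter(terms(text))
    total = sum(counts.values())
    n_docs = len(corpus_terms)
    weights = {}
    for term, count in counts.items():
        # Assumption: prominence is the term's relative frequency in the document.
        prominence = count / total
        # Assumption: descriptiveness is a smoothed inverse document frequency.
        doc_freq = sum(1 for doc in corpus_terms if term in doc)
        descriptiveness = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        weights[term] = prominence * descriptiveness
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar(current, corpus, threshold=0.2):
    """Return (index, score) pairs for corpus documents similar to `current`."""
    corpus_terms = [set(terms(doc)) for doc in corpus]
    # Steps 1-3: score every word and word pair, giving the comparison metric.
    current_weights = term_weights(current, corpus_terms)
    results = []
    for index, doc in enumerate(corpus):
        # Step 4: a potential document shares at least one term with the current one.
        if not set(current_weights) & corpus_terms[index]:
            continue
        # Step 5: analyze the potential document against the current document.
        score = cosine(current_weights, term_weights(doc, corpus_terms))
        if score >= threshold:
            results.append((index, score))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

A call such as `find_similar("the cat sat on the mat", ["a cat sat on a mat", "stock markets fell sharply today"])` skips the second document at step 4 because it shares no words with the current document. The candidate filter is a linear scan here for brevity; a real system would more likely use an inverted index.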

REFERENCES:
patent: 5826261 (1998-10-01), Spencer
patent: 5867799 (1999-02-01), Lang et al.
patent: 5963893 (1999-10-01), Halstead et al.
patent: 6366759 (2002-04-01), Burstein et al.
patent: 6385579 (2002-05-01), Padmanabhan et al.
patent: 6507839 (2003-01-01), Ponte
patent: 6678694 (2004-01-01), Zimmermann et al.
patent: 6810376 (2004-10-01), Guan et al.
patent: 6990628 (2006-01-01), Palmer et al.
patent: 7185001 (2007-02-01), Burdick et al.
patent: 7200587 (2007-04-01), Matsubayashi et al.
patent: 7246117 (2007-07-01), Peh
patent: 7370034 (2008-05-01), Franciosa et al.
patent: 7536413 (2009-05-01), Mohan et al.
patent: 7567953 (2009-07-01), Kadayam et al.
patent: 7765218 (2010-07-01), Bates et al.
patent: 2003/0018629 (2003-01-01), Namba
patent: 2004/0015342 (2004-01-01), Garst
patent: 2004/0044952 (2004-03-01), Jiang et al.
patent: 2004/0111264 (2004-06-01), Wang et al.
patent: 2005/0228783 (2005-10-01), Shanahan et al.
patent: 2005/0256712 (2005-11-01), Yamada et al.
patent: 2006/0112068 (2006-05-01), Zhang et al.
patent: 2006/0149820 (2006-07-01), Rajan et al.
patent: 2006/0230033 (2006-10-01), Halevy et al.
patent: 2006/0242190 (2006-10-01), Wnek
patent: 2007/0019864 (2007-01-01), Koyama et al.
patent: 2007/0112898 (2007-05-01), Evans et al.
patent: 2007/0174267 (2007-07-01), Patterson et al.
patent: 2008/0205775 (2008-08-01), Brinker et al.
patent: 2009/0037389 (2009-02-01), Kothari et al.
patent: 2009/0125498 (2009-05-01), Cao et al.
patent: 2009/0125805 (2009-05-01), Ananthanarayanan et al.
patent: 2009/0198677 (2009-08-01), Sheehy et al.
Wang et al., “An Unsupervised Quantitative Measure for Word Prominence in Spontaneous Speech”, IEEE, 2005, pp. 377-380, accessed online at <http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=01415129&tag=1> on Jul. 1, 2010.
Brants et al., “Finding Similar Documents in Document Collection”, Proceedings of the Third International Conference on Language Resources and Evaluation, 2002, 7 pages.
Wan et al., “Document Similarity Search Based on Generic Summaries”, AIRS 2005, pp. 635-640.
Cooper et al., “A Novel Method for Detecting Similar Documents”, Proceedings of the 35th Annual Hawaii International Conference on System Sciences, 2002, 7 pages.
Cooper et al., “Anti-Serendipity—Finding Useless Documents and Similar Documents”, Proceedings of the 33rd Hawaii International Conference on System Sciences, 2000, 9 pages.
E. Garcia, “Term Vector Theory and Keyword Weights—An Introduction Series on Term Vector Theory for Information Retrieval Students and Search Engines Marketers”, 2006, 5 pages, accessed online at <http://www.miislita.com/term-vector/term-vector-1.html> on Jan. 8, 2011.
Lee et al., “Using Fuzzy-Word Correlation Factors to Compute Document Similarity Based on Phrase Matching”, FSKD 2007, Aug. 24-27, 2007, vol. 2, pp. 186-192.
Lai et al., “Similarity Score for Information Filtering Thresholds”, ISCIT 2004, Oct. 26-29, 2004, vol. 1, pp. 216-221.
Li et al., “An Efficient Document Categorization Model Based on LSA and BPNN”, ALPIT 2007, Aug. 22-24, 2007, pp. 9-14.
Tata et al., “Estimating the Selectivity of tf-idf based Cosine Similarity Predicates”, SIGMOD Record, Jun. 2007, vol. 36, No. 2, pp. 7-12.
Blei et al., “Latent Dirichlet Allocation”, Journal of Machine Learning Research 3, Jan. 2003, pp. 993-1022.


Profile ID: LFUS-PAI-O-2715906
