Data processing: database and file management or data structures – Database and file access – Preparing data for information retrieval
Reexamination Certificate
2011-03-15
Trujillo, James (Department: 2159)
C707S750000, C707S754000, C704S010000, C717S174000
active
07908279
ABSTRACT:
Systems and methods for filtering tokens from a document for determining whether the document describes substantially similar subject matter compared to another document are described. In one embodiment, a first document is obtained. This document is organized into a plurality of fields, and at least some of the fields include tokens representing the subject matter described by the document. A field of this document is selected and a token from within the selected field having the highest inverse document frequency (IDF) is selected. Those tokens that have a higher IDF than the selected token are removed. Using the remaining tokens, a determination is made as to whether the first document describes substantially similar subject matter to the subject matter described by a second document. An indication is provided as to whether the first document describes substantially similar subject matter to that described by a second document according to the determination.
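The filtering step in the abstract can be illustrated with a minimal sketch. Everything in it beyond the abstract's wording is an assumption made for illustration: the function names (`idf_table`, `filter_high_idf`, `jaccard`), the IDF formula log(N / df), the reading that tokens are removed from every field of the document (not only the selected one), and the use of Jaccard overlap as the similarity test. The patent record here does not specify any of these.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Map each token to log(N / df); df = number of documents containing it.

    Assumption: this plain IDF formula stands in for whatever the patent uses.
    """
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        # Count each token at most once per document, across all fields.
        df.update({tok for tokens in doc.values() for tok in tokens})
    return {tok: math.log(n_docs / count) for tok, count in df.items()}


def filter_high_idf(doc, idf, field):
    """Drop tokens whose IDF exceeds the highest IDF found in `field`.

    Assumption: removal applies to every field of the document.
    Tokens missing from the IDF table are treated as infinitely rare.
    """
    unseen = float("inf")
    # An empty selected field yields an infinite threshold, so nothing is removed.
    threshold = max((idf.get(t, unseen) for t in doc[field]), default=unseen)
    return {
        name: [t for t in tokens if idf.get(t, unseen) <= threshold]
        for name, tokens in doc.items()
    }


def jaccard(doc_a, doc_b):
    """Token-set overlap; a hypothetical stand-in for the similarity decision."""
    a = {t for tokens in doc_a.values() for t in tokens}
    b = {t for tokens in doc_b.values() for t in tokens}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical documents: d1 carries a stray token ("xq9z17") in its description.
d1 = {"title": ["red", "cotton", "shirt"],
      "description": ["soft", "red", "shirt", "xq9z17"]}
d2 = {"title": ["red", "cotton", "shirt"],
      "description": ["soft", "cotton", "shirt"]}
d3 = {"title": ["red", "wool", "scarf"], "description": ["warm", "scarf"]}
d4 = {"title": ["blue", "denim", "jeans"], "description": ["soft", "denim"]}

idf = idf_table([d1, d2, d3, d4])
filtered = filter_high_idf(d1, idf, field="title")

print(filtered["description"])   # the stray token is gone
print(jaccard(d1, d2), jaccard(filtered, d2))
```

In this toy corpus the stray token appears in only one document, so its IDF exceeds that of every title token and it is filtered out. The overlap between the two shirt documents rises from 0.8 to 1.0 as a result. A real system would still need a policy for choosing the field and a similarity threshold, neither of which is given in this record.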
REFERENCES:
patent: 4849898 (1989-07-01), Adi
patent: 5062074 (1991-10-01), Kleinberger
patent: 5261112 (1993-11-01), Futatsugi
patent: 5835892 (1998-11-01), Kanno
patent: 5960383 (1999-09-01), Fleischer
patent: 6038561 (2000-03-01), Snyder
patent: 6075896 (2000-06-01), Tanaka
patent: 6076086 (2000-06-01), Masuichi
patent: 6167398 (2000-12-01), Wyard
patent: 6173251 (2001-01-01), Ito
patent: 6263121 (2001-07-01), Melen
patent: 6606744 (2003-08-01), Mikurak
patent: 6810376 (2004-10-01), Guan
patent: 6961721 (2005-11-01), Chaudhuri et al.
patent: 7113943 (2006-09-01), Bradford
patent: 7346839 (2008-03-01), Acharya
patent: 7386441 (2008-06-01), Kempe
patent: 7426507 (2008-09-01), Patterson
patent: 7529756 (2009-05-01), Haschart et al.
patent: 7562088 (2009-07-01), Daga et al.
patent: 7567959 (2009-07-01), Patterson
patent: 7599914 (2009-10-01), Patterson
patent: 7603345 (2009-10-01), Patterson
patent: 2002/0016787 (2002-02-01), Kanno
patent: 2003/0065658 (2003-04-01), Matsubayashi
patent: 2003/0101177 (2003-05-01), Matsubayashi
patent: 2006/0112128 (2006-05-01), Brants
patent: 2006/0282415 (2006-12-01), Shibata
patent: 2007/0067157 (2007-03-01), Kaku et al.
patent: 2009/0119281 (2009-05-01), Wang et al.
patent: 2009/0204609 (2009-08-01), Labrou et al.
patent: 1 380 966 (2004-01-01), None
Bilenko et al., ‘Adaptive Name Matching in Information Integration’, 2003, IEEE Computer Society, pp. 16-23.
J. Ramos, ‘Using TF-IDF to Determine Word Relevance in Document Queries’, 2001, Citeseer, pp. 1-4.
A. Kilgarriff, ‘Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora’, 1997, Citeseer, pp. 231-245.
Conrad et al., ‘Online Duplicate Document Detection: Signature Reliability in a Dynamic Retrieval Environment’, Nov. 3-8, 2003, ACM, CIKM '03, pp. 443-452.
Ghahramani, Z., and K.A. Heller, “Bayesian Sets,” Advances in Neural Information Processing Systems 18 (2006), 8 pages.
“Google Sets,” ©2007 Google, <http://labs.google.com/sets> [retrieved Feb. 13, 2008].
INVENTORS:
Emery Grant M.
Manoharan Aswath
Mohan Vijai
Terra Egidio
Thirumalai Srikanth
ASSIGNEE:
Amazon Technologies Inc.
ATTORNEY OR AGENT:
Kowert Robert C.
Meyertons Hood Kivlin Kowert & Goetzel P.C.
EXAMINERS:
Shechtman Cheryl M
Trujillo James