Filtering invalid tokens from a document using high IDF...

Data processing: database and file management or data structures – Database and file access – Preparing data for information retrieval

Reexamination Certificate

Details

C707S750000, C707S754000, C704S010000, C717S174000

active

07908279

ABSTRACT:
Systems and methods for filtering tokens from a document for determining whether the document describes substantially similar subject matter compared to another document are described. In one embodiment, a first document is obtained. This document is organized into a plurality of fields, and at least some of the fields include tokens representing the subject matter described by the document. A field of this document is selected and a token from within the selected field having the highest inverse document frequency (IDF) is selected. Those tokens that have a higher IDF than the selected token are removed. Using the remaining tokens, a determination is made as to whether the first document describes substantially similar subject matter to the subject matter described by a second document. An indication is provided as to whether the first document describes substantially similar subject matter to that described by a second document according to the determination.
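The filtering step described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation. It assumes a document is a dict mapping field names to token lists, that IDF is computed as log(N / document frequency) over a small corpus, and that similarity is measured with a Jaccard overlap against a cutoff of 0.5. The abstract does not specify the IDF formula or the similarity measure, and the function names (idf_table, filter_tokens, similar) and the "title" field are hypothetical. The sketch also filters both documents, whereas the abstract only describes filtering the first.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Map each token to log(N / df); df is the number of documents containing it."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Count each token once per document, across all of its fields.
        seen = {tok for tokens in doc.values() for tok in tokens}
        doc_freq.update(seen)
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


def filter_tokens(doc, idf, field):
    """Remove tokens whose IDF exceeds the highest IDF found in `field`."""
    unseen = float("inf")  # assumption: tokens missing from the table are maximally rare
    if not doc.get(field):
        # No tokens in the selected field, so there is no threshold to apply.
        return {name: list(tokens) for name, tokens in doc.items()}
    threshold = max(idf.get(tok, unseen) for tok in doc[field])
    return {
        name: [tok for tok in tokens if idf.get(tok, unseen) <= threshold]
        for name, tokens in doc.items()
    }


def jaccard(tokens_a, tokens_b):
    """Jaccard overlap of two token collections (assumed similarity measure)."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def all_tokens(doc):
    """Flatten every field of a document into one token list."""
    return [tok for tokens in doc.values() for tok in tokens]


def similar(doc_a, doc_b, idf, field="title", cutoff=0.5):
    """Compare the tokens that survive filtering; the cutoff is an assumed value."""
    kept_a = all_tokens(filter_tokens(doc_a, idf, field))
    kept_b = all_tokens(filter_tokens(doc_b, idf, field))
    return jaccard(kept_a, kept_b) >= cutoff
```

With this sketch, a rare stray token in a description field (for example a garbled string that appears in only one document) has a higher IDF than any token in the title field, so it is dropped before the comparison is made.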

REFERENCES:
patent: 4849898 (1989-07-01), Adi
patent: 5062074 (1991-10-01), Kleinberger
patent: 5261112 (1993-11-01), Futatsugi
patent: 5835892 (1998-11-01), Kanno
patent: 5960383 (1999-09-01), Fleischer
patent: 6038561 (2000-03-01), Snyder
patent: 6075896 (2000-06-01), Tanaka
patent: 6076086 (2000-06-01), Masuichi
patent: 6167398 (2000-12-01), Wyard
patent: 6173251 (2001-01-01), Ito
patent: 6263121 (2001-07-01), Melen
patent: 6606744 (2003-08-01), Mikurak
patent: 6810376 (2004-10-01), Guan
patent: 6961721 (2005-11-01), Chaudhuri et al.
patent: 7113943 (2006-09-01), Bradford
patent: 7346839 (2008-03-01), Acharya
patent: 7386441 (2008-06-01), Kempe
patent: 7426507 (2008-09-01), Patterson
patent: 7529756 (2009-05-01), Haschart et al.
patent: 7562088 (2009-07-01), Daga et al.
patent: 7567959 (2009-07-01), Patterson
patent: 7599914 (2009-10-01), Patterson
patent: 7603345 (2009-10-01), Patterson
patent: 2002/0016787 (2002-02-01), Kanno
patent: 2003/0065658 (2003-04-01), Matsubayashi
patent: 2003/0101177 (2003-05-01), Matsubayashi
patent: 2006/0112128 (2006-05-01), Brants
patent: 2006/0282415 (2006-12-01), Shibata
patent: 2007/0067157 (2007-03-01), Kaku et al.
patent: 2009/0119281 (2009-05-01), Wang et al.
patent: 2009/0204609 (2009-08-01), Labrou et al.
patent: 1 380 966 (2004-01-01), None
Bilenko et al, ‘Adaptive Name Matching in Information Integration’, 2003, IEEE Computer Society, pp. 16-23.
J. Ramos, ‘Using TF-IDF to Determine Word Relevance in Document Queries’, 2001, Citeseer, pp. 1-4.
A. Kilgarriff, ‘Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora’, 1997, Citeseer, pp. 231-245.
Conrad et al, ‘Online Duplicate Document Detection: Signature Reliability in a Dynamic Retrieval Environment’, Nov. 3-8, 2003, ACM, CIKM '03, pp. 443-452.
Ghahramani, Z., and K.A. Heller, “Bayesian Sets,” Advances in Neural Information Processing Systems 18 (2006), 8 pages.
“Google Sets,” ©2007 Google, <http://labs.google.com/sets> [retrieved Feb. 13, 2008].
