Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-12-01
2001-09-04
Black, Thomas (Department: 2171)
C707S793000
active
06286000
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to information retrieval in data processing systems and, more particularly, to a document matcher that matches new documents to a database of stored documents in order to find the most relevant matches.
2. Background Description
The problem of matching a document against documents stored in a database has been addressed before, but previous approaches require substantial storage and computing resources. They employ much more complicated document representations and document matching algorithms than the present invention.
U.S. Pat. No. 4,358,824 to Glickman et al. discloses an office correspondence storage and retrieval system. Keywords are selected from a document using a part of speech dictionary. Comparison between a document and a query uses the part of speech and position of occurrence in the document, the number of pages in a document and whether or not the document includes a month and year. The present invention does not use any of these features.
U.S. Pat. No. 4,817,036 to Millett et al. discloses a computer system and method for data base indexing and information retrieval. In this system, an inverted index of the document data base is computed and stored. Query keywords are looked up in the index and the bit strings are manipulated to produce an answer vector from which the matching documents can be found. Aside from the generic use of keywords, this is entirely different from the present invention.
U.S. Pat. No. 5,371,807 to Register et al. discloses a method and apparatus for text classification. This invention describes a system in which the recognized keywords are used to deduce further facts about the document which are then used to compute category membership. The present invention does not use a fact data base for any purpose.
U.S. Pat. No. 5,418,948 to Turtle discloses concept matching of natural language queries with a database of document concepts. In this invention, query words are stemmed and sequences of stems are looked for in a phrase dictionary. The stemmed words and found phrases are used as query nodes in a query network which is matched against a document network. The present invention uses neither phrases nor query networks.
U.S. Pat. No. 5,694,559 to Hobson et al. discloses an on-line help method and system utilizing free text query. After identifying query keywords, this invention performs disambiguation and other forms of analysis. Each keyword is then associated with a concept. Each concept has a likelihood of being associated with a help topic. The present invention does not require analysis of identified keywords and does not have a defined set of concepts and probabilities associated with help topics.
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide a document matching solution that employs minimal processing and storage and is therefore suitable for installation directly in restricted environments, such as mobile or small desktop computers.
According to the invention, there is provided a lightweight document matcher that matches new documents to those stored in a database. The matcher lists, in order, those stored documents that are most similar to the new document. The new documents are typically problem statements or queries, and the stored documents are potential solutions such as FAQs (Frequently Asked Questions). Given a set of documents, titles, and possibly keywords, an automatic back-end process constructs a global dictionary of unique keywords and local dictionaries of relevant words for each document. The application front-end uses this information to score the relevance of stored documents to new documents. The scoring algorithm uses the count of matched words as a base score, and then assigns bonuses to words that have high predictive value. It optionally assigns an extra bonus for a match of the words in special sections, such as titles. The method uses minimal data structures and lightweight scoring algorithms to compute efficiently even in restricted environments, such as mobile or small desktop computers.
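The back-end and front-end steps summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (tokenize, build_index, rank), the stopword list, the choice of the most frequent words as each local dictionary, the use of low document frequency as a stand-in for "high predictive value", and the bonus weights are all assumptions made for illustration, since this section does not specify them.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption, not from the patent text).
STOPWORDS = {"a", "an", "the", "to", "of", "in", "is", "my", "i", "and", "on"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_index(docs, local_size=10):
    """Back-end step: docs is a list of (title, body) pairs.

    Returns the global dictionary, per-word document frequencies, and one
    entry per document holding its local dictionary and its title words.
    """
    doc_freq = Counter()
    per_doc = []
    for title, body in docs:
        counts = Counter(tokenize(title + " " + body))
        per_doc.append((set(tokenize(title)), counts))
        doc_freq.update(counts.keys())  # each word counted once per document
    global_dict = set(doc_freq)
    index = []
    for title_words, counts in per_doc:
        # Assumption: the local dictionary is the document's most frequent words.
        local = {w for w, _ in counts.most_common(local_size)}
        index.append({"local": local, "title": title_words})
    return global_dict, doc_freq, index


def rank(query, global_dict, doc_freq, index,
         rare_max_df=1, rare_bonus=1.0, title_bonus=0.5):
    """Front-end step: score every stored document against the new document."""
    query_words = set(tokenize(query)) & global_dict
    scored = []
    for doc_id, entry in enumerate(index):
        matched = query_words & entry["local"]
        score = float(len(matched))  # base score: count of matched words
        # Assumed proxy for predictive value: the word occurs in few documents.
        score += rare_bonus * sum(1 for w in matched if doc_freq[w] <= rare_max_df)
        # Optional extra bonus for matches in a special section (the title).
        score += title_bonus * len(matched & entry["title"])
        if score > 0:
            scored.append((doc_id, score))
    # Most similar documents first; ties broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


docs = [
    ("Printer will not print", "Check the printer cable and restart the print spooler."),
    ("Reset your password", "Open settings and choose reset password to get a new password by email."),
    ("Laptop battery drains quickly", "Lower the screen brightness and close unused programs."),
]
global_dict, doc_freq, index = build_index(docs)
print(rank("I forgot my password and need to reset it", global_dict, doc_freq, index))
```

In this sketch the index stores only small word sets per document, and scoring uses set intersections and counting, which reflects the stated goal of minimal data structures and lightweight scoring.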
Although the invention is designed for installation in, for example, mobile or small desktop computers, the invention can advantageously run on a large server. The approach taken in the practice of the invention is effective when resources are relatively scarce. What distinguishes the subject invention from traditional search engines are the local dictionary formation in the back-end process, the scoring computation in the front-end process, and the ability to accept as input a text stream of unlimited length.
REFERENCES:
patent: 4958284 (1990-09-01), Bishop et al.
patent: 5465353 (1995-11-01), Hull et al.
patent: 5848407 (1998-12-01), Ishikawa et al.
patent: 5963940 (1999-10-01), Liddy et al.
patent: 5987457 (1999-11-01), Ballard
patent: 6012057 (2000-01-01), Mayer et al.
Apte Chidanand
Damerau Frederick J.
Weiss Sholom M.
White Brian F.
Black Thomas
International Business Machines Corporation
Kaufman Stephen C.
Le Uyen
McGuireWoods LLP