Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-11-13
2002-04-09
Von Buhr, Maria N. (Department: 2171)
C707S793000
active
06370525
ABSTRACT:
BACKGROUND
For most users, a search of a database for documents related to a particular topic begins with the formulation of a search query for use by a search engine. The search engine then identifies documents that match the specifications that the user sets forth in the search query. These documents are then presented to the user, usually in an order that attempts to approximate the extent to which the documents match the specifications of the search query.
In its simplest form, the search query might be no more than a word or a phrase. However, such simple search queries typically result in the retrieval of far too many documents, many of which are likely to be irrelevant. To avoid this, search engines provide a mechanism for narrowing the search, typically by allowing the user to specify some Boolean combination of words and phrases. More complex search queries allow a user to specify that two Boolean combinations be found within a particular distance, usually measured in words, from each other. Known search queries can also provide wildcard characters or mechanisms for including or excluding certain word variants.
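The query forms described above can be illustrated with a minimal sketch. It is not taken from any particular search engine: the function names `tokenize`, `matches`, and `within_distance` are hypothetical, and the proximity test is simplified to two single words rather than two full Boolean combinations.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(text, all_of=(), any_of=(), none_of=()):
    """Boolean match: every word in all_of, at least one word in any_of
    (when any_of is given), and no word in none_of."""
    words = set(tokenize(text))
    return (
        all(w in words for w in all_of)
        and (not any_of or any(w in words for w in any_of))
        and not any(w in words for w in none_of)
    )


def within_distance(text, term_a, term_b, max_words):
    """True if term_a and term_b occur at most max_words word positions apart."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= max_words for a in pos_a for b in pos_b)


doc = "The search engine ranks each document by relevance."
boolean_hit = matches(doc, all_of=("search", "document"), none_of=("advertising",))
near_hit = within_distance(doc, "search", "relevance", 6)
far_miss = within_distance(doc, "search", "relevance", 3)
```

Here "search" and "relevance" are six word positions apart, so the proximity test succeeds with a limit of 6 and fails with a limit of 3.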
Regardless of its complexity, a search query is fundamentally no more than a user's best guess as to the distribution of alphanumeric characters that is likely to occur in a document containing the information of interest. The success of a search query thus depends on the user's skill in formulating the search query and on the predictability of the documents in the database. Hence, a search query of this type is likely to be most successful when the documents in the database are either inherently structured or under editorial control. Because of the necessity for thorough editorial review, such databases tend to be either somewhat specialized (for example, databases for patent searching or for searching case law) or slow to change (for example, CD-ROM encyclopedias).
Because of its distributed nature, the internet offers a breadth of up-to-date information. However, documents posted on the internet are often posted with little editorial control. As a result, many documents are plagued with inconsistencies and errors that reduce the effectiveness of a search engine. In addition, because the internet has become an advertising medium, many sites seek to attract visitors. Proprietors of those sites therefore pepper their pages with words that are invisible to the reader, as bait for attracting the attention of search engines. The presence of such invisible words thwarts the search engine's attempt to judge the relevancy of a document solely by the distribution of words in the document.
The unreliability associated with many documents on the internet poses a difficult problem when a search engine attempts to rank the relevance of retrieved documents. Because all the search engine knows is the distribution of words, it can do no more than indicate that the distribution of words in a document does or does not match the search query more closely than the distribution of words in another document. This can result in such a profusion of search results that it is impractical to examine them all. Moreover, because there is no absolute standard for relevance on the internet, there is no assurance that the most highly ranked document returned by a search engine is relevant at all. It may simply be the least irrelevant document in a collection of irrelevant documents.
Attempts have been made to improve the searchability of the internet by having human editors assess the reliability and relevance of particular sites. Addresses of those sites meeting a threshold of reliability are then provided to the user. For example, major publishers of encyclopedias on CD-ROM provide pre-selected links to internet sites in order to augment the materials provided on the CD-ROM. However, these attempts are hampered by the fact that internet sites can change, both in content and in address, overnight. Thus, a reviewed site that existed on publication of the CD-ROM may no longer exist when a user subsequently attempts to activate that link.
It is apparent that the dynamic and free-form nature of the internet results in a highly diversified and current storehouse of reference materials. However, the uncontrolled nature of documents on the internet results in an environment that is not readily searchable in an efficient manner by a conventional search engine.
SUMMARY
In accord with the method and apparatus of this invention, the relevance of documents retrieved by a search engine operating in an uncontrolled public database is considerably improved by also searching a controlled database, and by using the search results from the controlled database to assess the relevance of the documents retrieved from the public database.
The method of the invention includes the identification and ranking of a plurality of candidate documents on the basis of the similarity of each of the candidate documents to a user-query.
This method includes the step of parsing the user-query to generate both a list of one or more query-words and a distribution, within the user-query, of the query-words in that list. The user-query can be provided by the user or it can be an excerpt of text selected from a document referred to by the user.
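A minimal sketch of this parsing step follows. It assumes that the "distribution" can be represented as a count of each query-word within the user-query; the stop-word list and the function name `parse_query` are illustrative assumptions, not details given in the text.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the text does not say which words, if any, are dropped.
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "is", "of", "on", "the", "to"}


def parse_query(user_query):
    """Return (query_words, distribution) for a user-query.

    query_words:  unique words in order of first appearance.
    distribution: Counter mapping each query-word to its count in the query.
    """
    tokens = re.findall(r"[a-z0-9]+", user_query.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    distribution = Counter(kept)
    query_words = list(dict.fromkeys(kept))  # de-duplicate, keeping order
    return query_words, distribution


words, dist = parse_query("Retrieval of documents: ranking the documents by relevance")
```

The same function applies whether the user-query is typed by the user or is an excerpt selected from a document, since both arrive as plain text.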
The importance of each query-word in the user-query is then assessed on the basis of the frequency with which the query-word occurs in a database of candidate documents. In an optional feature of the invention, the step of parsing the query includes the step of providing additional query-words, referred to as derivative query-words, which are associated with the original query-words provided by the user. These derivative query-words are accorded lesser importance in the identification of candidate documents than are original query-words.
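The weighting described above might be sketched as follows. The text does not give a formula, so this example assumes a smoothed inverse document frequency (rarer words count for more) and a fixed discount for derivative query-words; `word_importance`, `weight_query_words`, and `DERIVATIVE_DISCOUNT` are hypothetical names and values.

```python
import math

# Hypothetical factor; the text says only that derivative query-words
# are accorded lesser importance than original query-words.
DERIVATIVE_DISCOUNT = 0.5


def word_importance(word, documents):
    """Smoothed inverse document frequency (an assumption, not the source's formula).

    documents is a list of sets of words, one set per candidate document.
    """
    containing = sum(1 for doc in documents if word in doc)
    return math.log((1 + len(documents)) / (1 + containing)) + 1.0


def weight_query_words(original, derivatives, documents):
    """Map each query-word to a weight.

    original:    list of query-words supplied by the user.
    derivatives: dict mapping each derivative query-word to the original
                 query-word it is associated with.
    """
    weights = {w: word_importance(w, documents) for w in original}
    for derived, parent in derivatives.items():
        if derived not in weights and parent in weights:
            # A derivative word inherits a discounted share of its parent's weight.
            weights[derived] = DERIVATIVE_DISCOUNT * weights[parent]
    return weights


docs = [
    {"search", "engine", "ranking"},
    {"search", "query", "boolean"},
    {"search", "cluster", "ranking"},
]
weights = weight_query_words(["search", "cluster"], {"clusters": "cluster"}, docs)
```

In this toy database "search" appears in every document and so receives the lowest weight, while "cluster" appears in only one and receives a higher weight. Tying the derivative word's weight to its parent is one design choice that guarantees it never outranks the original word it came from.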
A candidate document that has clusters of query-words is intuitively of more relevance to a user-query than is a candidate document with isolated occurrences of query-words. The former is likely to contain a coherent discussion of the subject matter of the user-query whereas the latter may refer to the subject matter of the user-query only tangentially. In some cases, an isolated occurrence of a query-word may be no more than a typographical error.
The method of the invention exploits query-word clustering to identify candidate documents similar, or relevant, to a user-query: it evaluates the similarity of a candidate document to the user-query on the basis of the distribution, or clustering, of query-words within that candidate document. In a preferred embodiment, the step of evaluating this measure of document similarity, referred to as a “document conductance,” includes the step of determining the concentration, or distribution, of query-words in the candidate document. The existence of regions of high concentration, or clustering, of query-words in a document indicates that the document is similar to the query. Such a candidate document is therefore assigned a document conductance indicative of greater similarity to the user-query than a candidate document having fewer such query-word clusters.
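The clustering intuition can be illustrated with a simple stand-in measure. The text available here does not define how document conductance is computed, so the sketch below substitutes a peak sliding-window density of query-words; the names `window_densities` and `cluster_score` and the window size of 10 words are assumptions made for illustration only.

```python
def window_densities(tokens, query_words, window=10):
    """Fraction of query-words in each sliding window of `window` tokens."""
    hits = [1 if t in query_words else 0 for t in tokens]
    if len(hits) <= window:
        # Short (or empty) document: treat the whole text as one window.
        return [sum(hits) / max(len(hits), 1)]
    current = sum(hits[:window])
    densities = [current / window]
    for i in range(window, len(hits)):
        # Slide the window one token to the right.
        current += hits[i] - hits[i - window]
        densities.append(current / window)
    return densities


def cluster_score(tokens, query_words, window=10):
    """Peak local density of query-words; a stand-in for a clustering measure."""
    return max(window_densities(tokens, set(query_words), window))


query = ["search", "engine", "ranking"]
filler = ["x"] * 20

# Four query-word occurrences packed together.
clustered = filler + ["search", "engine", "ranking", "search"] + filler

# The same four occurrences spread 11 positions apart.
scattered = []
for word in ["search", "engine", "ranking", "search"]:
    scattered += [word] + ["x"] * 10

clustered_score = cluster_score(clustered, query)
scattered_score = cluster_score(scattered, query)
```

Both documents contain the same four query-word occurrences, but the clustered document scores higher because all four fall inside a single window, whereas no window of the scattered document contains more than one.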
Having evaluated the similarity of a large number of candidate documents to the user-query, the method of the invention now proceeds with an evaluation of the distribution, or clustering, of the query-words in the individual sentences that make up the candidate document. The similarity of a particular sentence to the user-query depends upon the concentration of query-words in a particular sentence.
In one preferred embodiment, the similarity of a particular sentence is measured by a quantity that is responsive to, or depends upon, the ratio of the overall concentration of the query-word in the plurality of candidate documents to the concentration of the query-word in the sentence. Where there are several query-words, this quantity, which is referred to as the “position-independent sentence similarity,” is summed over all query-words occurring in the particular sentence.
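One way to make this quantity concrete is sketched below. The text says only that the quantity depends on the ratio of the overall concentration to the concentration in the sentence; the choice of the negative logarithm of that ratio, so that a word far denser in the sentence than in the database contributes more, is an assumption, as are the name `sentence_similarity` and the sample concentrations.

```python
import math


def sentence_similarity(sentence_tokens, query_words, overall_concentration):
    """Sum, over query-words present in the sentence, of -log(ratio), where
    ratio = overall concentration / concentration in the sentence.

    The -log form is an assumed choice; the text states only that the
    quantity depends on this ratio.
    """
    n = len(sentence_tokens)
    if n == 0:
        return 0.0
    score = 0.0
    for word in set(query_words):
        in_sentence = sentence_tokens.count(word) / n
        overall = overall_concentration.get(word, 0.0)
        if in_sentence == 0 or overall == 0:
            # Skip words absent from the sentence or with no known
            # overall concentration (the ratio is undefined or zero).
            continue
        ratio = overall / in_sentence
        score += -math.log(ratio)
    return score


# Hypothetical overall concentrations across the candidate documents.
overall = {"search": 0.01, "ranking": 0.002}
query = ["search", "ranking"]

dense = "the search engine ranking improves search".split()
sparse = "the weather today is mild and the search continues slowly".split()

dense_score = sentence_similarity(dense, query, overall)
sparse_score = sentence_similarity(sparse, query, overall)
```

The first sentence contains both query-words at a high concentration and therefore scores higher than the second, which mentions one query-word once in ten words.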
The location, within a do
Foley Hoag & Eliot LLP
KCSL, Inc.
Von Buhr Maria N.