Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-12-12
2003-07-15
Mizrahi, Diane D. (Department: 2175)
C704S009000
active
06594658
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to a method and apparatus for generating responses to queries to a document retrieval system. When a large corpus (database) of documents is searched for relevant terms (query terms), it is desirable to find small relevant passages of text (called “hits” or “hit passages”) and rank them according to an estimate of the degree to which they will provide the information sought.
If the document database is very large, the number of hit passages generated may be far too high to be helpful to the user. Mechanisms are needed to minimize the number of hit passages that a user must examine before he or she either has found the desired information or can reasonably conclude that the information sought is not in the collection of texts.
This type of specific, “fine-grained” information access is becoming increasingly important for on-line information systems and is not well served by traditional document retrieval techniques. The problem is exacerbated with the use of small queries (of only a few words), which tend to generate larger numbers of retrieved documents.
When both the query and the size of the target (hit) passage are small, one of the challenges in current systems is that of dealing effectively with the paraphrase variations that occur between the description of the information sought and the content of the text passages that may constitute suitable answers. Literal search engines will not return paraphrases, and therefore may miss important and relevant information. Search engines that allow paraphrases may generate too many responses, often without an adequate hierarchical ranking, making the query response of minimal usefulness.
Thus, another challenge which is not currently well met is the effective ranking of the resulting hit passages. A high-quality ranking of matching document locations in response to queries is needed to enhance efficient information access.
Classical information retrieval (also called “document retrieval”) measures a query against a collection of documents and returns a set of “retrieved” documents. A useful variant (called “relevance ranking”) ranks the retrieved documents in order of estimated relevance to the query, usually by some function of the number of occurrences of the query terms in the document and the number of occurrences of those same terms in the collection as a whole.
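As an illustration of the kind of relevance ranking described above, the following is a minimal sketch that scores each document from the number of occurrences of the query terms in the document and in the collection as a whole. The function name rank_documents, the whitespace tokenization, and the logarithmic weighting are assumptions chosen for illustration; the source does not specify a particular scoring function.

```python
import math
from collections import Counter


def rank_documents(query_terms, documents):
    """Return document indices ordered by a simple frequency-based score.

    Illustrative only: the weighting formula is an assumption, not a
    formula taken from the source text.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # Occurrences of every term across the whole collection.
    collection_counts = Counter(tok for tokens in tokenized for tok in tokens)
    total_tokens = sum(collection_counts.values())

    scored = []
    for doc_id, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term]:
                # Terms that are rare in the collection contribute more.
                rarity = math.log(1 + total_tokens / collection_counts[term])
                score += counts[term] * rarity
        if score > 0:
            scored.append((score, doc_id))

    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

Note that a ranking of this kind returns whole documents only; it gives no indication of where within a document the query terms occur.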
Document retrieval techniques do not, however, attempt to identify specific positions or passages within the retrieved documents where the desired information is likely to be found. Thus, when a retrieved document is sufficiently large and the information sought is specific, a substantial residual task remains for the information seeker; it is still necessary to scan the retrieved document to see where the information sought might be found, if indeed the desired information is actually present in the document. A mechanism is needed to address this shortcoming.
In most previous passage retrieval procedures, a passage granularity is chosen at indexing time and units of that size are indexed. Either these units are then retrieved as if they were small documents, or individual sentences are retrieved and assembled to produce passages. See Salton et al., “Approaches to Passage Retrieval in Full Text Information Systems,” Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 93) (incorporated herein by reference), ACM Press, 1993, pp 49-58; Callan, J. P., “Passage-Level Evidence in Document Retrieval,” Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR 94) (also incorporated herein by reference), Springer-Verlag, 1994, pp 302-310; and Wilkinson, R., “Effective Retrieval of Structured Documents,” (also in Proceedings of the Seventeenth, etc., at pp 311-317). It would be useful to have a system that dynamically sized passages for retrieval based upon the degree to which the retrieved passage matches the query phrase.
Recently, a different approach has been proposed, based upon hidden Markov models and capable of dynamically selecting a passage. See Mittendorf et al., “Document and Passage Retrieval Based on Hidden Markov Models,” (Proceedings of the Seventeenth, etc., pp 318-327). However, this approach does not deal with the entire vocabulary of the text material, and requires reducing the document descriptions to clusters at indexing time. It would be preferable to have a system that both encompasses the entire text base and does not require such clustering.
SUMMARY OF THE INVENTION
The present invention is directed to a method and apparatus for generating responses to queries with more efficient and useful location of specific, relevant information passages within a text. The method locates compact regions (“hit passages”) within a text that match a query to some measurable degree, such as by including terms that match terms in the query to some extent (“(entailing) term hits”), and ranks them by the measured degree of match. The ranking procedure, referred to herein as “relaxation ranking”, ranks hit passages based upon the extent to which the requirement of an exact match with the query must be relaxed in order to obtain a correspondence between the submitted query and the retrieved hit passage. The relaxation mechanism takes into account various predefined “dimensions” (measures of closeness of matches), including: word order; word adjacency; inflected or derived forms of the query terms; and semantic or inferential distance of the located terms from the query terms.
The system of the invention locates occurrences of terms (words or phrases) in the texts (document database) that are semantically similar to terms in the query, so as to identify compact regions of the texts that contain all or most of the query terms, or terms similar to them. These compact regions are ranked by a combination of: their compactness; the semantic similarity of the located phrases to the query terms; the number of query terms actually found (i.e. matched with some located term from the texts); and the relative order of occurrence of the located terms compared with the order of the corresponding query terms.
The identified compact regions are called “hit passages,” and their ranking is weighted to a substantial extent based upon the physical distance separating the matching terms (compared with the distance between the corresponding terms in the query), as well as the “similarity” distance between the terms in the hit and the corresponding terms in the query.
The foregoing criteria are weighted and the located passages are ranked based upon scores generated by combining all the weights according to a predetermined procedure. “Windows” into the documents (variably sized regions around the located “hit passages”) are presented to the user in an order according to the resulting ranking.
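The combination of criteria described above can be sketched as a penalty that grows the further a passage departs from an exact match with the query. The following is a minimal sketch under stated assumptions, not the predetermined procedure of the invention: the weight values, the names TermHit, find_term_hits, candidate_passages, relaxation_penalty and rank_passages, the similar lookup table of semantic distances, and the simple method of forming candidate passages are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical weights; the source states only that the criteria are weighted.
WEIGHTS = {"gap": 1.0, "order": 2.0, "missing": 10.0, "similarity": 3.0}


@dataclass(frozen=True)
class TermHit:
    position: int     # token offset of the located term in the text
    query_index: int  # index of the query term it corresponds to
    distance: float   # 0.0 for an exact match, larger for looser matches


def find_term_hits(tokens, query, similar):
    """Locate exact matches and terms listed as similar to a query term."""
    hits = []
    for pos, token in enumerate(tokens):
        for qi, term in enumerate(query):
            if token == term:
                hits.append(TermHit(pos, qi, 0.0))
            elif token in similar.get(term, {}):
                hits.append(TermHit(pos, qi, similar[term][token]))
    return hits


def candidate_passages(hits, max_span):
    """From each hit, take the first hit for each other query term nearby."""
    passages = set()
    for i, start in enumerate(hits):
        chosen = {start.query_index: start}
        for hit in hits[i + 1:]:
            if hit.position - start.position >= max_span:
                break
            chosen.setdefault(hit.query_index, hit)
        passages.add(tuple(sorted(chosen.values(), key=lambda h: h.position)))
    return passages


def relaxation_penalty(passage, query_len):
    """Combine compactness, order, coverage and similarity into one score."""
    positions = [h.position for h in passage]
    indices = [h.query_index for h in passage]
    # Distance spanned in the text beyond the distance spanned in the query.
    gap = max(0, (positions[-1] - positions[0]) - (max(indices) - min(indices)))
    # Pairs of located terms that appear in the reverse of the query order.
    inversions = sum(
        1
        for a in range(len(indices))
        for b in range(a + 1, len(indices))
        if indices[a] > indices[b]
    )
    missing = query_len - len(set(indices))
    similarity = sum(h.distance for h in passage)
    return (WEIGHTS["gap"] * gap + WEIGHTS["order"] * inversions
            + WEIGHTS["missing"] * missing + WEIGHTS["similarity"] * similarity)


def rank_passages(tokens, query, similar, max_span=10):
    """Return (penalty, first_position, last_position), best passage first."""
    hits = find_term_hits(tokens, query, similar)
    scored = [
        (relaxation_penalty(p, len(query)), p[0].position, p[-1].position)
        for p in candidate_passages(hits, max_span)
    ]
    return sorted(scored)
```

In this sketch a penalty of zero corresponds to an exact, adjacent, in-order match, and each relaxation (intervening words, reversed order, a missing term, or a merely similar term) raises the penalty. A window for display could then be taken as a region of text around the first and last positions of each ranked passage.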
A significant advantage of relaxation ranking is that the system automatically generates and ranks hits that in a traditional document retrieval system would have to be found by a sequence of searches using different combinations of retrieval operators. Thus, the number of times the information seeker is unsatisfied by a result, and therefore needs to reformulate the query, is significantly reduced, and the amount of effort required to formulate the query is also significantly reduced.
Another advantage is that the rankings produced by the current system are for the most part insensitive to the size or composition of the document collection and are meaningful across a group of collections, so that term hit lists produced by searching different collections can be merged, and the ranking scores from the different collections will be commensurate. This makes it possible to parallelize and distribute the indexing and retrieval process.
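Because the ranking scores from different collections are commensurate, hit lists produced independently can be combined by a simple ordered merge. The following is a minimal sketch, assuming each collection returns its hits already sorted by an ascending penalty (lower meaning a closer match); the function name merge_ranked_hits and the tuple layout (penalty, collection name, document identifier) are hypothetical.

```python
import heapq
from itertools import islice


def merge_ranked_hits(*ranked_lists, limit=None):
    """Merge per-collection hit lists that are each sorted by penalty.

    Each hit is assumed to be a tuple whose first element is its penalty.
    No rescoring is needed because the penalties are directly comparable.
    """
    merged = heapq.merge(*ranked_lists, key=lambda hit: hit[0])
    return list(merged) if limit is None else list(islice(merged, limit))
```

For example, hit lists computed on separate machines for two collections could be passed to merge_ranked_hits to produce a single ranked list for presentation to the user.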
In addition, the system of the invention is more successful than traditional systems at locating specific, relevant passages within
Finnegan Henderson Farabow Garrett & Dunner L.L.P.
Mizrahi Diane D.
Spiegel Michael
Sun Microsystems Inc.