Method and device for document retrieval

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06546383

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to a device, a method, and a memory medium having a program embodied therein for document retrieval.
2. Description of the Related Art
Document-retrieval techniques retrieve, from a document database, documents that include a query character string. One such technique is the likely-relevance retrieval scheme, which retrieves documents that include character strings resembling a query character string.
The likely-relevance retrieval technique is disclosed, for example, in the Japanese Patent Laid-open Application No. 11-85776. This technique calculates ranking scores of partial character strings that are part of a query character string based on their frequency of occurrence, and searches for the query character string in the document by using the obtained ranking scores.
Another example of the likely-relevance retrieval technique is found in “Development and Evaluation of Full-Document-Based Retrieval System ‘Retrieval Express’,” Proceedings of the Third Annual Meeting of the Association for Natural Language Processing, pp. 361-364, March, 1997. This technique obtains the frequency of occurrence of a query character string in a document by obtaining all positions of such occurrences in the document based on occurrences of partial character strings, and calculates a ranking score of the query character string with respect to the document.
The technique disclosed in the above patent laid-open application, however, merely searches for a query character string in a single document, and cannot be used to retrieve a document including a query character string from a plurality of documents.
Further, the longer the query character string, the larger the number of partial character strings that are to be taken into account in the search. Also, the longer the query character string, the larger the number of document segments that are to be processed for calculation of ranking scores. This results in an increase in retrieval time. For example, when a query character string is “ABCDEF” (each capital letter represents a single Japanese character for the sake of explanation), and partial character strings of 2 characters each are used as a unit of processing, one can extract five partial character strings, i.e., “AB”, “BC”, “CD”, “DE”, and “EF”. In general, when a query character string consists of m characters, and n characters constitute a unit of processing, one can extract (m−n+1) partial character strings. Since the ranking score needs to be calculated at every position where at least one of the extracted partial character strings appears, the number of positions that require computation increases as the number of partial character strings increases.
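The extraction of partial character strings described above can be sketched as follows. This is only an illustration of the (m−n+1) count; the function name `partial_strings` is hypothetical and does not appear in the cited references.

```python
def partial_strings(query: str, n: int = 2) -> list[str]:
    """Return every contiguous n-character substring of query, in order.

    Hypothetical helper for illustration: a query of m characters
    yields (m - n + 1) partial character strings.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [query[i:i + n] for i in range(len(query) - n + 1)]


parts = partial_strings("ABCDEF", 2)
print(parts)       # ['AB', 'BC', 'CD', 'DE', 'EF']
print(len(parts))  # 5, i.e. m - n + 1 with m = 6 and n = 2
```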
A ranking score of a partial character string in the document is calculated based on the frequency of occurrence of the partial character string in the document. Some occurrences of the partial character strings in the document may have no bearing on the query character string, yet such occurrences are counted toward the ranking scores. This reduces the accuracy of the search. For example, the query character string “ABCDEF” may appear only once in a given document, and another character string “WXYZEF” that has a totally different meaning may appear many times in this document. In such a case, the partial character string “EF” appears as many times as the number of occurrences of “ABCDEF” plus the number of occurrences of “WXYZEF”. As a result, the ranking score of the partial character string “EF” ends up being inappropriately high despite the rare occurrence of the query character string, resulting in an inappropriately high ranking score for the query character string.
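The over-counting in this example can be reproduced with a short sketch. The document text and the helper `count_occurrences` are assumptions made for illustration only; the related art does not specify how occurrences are counted.

```python
def count_occurrences(text: str, part: str) -> int:
    """Count occurrences of part in text, including overlapping ones."""
    count = 0
    start = text.find(part)
    while start != -1:
        count += 1
        start = text.find(part, start + 1)
    return count


# Hypothetical document: the query appears once, an unrelated
# character string ending in "EF" appears four times.
document = "ABCDEF " + "WXYZEF " * 4

print(count_occurrences(document, "ABCDEF"))  # 1
print(count_occurrences(document, "EF"))      # 5
```

The count for “EF” is five even though the query character string itself occurs once, which is the inflation described above.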
Another problem is that a search cannot be conducted if a query character string is shorter than the unit of processing, because such a query character string cannot be divided into partial character strings having the length of the unit of processing. For example, if the query character string is “B”, and two characters constitute a unit of processing, no partial character string can be extracted and the search cannot be performed.
The technique disclosed in “Development and Evaluation of Full-Document-Based Retrieval System ‘Retrieval Express’,” Proceedings of the Third Annual Meeting of the Association for Natural Language Processing, pp. 361-364, March, 1997 has the same problem as the technique disclosed in the above patent laid-open application. That is, the amount of computation for counting occurrences of a query character string in a document increases as the length of the query character string increases, which lengthens the processing time for document retrieval. The larger the number of occurrences of a query character string, the more pronounced the increase in processing time.
Accordingly, there is a need for a retrieval scheme that can retrieve a document easily at high speed.
There is another need for a retrieval scheme in which the computation load of selecting a document and calculating ranking scores can be reduced, thereby achieving high-speed processing.
There is another need for a retrieval scheme that is free from the influence of other character strings having no relevance to a query character string, thereby improving retrieval accuracy.
There is another need for a retrieval scheme in which the computation load of obtaining positions of occurrences of a query character string can be reduced, thereby achieving high-speed document retrieval.
There is another need for a retrieval scheme in which the number of score searches can be reduced, thereby boosting search speed.
There is another need for a retrieval scheme that can retrieve a document even if the length of a query character string is shorter than a unit of processing.
There is another need for a retrieval scheme in which the computation load of calculating ranking scores is reduced, thereby achieving high-speed retrieval.
SUMMARY OF THE INVENTION
It is a general object of the present invention to provide a document-retrieval scheme that substantially obviates one or more of the problems caused by the limitations and disadvantages of the related art.
Features and advantages of the present invention will be set forth in the description which follows, and in part will become apparent from the description and the accompanying drawings, or may be learned by practice of the invention according to the teachings provided in the description. Objects as well as other features and advantages of the present invention will be realized and attained by a method and a device for document retrieval particularly pointed out in the specification in such full, clear, concise, and exact terms as to enable a person having ordinary skill in the art to practice the invention.
To achieve these and other advantages and in accordance with the purpose of the invention, as embodied and broadly described herein, the invention provides a method for document retrieval comprising the steps of dividing a query character string into partial character strings, selecting one or more documents from a plurality of registered documents such that the one or more documents each include all the partial character strings, computing respective scores of the partial character strings for each of the one or more documents, and computing a score of the query character string from the respective scores of the partial character strings for each of the one or more documents.
In the method described above, the one or more documents that include the partial character strings are selected prior to the computation of scores. Because of this screening process, scores need to be computed only for the selected documents, so that a document can be retrieved at high speed from the plurality of registered documents.
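The four steps of the method (dividing, selecting, scoring the partial character strings, and scoring the query character string) can be sketched as follows. This is a minimal sketch under stated assumptions: the text above does not give the scoring formulas, so the score of a partial character string is assumed here to be its occurrence count, and the score of the query character string is assumed to be the mean of those counts. All function names are hypothetical.

```python
def partial_strings(query: str, n: int = 2) -> list[str]:
    """Step 1: divide the query into n-character partial strings."""
    return [query[i:i + n] for i in range(len(query) - n + 1)]


def count_occurrences(text: str, part: str) -> int:
    """Count occurrences of part in text, including overlapping ones."""
    return sum(
        1 for i in range(len(text) - len(part) + 1) if text.startswith(part, i)
    )


def retrieve(query: str, documents: dict[str, str], n: int = 2):
    """Return (document id, query score) pairs, best score first."""
    parts = partial_strings(query, n)
    if not parts:
        return []
    results = []
    for doc_id, text in documents.items():
        # Step 2: screening. Keep only documents that include
        # all of the partial character strings.
        if not all(p in text for p in parts):
            continue
        # Step 3: score each partial string (ASSUMPTION: occurrence count).
        part_scores = {p: count_occurrences(text, p) for p in parts}
        # Step 4: combine into a query score (ASSUMPTION: mean of the counts).
        query_score = sum(part_scores.values()) / len(part_scores)
        results.append((doc_id, query_score))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


# Hypothetical registered documents.
docs = {
    "d1": "xxABCDEFxx",
    "d2": "WXYZEF WXYZEF",
    "d3": "ABCDEF and ABCDEF",
}
print(retrieve("ABCDEF", docs))  # [('d3', 2.0), ('d1', 1.0)]
```

In this sketch the document "d2" is screened out before any score is computed because it lacks “AB”, so the frequent but unrelated “EF” in it never contributes to a ranking score.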
According to one aspect of the present invention, the method as described above is such that the step of dividing divides the query
