Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1999-12-08
2002-10-22
Rones, Charles L. (Department: 2775)
C707S793000, C707S793000
active
06470334
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a document retrieval apparatus that, using a relatively small index, rapidly retrieves documents containing plural words in an order specified by the user.
2. Description of the Prior Art
Document retrieval methods are known for retrieving required documents from a large collection of documents. One well-known method registers the words included in these documents in an index prior to retrieval and uses this index to speed up the retrieval task.
One example of such a method is the retrieval of words from plural documents. Prior to retrieval, an index is prepared in addition to the documents; it registers every word that appears in the documents together with pointers to the documents in which each word is contained. At retrieval time, when a word is input as a retrieval condition, the pointers to the documents containing the input word are retrieved from the index and the corresponding documents are output.
In this method, however, all documents containing the word specified as a retrieval condition are retrieved, so the retrieval result includes many documents that were not intended to be retrieved. Furthermore, narrowing the result by querying for documents that match plural words in the retrieval condition does not eliminate this problem, since the relationship between the query keywords cannot be specified.
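The word-to-document index described above can be sketched as follows. This is a minimal illustration, not the prior-art implementation: the function names `build_index` and `retrieve`, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: words are separated by whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def retrieve(index, word):
    """Return the ids of all documents containing the given word."""
    return index.get(word.lower(), [])


# Hypothetical sample documents, keyed by document id.
documents = {
    1: "the banquet site is closed",
    2: "the site map",
    3: "outside the banquet site",
}
index = build_index(documents)
print(retrieve(index, "banquet"))
```

The index is built once, before any query, so each lookup is a single dictionary access rather than a scan over every document.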
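The limitation of a plural-word query can be illustrated with a short sketch. The helper `and_query` and the two sample documents are hypothetical; the point is only that intersecting per-word document lists ignores word order.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


def and_query(index, words):
    """Return ids of documents containing every query word, in any order."""
    sets = [set(index.get(word, ())) for word in words]
    return sorted(set.intersection(*sets)) if sets else []


# Two hypothetical documents with the same words in a different order.
documents = {1: "dog bites man", 2: "man bites dog"}
index = build_index(documents)
print(and_query(index, ["dog", "man"]))
```

Both documents are returned for the query, even though a user interested in "dog ... man" in that order would want only the first.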
Japanese Published Unexamined Patent Application No. Hei 08-249346 discloses a document retrieval apparatus using an adjoining index, which indicates the order of keywords. With the document retrieval apparatus disclosed therein, a retrieval that considers the relationship between two keywords input as a query condition may be performed.
The above apparatus generally uses morpheme analysis technology, which has been developed in the field of natural language processing, in order to extract the words to be registered in an index from the documents to be processed. With current morpheme analysis technology, a document cannot always be broken down into word strings in an accurate and unambiguous manner. For example, when performing morpheme analysis on the text “HIRO EN KAIJO GAI (outside banquet site)”, there will be more than one result, such as “HIRO | EN | KAIJO | GAI”, “HIROEN | KAIJO | GAI”, “HIRO | ENKAI | JOGAI”, and “HIRO | ENKAIJO | GAI”, where “|” designates a break between two words. In such analysis, the same text string may be split at different breakpoints.
In the index used in the document retrieval apparatus described above, only one adjoining word can be recorded for each word, so an index structure corresponding to each of the results of morpheme analysis must be provided, and the index becomes enormously large.
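The ambiguity in the example can be reproduced with a small sketch that enumerates every way of splitting the unspaced string into dictionary words. This is not a morpheme analyzer: the function `segmentations` and the word list `LEXICON` are assumptions chosen so that the example text from above can be split.

```python
def segmentations(text, lexicon):
    """Return every way to split text into consecutive words from lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Recurse on the remainder and prepend the matched word.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical dictionary containing the candidate words from the example.
LEXICON = {"HIRO", "EN", "KAIJO", "GAI", "HIROEN", "ENKAI", "JOGAI", "ENKAIJO"}

results = segmentations("HIROENKAIJOGAI", LEXICON)
for words in results:
    print(" | ".join(words))
```

With this dictionary the sketch finds the same four splits listed in the text, which is why an index that can hold only one word sequence per text needs one structure per split.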
Japanese Published Unexamined Patent Application No. Hei 08-249354 discloses a document retrieval apparatus that stores the location of words in a document into the index. In accordance with this document retrieval apparatus, the resulting words may be registered together into an index, even if plural breakpoints are obtained for the same word or different word classes are presumed for this same string.
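An index that stores word locations can be sketched as follows. This is a simplified illustration under stated assumptions, not the layout of the cited application: `positional_index`, `follows`, and `LEXICON` are hypothetical names, positions are character offsets, and every dictionary word found at any offset is registered.

```python
from collections import defaultdict


def positional_index(text, lexicon):
    """Record the start offset of every lexicon word found in text."""
    index = defaultdict(list)
    for start in range(len(text)):
        for word in lexicon:
            if text.startswith(word, start):
                index[word].append(start)
    return dict(index)


def follows(index, first, second):
    """True if some occurrence of second starts where first ends."""
    ends = {start + len(first) for start in index.get(first, [])}
    return any(start in ends for start in index.get(second, []))


# Hypothetical dictionary covering all candidate splits of the example text.
LEXICON = {"HIRO", "EN", "KAIJO", "GAI", "HIROEN", "ENKAI", "JOGAI", "ENKAIJO"}
index = positional_index("HIROENKAIJOGAI", LEXICON)
print(sorted(index.items()))
```

Because candidate words from every split are registered together, a single index serves all analyses, but the number of entries grows with each alternative breakpoint.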
In this apparatus, there also arises the problem that the number of words to be registered in the index is so enormous that the size of the index cannot be ignored.
The situation described above may arise in any natural language, but it is particularly noticeable in Japanese, in which the breakpoints between words are not clearly marked, in contrast to Indo-European languages.
As can be seen from the above description, a prior-art document retrieval apparatus that performs full text retrieval using an index requires a large memory capacity for loading a huge index as well as a long time for searching the index, so overall retrieval performance may be degraded. This problem may be significant, for example, in Japanese full text retrieval, since breakpoints between words are not clear in Japanese and the number of words to be registered in the index will therefore be larger than for Indo-European languages. If the index is instead arranged on a character basis rather than a word basis in order to avoid the problem of word breakpoints, the number of entries to be registered will be so large that the index size is inflated.
SUMMARY OF THE INVENTION
The present invention has been made in view of the above circumstances and provides a document retrieval apparatus that performs full text retrieval of documents using a relatively small index, not only for documents in Indo-European languages but also for Japanese documents, in which breakpoints between words are not clearly marked.
The present invention also provides a document retrieval apparatus that performs retrieval search, without registered data on the full text of documents, by only using the index by considering the relationship of words, and that outputs the reconstructed full text of documents based on the retrieval result.
The present invention further provides a document retrieval apparatus that stores word class information in a small index and performs fast retrieval by comparing the word class information. In other words, the present invention provides a document retrieval apparatus that performs fast retrieval using a relatively small index that stores the results of morpheme analysis on the documents.
The document retrieval apparatus in accordance with the present invention has a word storing part that eliminates the redundancy of every word included in a document, and stores these words with additional information on adjoining words next to the word in the document, and a retrieval search part that determines, based on the retrieval criteria including plural words and the disposition of words, the correspondence of the retrieval criteria to plural words stored in the word storing part, in order to check to see whether or not a document matches with the retrieval criteria, i.e., whether or not a document containing the contents corresponding to the input criteria may be retrieved by the retrieval search.
More specifically, the word storing part constituting the index stores each word so that it is identified by its address in the word storing part, stores the adjoining words that immediately follow the word, and additionally stores, as information on those words and in a predetermined order, the addresses of the stored words adjacent to the adjoining words. The word order in a document is thus represented as a chain of linked addresses, which eliminates redundant words and yields an index of relatively small size.
The document retrieval apparatus in accordance with the present invention may be carried out in a variety of modes. As will be described in the following embodiments, the document retrieval apparatus in accordance with the present invention may be achieved by constituting the index in the word storing part in trie form, by constituting the index so that it is shared by every word in plural documents, by constituting the index so as to store two synonymous words of different forms, such as an original form and a conjugated form, connected by their addresses, or by storing words in the index tagged with word class information so that the document retrieval part can determine matching against retrieval criteria that include the word class information.
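The idea of storing each word once and expressing word order through links between addresses can be sketched as follows. This is a deliberately simplified illustration, not the claimed structure: the class `WordStore`, its method names, and the use of list positions as addresses are assumptions, and only direct successor addresses are kept for each word.

```python
class WordStore:
    """Stores each distinct word once and links it to the words that follow it."""

    def __init__(self):
        self.address_of = {}      # word -> address (position in self.words)
        self.words = []           # address -> word
        self.next_addresses = []  # address -> addresses of following words

    def _intern(self, word):
        """Return the address of word, storing it on first sight only."""
        if word not in self.address_of:
            self.address_of[word] = len(self.words)
            self.words.append(word)
            self.next_addresses.append([])
        return self.address_of[word]

    def add_document(self, words):
        """Register a word sequence, recording each adjacency once."""
        addresses = [self._intern(word) for word in words]
        for current, following in zip(addresses, addresses[1:]):
            if following not in self.next_addresses[current]:
                self.next_addresses[current].append(following)

    def matches_in_order(self, query):
        """True if every query word is stored and directly followed by the next."""
        try:
            addresses = [self.address_of[word] for word in query]
        except KeyError:
            return False
        return all(b in self.next_addresses[a]
                   for a, b in zip(addresses, addresses[1:]))


store = WordStore()
store.add_document("the banquet site is outside the city".split())
print(len(store.words), store.matches_in_order(["banquet", "site"]))
```

In this sketch the repeated word "the" occupies a single entry with two successor links. Because only pairwise links are kept, a query of three or more words could match adjacencies taken from different places; the additional address information described in the text is what a fuller structure would use to rule that out.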
In the document retrieval apparatus in accordance with the present invention, a document output part outputs plural words determined to be matched by the document retrieval part in the order of tracing the addresses in the index so as to restore the documents matching the criteria.
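Restoring a document by tracing addresses can be sketched as follows. The encoding is hypothetical and chosen only for illustration: each distinct word is stored once, and each occurrence contributes one link of the form `(next_address, next_slot)` that says which entry, and which link within that entry, to follow next. The names `build`, `restore`, and `END` are assumptions.

```python
END = None  # marks the link that follows the last word of a document


def build(words):
    """Store each distinct word once and chain occurrences through links."""
    address_of = {}
    entries = []  # address -> {"word": str, "links": [link, ...]}
    link = END
    # Walk backwards so each occurrence can point at the one after it.
    for word in reversed(words):
        if word not in address_of:
            address_of[word] = len(entries)
            entries.append({"word": word, "links": []})
        address = address_of[word]
        entries[address]["links"].append(link)
        link = (address, len(entries[address]["links"]) - 1)
    return entries, link  # link now points at the first word


def restore(entries, head):
    """Rebuild the word sequence by following links from head to END."""
    words = []
    link = head
    while link is not END:
        address, slot = link
        words.append(entries[address]["word"])
        link = entries[address]["links"][slot]
    return words


original = "the banquet site is outside the city".split()
entries, head = build(original)
print(" ".join(restore(entries, head)))
```

The seven-word sequence is restored from six stored words, so the original text does not need to be kept alongside the index in this sketch.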
In accordance with the present invention, full text of the documents retrieved may b
Oliff & Berridge, PLC
Rones Charles L.