Data processing: database and file management or data structures – Database design – Data structure types
Utility Patent
1998-04-14
2001-01-02
Hong, Stephen S. (Department: 2776)
Data processing: database and file management or data structures
Database design
Data structure types
C704S007000, C704S010000
Utility Patent
active
06169999
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a dictionary and index creating system for creating a machine-retrievable dictionary and index available for document managing systems, document editing systems and others which work to manage, edit and retrieve document information accumulated as electronic data, through the use of computers.
2. Description of the Related Art
Recently, owing to the widespread use of word processors, personal computers and large-capacity and low-cost storage media such as CD-ROM and the development of networks represented by Ethernet, the full-text (whole-passage) databases in which the character information in all or most of documents (texts) are expressed as character code strings and accumulated have come into practical and widespread use.
In the case of the prior document databases, the common way for the document retrieval (text search) involves the keyword retrieval making use of keywords prepared for each of documents. However, this way has caused problems such as difficulty in coping with the increase in the accumulated documents because of the troublesome keyword preparation work, the triteness of the keywords with the passage of time and the lack of relevant documents in retrieval result due to the difference in interpretations between the keyword preparing person and the retrieval conducting person. For these reasons, lately, interest has been shown toward the so-called full-text retrieval which does not require the keyword preparation.
The full-text retrieval is of the type performing the collation and matching in character string between the retrieval condition based upon a character string given from the user and a character string constituting the accumulated documents to output a document(s) satisfying the retrieval condition, whereupon there is no need to prepare keywords in advance. So far, various methods have been proposed as means to realize this full-text retrieval. The detailed description of the overall arrangement thereof has been disclosed by, for example, William B. Frakes and Ricardo Baeza-Yates (eds.), “Information Retrieval-Data Structure & Algorithms, Prentice Hall (1992), which is roughly classified into the following three methods from a viewpoint of the index preparation prior to the retrieval to the documents undergoing retrieval or being the target of retrieval (hereinafter referred to as retrieval documents).
(1) Full-text Scan Method
(2) Signature File Method
(3) Transposition File Method
Of these methods, the full-text scan method involves making the matching or collation between the retrieval condition character string and the retrieval documents whenever a question takes place to bring the retrieval result, so that there is no need to previously prepare an index for the retrieval, thus saving the storage capacity and allowing the retrieval under complicated requirements. On the other hand, the retrieval speed is relatively slow as compared with the other methods, and from this viewpoint, the full-text scan method is not fit for a large amount of retrieval.
Furthermore, the signature file method (2) is such that a document file, so-called signature, is constructed in advance as an index for retrieval and this signature file is first retrieved to cut back the quantity of documents undergoing the full-text scanning. In comparison with the above-mentioned method (1), a high-speed retrieval becomes feasible, whereas in general this requires constructing and retaining the signature file constituting several tens %of the capacity of the retrieval documents.
Still further, the transposition file method (3) involves previously constructing as a retrieval index a document in which characters/words
-character succession (n-gram) occur or appear or a transposition file recording the document positions therein so that the retrieval is made through the use of only this transposition file (that is, without the use of the retrieval documents). This method permits an extremely high speed retrieval as compared with the methods (1) and (2). However, in the case that the retrieval documents are written in Japanese, because the boundaries between the words are not clear unlike the western languages, this method requires several times the capacity of the retrieval documents when conducting the retrieval on the basis of the n-character succession.
Since each of the above-mentioned three methods has an advantage and a disadvantage, it is necessary to use them properly to match each of the document retrieval requests. For instance, for the retrieval of an extremely large volume of document including an extremely large number of characters, such as the whole text of an Unexamined Patent Publication, the high-speed retrieval is essential, and in this case, the above-mentioned method (3) is most suitable.
In order to apply the method (3) to a retrieval document based on the no-space languages (there is no space between words) such as Japanese and Chinese, a method of constructing a transposition file of one- or two-character succession to realize a high-speed document retrieval system has been proposed in “A Fast Full-Text Search Method for Japanese Text Database” written by Chuichi Kikuchi, Electronic Information Communication Society Paper Magazine, Vol. J75-D-I, No. 9, pp.836-846 (1992). In addition, a method of constructing a transposition file of one to three-character succession for the preparation of an index when necessary has been proposed in “Development of n-gram Type Large-Scale Full-text Retrieval Method” written by Sugaya, Kawaguchi, Hatayama, Tada, Kato, Information Processing Society of Japan 53rd National Conference Pre-Draft Collection, 3-235, (1996).
However, according to the prior methods, the index file drawn up comes to twice the retrieval documents, and if increasing the number of characters organizing the character succession for the purpose of the speed-up, the capacity of the index file further increases, which creates the problem in that difficulty is encountered to realize them in the case that limitation is imposed on the usable capacity of a memory unit. Moreover, in the case of such a retrieval condition character string as “katakana (characters inherent in Japanese)” with long character strings and many high-frequency character chains, the retrieval data amount in the index file increases, with the result that the retrieval speed reduces.
As one possible way to solve these problems, in the Japanese Unexamined Patent Publication No. 8-249354 there has been disclosed a method in which words are cut out even in the Japanese retrieval documents through the use of a large-scale word dictionary to constitute a transposition file as well as the western languages so that the full-text retrieval is carried out on the basis of an arbitrary retrieval condition character string through the use of the constructed transposition file at high speed. This method will be referred hereinafter to as a prior index retrieval method.
In the prior index retrieval method, a word index storing the occurrence (appearance) positions of character strings respectively matching with words in the retrieval documents and all of only the maximal (longest) index elements of the index elements paired with the words is constructed as a maximal extension index through the use of a word dictionary being a set of a definite number of words (character strings), thereby arranging index information by far smaller than an inverted file of n-character succession (n-gram string) and having a capacity similar to the capacity of the retrieval documents.
In the retrieval, word strings in the dictionary in which each of the characters in a retrieval condition character string is included in at least one of the words is obtained as a cover of the retrieval condition character string, and in terms of each of extension words of each of works including each of words organizing the cover, the set of index elements corresponding to that word are obtained, and of the strings of index element sets corresponding to the
Hamdan Wasseem H.
Hong Stephen S.
Lowe Hauptman Gopstein Gilman & Berner
Matsushita Electric - Industrial Co., Ltd.
LandOfFree
Dictionary and index creating system and document retrieval... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Dictionary and index creating system and document retrieval..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Dictionary and index creating system and document retrieval... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2489166