Apparatus and method for retrieving character string based...

Image analysis – Image transformation or preprocessing – Image storage or retrieval

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C382S151000, C382S181000, C382S200000, C382S224000, C382S225000, C382S229000, C382S292000, C358S403000, C704S010000, C707S793000

Reexamination Certificate

active

06507678

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to both a character string retrieval apparatus and a method for registering a plurality of character strings, such as chinese characters, etc. in an array in advance regarding a character string retrieval, and judging whether or not a given character string is registered.
The present invention also relates to both a character code registration retrieval apparatus and a method regarding a key retrieval technology, in particular, for registering character strings, such as Kanji codes being a target to be retrieved using keys in a double array structure being an one-dimensional array of a data structure.
2. Description of the Related Art
Recently, as computer networks, electronic mail, etc. have become widespread, the amount of electronic documents (digital documents) possessed by individuals has rapidly increased. For example, a lot of people receive and process several hundreds to one thousand electronic mails a day. It is not rare that 1 mega-byte (MB) of document data are stored in a day and several hundred mega-bytes to one giga-byte (GB) in a year.
To handle such a large amount of data, it is necessary to reduce the necessary memory capacity and to speed up the transmission of data by omitting redundancy in data and compressing the data amount. The data compression technology has been made indispensable due to the recent trends described above, and for compressing a variety of data by one method, for example, a universal encoding has been proposed.
However, when document data, such as electronicized Japanese, Chinese, etc. are compressed in units of words, first, it is necessary to judge at high speed whether or not a character string inputted from a document is a word registered in a dictionary in advance. Furthermore, since in these languages there are a lot of words to be registered in a dictionary, the dictionary has to be edited in such a way as a useless memory area may not be generated as much as possible. In a well-known TRIE method, a plurality of words being a key are stored in a TRIE dictionary of a tree structure, and a word included in an input character string is retrieved by collating the character string with each node of the tree structure, character by character.
In the following description, names used in an information theory are used as they are, that is, data in one word unit are called a symbol or character, and an arbitrary number of connected data are called a string or character string. Furthermore, a sequence consisting of several leading symbols and characters in a code string or character string is called a prefix, and a sequence consisting of several ending symbols and characters is called a suffix. For example, the prefixes of a character string abc are &egr; (empty), a, ab and abc, and the suffixes are &egr;, a, ab and abc.
In the compression of language codes it is important to store a string, such as a word, etc. in a data structure with a memory capacity as small as possible, and develop an algorithm to retrieve the string at high speed. In particular, in the case of a dictionary storing words, key aggregates to be registered are known in advance, and the dictionary is often expanded by suitably adding keys later. Therefore, it is also important that keys can be easily added. Such a data structure is called a quasi-static data structure.
Aoe has proposed a double-array as a data structure for pattern-matching a plurality of keys at high speed (Junichi Aoe: “A High-speed Digital Retrieval Algorithm by Double-array”, in Proceedings of Papers D of The Electronics Information and Communications Institute, Vol.J71-D, No.9, pp.1,592-1600, 1988).
FIG. 1A
shows an example of a double-array. This double-array comprises two one-dimensional arrays of BASE and CHECK, and data stored by these arrays corresponds to a TRIE structure shown in FIG.
1
B. The TRIE of
FIG. 1B
indicates the five English words of baby #, bachelor #, badger #, badge # and jar #, and the index of each node corresponds to the subscripts of the arrays of BASE and CHECK shown in
FIG. 1B. A
position where the registration values of BASE and CHECK are both 0, corresponds to a space position where nodes are not yet registered.
This TRIE includes a repeat of the parental relation of nodes shown in
FIG. 1C
, and the index n of a parent node and the index m of a child node correspond to the subscripts of a BASE and a CHECK, respectively. In other words, this parental relation indicates a kind of state transition, and when a character a is inputted in the state of a parent node n, the transition from the state of a parent node n to the state of child node m is made.
When the index of a child node corresponding to the character a following the parent node n is retrieved using a double-array, first, as shown in
FIG. 1D
, a position corresponding to the subscript n on a BASE is referred to and the content d is obtained. This value d indicates a kind of origin shift amount (displacement amount) for the subscript of the CHECK.
Then, the subscript of a position shifted by the internal representation value of the character a, with the subscript d on the CHECK as a start point, is assumed to be m (=d+the internal representation value of character a). If the content of a position corresponding to the subscript m on the CHECK coincides with the index n of the parent node, the character a is stored below the node n, and it is found that the subscript of a corresponding child node is m. At this time, the index m of the child node is expressed as m=g(n,a) using a goto function g specifying a state transition for a key on a TRIE.
Generally speaking, one or more child nodes are following one parent node, and in a normal TRIE structure, the retrieval speed of a child node is reduced according to the number of the sibling nodes following the same parent node. On the other hand, in the double-array TRIE structure, a high-speed retrieval is available regardless of the number of sibling nodes.
However, the conventional character string retrieval described above has the following problems.
When a double-array is used for a Kanji dictionary of Japanese, Chinese, etc., the number of child nodes following one parent node tends to increase compared with an alphabetical dictionary of English, etc. due to the variety of Kanji idioms.
FIG. 1E
shows a case where five Kanji idioms starting with a Kanji “
” (electricity), that is, “
” (voltage), “
” (electricity), “
” (electric train), “
” (computer) and “
” (telephone) are registered in a double-array. In this case, a Kanji code value corresponds to each of the characters following “
”, that is, “
” (pressure), “
” (atmosphere), “
” (train), “
” (brain) and “
” (speech), and a relative positional relation is kept constant on a CHECK according to the internal representation values. On the other hand, positions marked with O on the CHECK are already occupied by other Kanji characters, and the respective Kanji following “
” cannot be necessarily simultaneously matched for an empty position.
Therefore, in order to register these Kanji characters on the CHECK with the relative positional relation maintained, as shown in
FIG. 1F
, it is necessary to expand both arrays of BASE and CHECK. In this case, the minimum displacement amount (parallel shift amount) d which can accommodate all these Kanji characters is calculated, and this value d is written in a position of the code value n of “
” on the BASE. Here, values obtained by adding the internal representation value of each of the Kanji characters following “
” to this displacement amount d are designated for new subscripts of the array, p, q, r, s and t. Then, the index n of the parent node of “
” is written in the positions of p, q, r, s and t on the CHECK.
FIG. 1G
shows this TRIE tree structure. In
FIG. 1G
, “
” is registered below the root node, and “
”, “
”, “
”, “
” and “
” are registered below the node n corresponding to nodes p, q, r, s and t, respectively. Here, n=g(root,
),

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Apparatus and method for retrieving character string based... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Apparatus and method for retrieving character string based..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Apparatus and method for retrieving character string based... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3026925

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.