Data compressing apparatus, reconstructing apparatus, and...

Image analysis – Pattern recognition – Context analysis or word recognition


Details

C382S232000, C382S233000, C382S239000, C382S247000, C341S050000, C341S051000, C341S106000, C341S107000, C358S426010, C707S793000, C348S384100, C708S203000

Type: Reexamination Certificate (active)

Patent number: 06542640

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates to a data compressing apparatus, a data reconstructing apparatus, and a method for compressing and reconstructing document data. More particularly, the invention relates to a data compressing apparatus, a data reconstructing apparatus, and a method for compressing and reconstructing document data formed by character codes of a language, such as Japanese, Chinese, or Hangul, whose word structure is not separated by spaces.
In recent years, various data such as character codes and image data have come to be handled by computers. Further, with the spread of the Internet and intranets, the number of mails and electronic documents is increasing. By compressing such a large amount of data so as to omit its redundant portions, the required storage capacity can be reduced, and the compressed data can be sent to a remote place in a short time. The field of the invention is not limited to the compression of character codes but can be applied to various kinds of data. Adopting the terminology of information theory, one word unit of data is called a character, and data in which an arbitrary plurality of such characters are connected is called a character train hereinbelow.
As data compression, there are a dictionary type coding, which uses the similarity of a data series, and a probability statistic type coding, which uses only the appearance frequency of the data. The dictionary type coding is a method whereby a character train is replaced with its registration number in a dictionary; character trains are registered in such a manner that the higher the appearance frequency of a character train, the longer the train registered in the dictionary, thereby obtaining a high compression ratio. Typical methods of the dictionary type coding are LZ77 and LZ78 (see, for example, Tomohiko Uematsu, "Document data compression algorithm handbook", CQ publisher). According to LZ77, a buffer of a predetermined size is provided, and the position and length of the longest character train in the buffer that coincides with the input are encoded. On the other hand, according to LZ78, character trains which appeared in the past are registered in a dictionary and their registration numbers are encoded. The probability statistic type coding is a method of obtaining a high compression ratio by allocating a short code length to a character having a high appearance frequency, in accordance with the statistic appearance frequency of each character. Typical probability statistic type codings are the arithmetic coding (see, for example, Ian H. Witten et al., "Arithmetic Coding for Data Compression", Commun. of the ACM, Vol. 30, No. 6, pp. 520-540) and the Huffman coding (see, for example, Donald E. Knuth, "Dynamic Huffman Coding", Journal of Algorithms, Vol. 6, pp. 163-180).
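To make the LZ78 scheme described above concrete, the following is a minimal sketch in Python (an illustration only, not the implementation of any of the cited works): character trains which appeared in the past are registered in a dictionary, and each output is a pair of a registration number and the character which extends it.

    def lz78_encode(data: str):
        """Minimal LZ78 sketch: emit (registration number, next character) pairs.

        Registration number 0 means no known prefix; frequently appearing
        character trains grow longer in the dictionary, which is what
        yields the compression.
        """
        dictionary = {}              # character train -> registration number
        output = []
        prefix = ""
        for ch in data:
            if prefix + ch in dictionary:
                prefix += ch         # keep extending the longest known train
            else:
                output.append((dictionary.get(prefix, 0), ch))
                dictionary[prefix + ch] = len(dictionary) + 1
                prefix = ""
        if prefix:                   # flush a trailing matched train
            output.append((dictionary[prefix], ""))
        return output

    print(lz78_encode("abababc"))    # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'c')]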
In order to obtain a further compression effect, a coding has been proposed which uses a context collecting unit 200 and a variable length coding unit 202 as in FIG. 1 and performs variable length coding on the basis of, not the appearance probability of each character alone, but a conditional appearance probability that takes into account a context expressing the dependence relation between an input character and the characters just before it. The method whereby the variable length coding is performed by using the conditional probability in which the context is taken is called a context model. Taking the three input characters a, b, and c of FIG. 2A as an example, the context and a coding target character are expressed by the tree structure of FIG. 2B. This tree structure is called a context tree; the number of times of appearance is counted at each node each time a character train which passes through the character of that node appears, thereby obtaining the conditional probability.
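As an illustration of such a context model, the following sketch (a hypothetical class and interface, not the cited proposal itself) counts at each node of the context tree how often a character follows a given context, and derives the conditional probability from those counts:

    from collections import defaultdict

    class ContextModel:
        """Sketch of context collection: count how often each character
        follows each context, then derive conditional probabilities."""

        def __init__(self, order: int = 2):
            self.order = order
            self.counts = defaultdict(lambda: defaultdict(int))  # context -> char -> count

        def train(self, text: str):
            for i, ch in enumerate(text):
                context = text[max(0, i - self.order):i]
                self.counts[context][ch] += 1    # one node of the context tree

        def prob(self, context: str, ch: str) -> float:
            node = self.counts[context[-self.order:]]
            total = sum(node.values())
            return node[ch] / total if total else 0.0

    m = ContextModel(order=1)
    m.train("abcabcabc")
    print(m.prob("a", "b"))          # -> 1.0, since 'b' always follows 'a' here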
The LZ78 systems and the probability statistic type codings are classified into the following three kinds, according to how the code allocation relates to the actual appearance frequency of the non-compression data train:
I. a static coding, which allocates codes in accordance with a preset appearance frequency;
II. a semi-adaptive coding, which allocates codes in accordance with the appearance frequency obtained by first scanning the whole character train; and
III. an adaptive coding, which recalculates the frequency each time a character appears and allocates codes in accordance with the recalculated appearance frequency.
For compression in which the kind of the non-compression data train is not restricted, the semi-adaptive coding or the adaptive coding is used.
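The following hypothetical sketch contrasts the three kinds by where the frequency table comes from; the ideal code length of -log2(p) bits per character stands in for an actual variable length code such as a Huffman or arithmetic code:

    import math
    from collections import Counter

    def ideal_code_length(data: str, mode: str) -> float:
        """Total ideal code length in bits when the frequency table is
        preset (static), measured by a first scan of all the data
        (semi-adaptive), or recalculated per character (adaptive)."""
        if mode == "semi-adaptive":
            freq = Counter(data)         # first scan of the whole train
        else:
            freq = Counter(set(data))    # hypothetical preset (uniform) table
        bits = 0.0
        for ch in data:
            bits += -math.log2(freq[ch] / sum(freq.values()))
            if mode == "adaptive":
                freq[ch] += 1            # recalculate the frequency
        return bits

    for mode in ("static", "semi-adaptive", "adaptive"):
        print(mode, round(ideal_code_length("aaaaabbc", mode), 1))

On this short train the semi-adaptive table wins, while the adaptive coder pays a learning cost and ends up close to the static one; exactly this behavior on small data is the problem discussed next.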
According to the conventional semi-adaptive and adaptive codings, when large data of about a few Mbytes is compressed, a code adapted to the non-compression data train can be allocated, so that a high compression ratio is obtained. In the case of compressing small data of about a few kbytes, however, every character train appears only a few times, so a code adapted to the statistic appearance frequency cannot be allocated, and a high compression ratio cannot be obtained by the semi-adaptive or the adaptive coding. On the other hand, in the static coding, which allocates codes in accordance with a preset appearance frequency, a constant compression ratio can be obtained irrespective of the data size; but since the number of preset code tables is fixed to one, there is a problem that a high compression ratio cannot be obtained for data whose statistics differ from those of the prepared codes. Especially when small data of about a few kbytes of document data of a language such as Japanese, Chinese, or Hangul, in which one character is expressed by word data of two bytes, is compressed, a compression effect can hardly be expected from the conventional codings. There are also cases where the data amount after compression increases, depending on the document. Further, the conventional codings have a problem that, since the process is executed on a byte unit basis, the process is complicated and it is difficult to realize a high processing speed.
SUMMARY OF THE INVENTION
According to the invention, there are provided a data compressing apparatus, a data reconstructing apparatus, and a method which can compress and reconstruct even small data on the order of a few kbytes at a high speed while maintaining a high compression ratio.
First Embodiment
A target of the invention is a data compressing apparatus for compressing non-compression data formed by character codes of a language having a word structure which is not separated by spaces; examples of such languages are Japanese, Chinese, and Hangul. Such a data compressing apparatus (basic apparatus) is characterized by comprising: a character train dictionary storing unit for storing a dictionary in which the character trains each serving as a processing unit at the time of compression have been registered; a character train comparing unit for detecting a partial character train which coincides with a registered character train by comparing the registered character trains in the character train dictionary storing unit with partial character trains in the non-compression data; and a code output unit for allocating a predetermined character train code to every partial character train for which the coincidence has been detected by the character train comparing unit and outputting the allocated code.
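As a rough sketch of how the three units could cooperate (the function, dictionary contents, and code representation here are illustrative assumptions, not the patented implementation), the comparing unit performs a longest-match search over the registered character trains, and the code output unit emits one registration number per detected coincidence:

    def compress(text: str, registered: list[str]) -> list[int]:
        """Sketch of the basic apparatus: `registered` plays the character
        train dictionary storing unit, the longest-match loop is the
        character train comparing unit, and the emitted registration
        numbers are the character train codes.  Assumes every character
        occurs in the dictionary at least as a one-character train."""
        dictionary = {train: code for code, train in enumerate(registered)}
        max_len = max(map(len, registered))
        codes, i = [], 0
        while i < len(text):
            # comparing unit: longest registered train coinciding here
            for length in range(min(max_len, len(text) - i), 0, -1):
                train = text[i:i + length]
                if train in dictionary:
                    codes.append(dictionary[train])  # code output unit
                    i += length
                    break
        return codes

    # Toy dictionary; a real one would hold word-level character trains.
    print(compress("abcab", ["a", "b", "c", "ab", "abc"]))  # -> [4, 3]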
Considering Japanese as an example, there is a study result of the Japan Electronic Dictionary Research Institute (EDR) Co., Ltd. regarding Japanese words (Yokoi, Kimura, Koizumi, and Miyoshi, "Information structure of electronic dictionary at surface layer level", Transactions of the Information Processing Society of Japan, Vol. 37, No. 3, pp. 333-344, 1996). In this study, the morphemes constituting Japanese, that is, the parts of speech of words, are tabulated. When the words are simply classified into parts of speech classes and those classes are registered, the number of classes is equal to 136,486, so they can be expressed by codes of 18 bits (maximum 262,143).

The number of characters const
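The bit width follows directly from the class count: 2^17 = 131,072 is smaller than 136,486, so 18 bits (maximum value 262,143) is the smallest fixed code width that can hold one code per class, as the following check confirms:

    import math

    entries = 136_486                     # part-of-speech classes in the EDR study
    bits = math.ceil(math.log2(entries))  # smallest fixed code width that fits
    print(bits, 2 ** bits - 1)            # -> 18 262143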
