Coded data generation or conversion – Digital code to digital code converters – To or from code based on probability
Reexamination Certificate
2002-05-14
2004-03-02
Young, Brian (Department: 2819)
Coded data generation or conversion
Digital code to digital code converters
To or from code based on probability
C341S065000, C341S106000, C341S063000
Reexamination Certificate
active
06700513
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to the compression of computer data readable by a computer system, and in particular methods of compressing a collection of data blocks that are predefined, relatively independent portions of data. More particularly still, the present invention relates to compressing a collection of data blocks such that the blocks remain independently decompressible, i.e., randomly accessible, while utilizing similarities among the collection of data blocks to achieve better compression.
BACKGROUND OF THE INVENTION
In many distributed environments, many data files, and hence significant amounts of data, are regularly transferred between different computer systems. In general, some transfer connections are faster than others, but it is well known that it takes less time to transfer smaller portions, blocks or files of data than larger ones. For this reason and others, many efforts have been made to reduce file sizes, such as through the use of compression techniques.
Compression relates to the conversion of one set or string of data to another, smaller set or string of data. Importantly however, the second, shorter set of data represents the same content as the first, longer set such that the shorter string may be generally “decompressed” back into the original string. Typical compression algorithms capitalize on redundancy to compress data. In essence, these algorithms replace redundant or often occurring phrases or portions of data with other, shorter versions. In order to accommodate such replacements, some portions of data that are not used often may be replaced with somewhat longer versions of data. However, this tradeoff is acceptable in many scenarios due to the overall or net size savings in reducing redundancy.
One common technique in compressing data relates to the use of Huffman coding. Huffman coding requires an analysis of data to determine probabilities of redundancy for the data. Upon determining relative probabilities for the data, variable length codes may be assigned to the various data phrases. Often, the variable length code assignments may be represented in the form of a “context,” which, in this case, the context is a Huffman tree. Contexts essentially provide a decompressor with the necessary information to decode, or decompress a compressed stream. Huffman trees and other variable-length contexts may be used in combination with other compression methods as well, such as “LZ-77” compression techniques, where literals and pointers may be assigned variable length codes as dictated by the context or Huffman tree.
In operation, there are essentially two ways in which contexts are used. One method relates to the creation of a generic, or stock context that may be used with all sets of data. For instance, a stock, English-alphabet context may be created which assigns variable length codes to letters of the alphabet based on a generic cross-sectional analysis of various documents. Such an analysis may provide short codes to letters “e” and “a” while assigning longer codes to letters “z” and “q” since the former symbols tend to occur more often than the latter symbols. These stock contexts may then be incorporated into a decompressor module and used to decompress any document having English letters. A small identifier sequence may be provided with the compressed data indicating that the data is in English such that the decompressor recognizes which stock context to use. One benefit to the stock context is that only a small portion of code is attached to the compressed code itself such that little overhead is lost in providing information to the decompressor about the type of context, e.g., which Huffman tree to use in decompressing the data. Further, since the decompressor is familiar with the stock context no time or space is lost in communicating either the context itself or how to use the context to the decompressor.
Unfortunately however, one drawback in using stock contexts or Huffman trees relates to the fact that each stock context is based on a generic analysis of many different documents or streams of data. These other documents are typically significantly different from the actual document being compressed and decompressed using the stock context. Consequently, the actual compression ratio is far from optimal, i.e., the resulting compressed data stream is actually larger than it needs to be, and often, the resulting compressed data stream is unsatisfactorily large.
One solution to the problems associated with using stock contexts or Huffman trees relates to using customized contexts for each stream of data being compressed. Customized contexts are created by a method that analyzes the actual data stream or document to be compressed and determines the best use of variable length codes to provide the smallest compressed stream of data for that data stream. For instance, a document written in the English language may frequently use the letter “z” and infrequently use the letter “a” such that the customized context, e.g., Huffman tree may assign a relatively small code to the letter “z” and relatively large code to the letter “a.” In such a case, the resulting compressed stream would be smaller than if the stock context described above was used.
Unfortunately however, customized contexts suffer some drawbacks as well. In particular, when a customized context is created and used, the context itself must be transferred to the decompressor. Typically, the context is appended to the front or back of the compressed stream of data. Attaching the context in this manner increases the overall size of the data to be transmitted, which may increase transmission time. This phenomenon is particularly important when the original stream of data is relatively small. In such a case, the combination of a customized context and the compressed data stream may actually be larger than the original data stream. Stated another way, the compression ratio, where the compression ratio equals the size of the output data stream divided by the size of the input data stream, may be greater than 1 when the customized context is appended to an output data stream for a relatively small input data stream.
Typically, when compressing many smaller blocks of data, concatenating the smaller blocks or objects and compressing them as one decreases the compression ratio. Although compressing the blocks as a single unit decreases the compression ratio, this technique suffers a significant drawback. In particular, in order to access a single one of the blocks of data, the entire concatenated set of blocks must be decompressed. Hence, compression, transfer and decompression time may take too long. Stated another way, there is no way to randomly access a block. For instance, if the eleventh block is needed, the first ten blocks must be decompressed before accessing the eleventh block. This situation is particularly true when the compression algorithm uses an LZ-77-type technique involving interleaved pointers and literals, where the pointers point to redundant elements located in the separate blocks. Consequently, all blocks must be decompressed prior to accessing any one block, i.e., there is no random accessibility.
It is with respect to these and other considerations that the present invention has been made.
SUMMARY OF THE INVENTION
The present invention relates to a system and method that provides a unique or custom context for a plurality of blocks of data, yet compresses the blocks independently from the others, such that each block is independently decompressible. The method analyzes a collection of blocks for compression, and computes a unique context, such as a Huffman tree given the distribution of symbols or phrases across all the blocks in the collection. The unique context is used to decompress the blocks and may be referred to as a “shared decompression context.” Consequently, rather than the decompressor having a fixed or known stock context for all blocks, the collection-specific, shared decompression context is included in an initial transfer and t
Merchant & Gould P.C.
Microsoft Corporation
Nguyen John
Young Brian
LandOfFree
Method and system for compressing and decompressing multiple... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and system for compressing and decompressing multiple..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and system for compressing and decompressing multiple... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3278616