Approximate prefix coding for data compression

Coded data generation or conversion – Digital code to digital code converters – To or from variable length codes

Reexamination Certificate

Details

Status: active
Patent number: 06518895

ABSTRACT:

FIELD AND BACKGROUND OF THE INVENTION
The present invention relates to a method of compressing a dataset and, more particularly, to an approximate method of compressing and decompressing fixed length executable code and to a computer that executes such code.
The Huffman code (D. A. Huffman, “A method for the construction of minimum redundancy codes”, Proc. IRE, vol. 40, no. 9, pp. 1098-1101, September 1952), also called variable-length code or prefix code because of two of its main properties, has been the subject of active research since its invention almost fifty years ago.
The Huffman algorithm sorts the symbols in probability order. Then the two symbols with the lowest probability are joined by a parent node. The probability of the parent node is the sum of the probabilities of the children of the parent node. This procedure continues recursively until a tree is built for all symbols. Each left branch of the tree is assigned a zero bit, and each right branch of the tree is assigned a one bit. The code of a symbol is the sequence of bits obtained by descending from the root of the tree to that symbol.
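For concreteness, the construction just described can be sketched in Python (an illustrative sketch, not taken from the patent; the function name huffman_codes and the tie-breaking counter are my own choices):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        """Build a Huffman tree bottom-up and return a {symbol: bit string} codebook."""
        freq = Counter(text)
        # One heap entry per subtree: (probability, tie-breaker, {symbol: partial code}).
        heap = [(n / len(text), k, {sym: ""}) for k, (sym, n) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            # Join the two lowest-probability subtrees under a new parent node;
            # the left branch contributes a 0 bit, the right branch a 1 bit.
            p_left, _, left = heapq.heappop(heap)
            p_right, _, right = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (p_left + p_right, tie, merged))
            tie += 1
        return heap[0][2]

Prepending the branch bit at each merge works because every merge lifts the joined subtrees one level deeper in the final tree, so bits accumulate from leaf to root in the correct order.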
The average code length L_av is defined as

    L_{av} = \sum_{i=1}^{n} p_i l_i    (1)

where n is the number of distinct symbols, p_i is the probability of symbol i, and l_i is the code length (in bits) assigned to symbol i. Consider, for example, the following string:
“ABCDEFGABCDEFGABCDEABCDEABCDABCABCABABABABABABABABABABABABABABABABABABABABABABABAAAAAAAAAAAAAAAAAAAAA”
This string has seven distinct symbols: “A”, “B”, “C”, “D”, “E”, “F”, and “G”. There are 50 “A”s, 30 “B”s, 7 “C”s, 5 “D”s, 4 “E”s, 2 “F”s and 2 “G”s in the string. The respective probabilities of the seven distinct symbols therefore are 0.5, 0.3, 0.07, 0.05, 0.04, 0.02 and 0.02. With seven distinct symbols, it would take three bits per symbol to encode this string if all symbols were of equal length, i.e., uncompressed.
FIG. 1 shows the Huffman tree for this string and the code assigned to each symbol. The code has the prefix property: no code is the prefix of another code. From equation (1), the average code length is 1.94 bits.
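FIG. 1 is not reproduced here, but for the stated probabilities the Huffman construction yields code lengths of 1, 2, 4, 4, 4, 5 and 5 bits for “A” through “G”, and plugging those lengths into equation (1) reproduces the 1.94-bit average (a small verification sketch, not part of the patent):

    # Probabilities from the example string and the code lengths produced by the
    # Huffman construction sketched above.
    probs   = {"A": 0.50, "B": 0.30, "C": 0.07, "D": 0.05, "E": 0.04, "F": 0.02, "G": 0.02}
    lengths = {"A": 1,    "B": 2,    "C": 4,    "D": 4,    "E": 4,    "F": 5,    "G": 5}

    # Equation (1): L_av = sum over i of p_i * l_i
    l_av = sum(probs[s] * lengths[s] for s in probs)
    print(l_av)   # 1.94 bits per symbol, versus 3 bits per symbol uncompressed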
A theoretical aspect of the Huffman code that has been investigated extensively is the redundancy of prefix coding. Another topic that has received considerable attention is the efficient implementation of prefix coding and decoding. Compression based on prefix coding is implemented in software in several popular utilities. The DEFLATE specification (P. Deutsch, “DEFLATE compressed data format specification version 1.3”, Request for Comments No. 1951, Network Working Group, May 1996), for example, which is used by programs such as gzip, defines a format for data compression using a combination of the LZ77 algorithm (J. Ziv and A. Lempel, “A universal algorithm for sequential data compression”, IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337-343, May 1977) and Huffman coding. The DEFLATE specification uses canonical coding (E. S. Schwartz and B. Kallick, “Generating a canonical prefix coding”, Communications of the ACM, vol. 7, no. 3, pp. 166-169, March 1964), which helps in two ways. First, the actual codebook used for compression need not be transmitted to the decompressor: the codebook is completely defined by the sequence of bit lengths of the codes for each symbol in alphabet order. Second, canonical coding improves decompression performance by using a set of decoding tables instead of a Huffman tree.
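The first point can be illustrated with a short sketch that rebuilds a canonical codebook from the per-symbol bit lengths alone, roughly following the procedure given in the DEFLATE specification (count the codes of each length, find the smallest code of each length, then hand out codes in alphabet order); the function name and the example lengths below are mine, not from the cited documents:

    def canonical_codes(lengths):
        """Rebuild a canonical prefix codebook from per-symbol code lengths,
        with symbols taken in alphabet order."""
        max_len = max(lengths.values())
        # 1. Count how many codes there are of each length.
        bl_count = [0] * (max_len + 1)
        for l in lengths.values():
            bl_count[l] += 1
        # 2. Compute the smallest code value for each length.
        next_code = [0] * (max_len + 1)
        code = 0
        for bits in range(1, max_len + 1):
            code = (code + bl_count[bits - 1]) << 1
            next_code[bits] = code
        # 3. Hand out codes consecutively, walking the symbols in alphabet order.
        codes = {}
        for sym in sorted(lengths):
            l = lengths[sym]
            codes[sym] = format(next_code[l], f"0{l}b")
            next_code[l] += 1
        return codes

    # The lengths from the example above give a prefix code with the same 1.94-bit
    # average, and the decompressor can rebuild it from the lengths alone.
    print(canonical_codes({"A": 1, "B": 2, "C": 4, "D": 4, "E": 4, "F": 5, "G": 5}))
    # {'A': '0', 'B': '10', 'C': '1100', 'D': '1101', 'E': '1110', 'F': '11110', 'G': '11111'}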
Hardware implementations of Huffman coding (S. Chang and D. G. Messerschmitt, “Designing high-throughput VLC decoders Part I—concurrent VLSI architectures”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 2, no. 2, pp. 187-196, June 1992; S. M. Lei and M. T. Sun, “An entropy coding system for digital HDTV applications”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 1, no. 1, pp. 147-155, March 1991) are used in real-time applications such as high-definition television (HDTV). J. H. Jeon et al., in “A fast variable-length decoder using plane separation”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 5, pp. 806-812, August 2000, proposed a variant of the Lei and Sun decoder that considerably shortens processing time. B. J. Shieh et al., in “A high-throughput memory-based VLC decoder with codeword boundary prediction”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 8, pp. 1514-1521, December 2000, described the design of a prefix decoder with codeword boundary prediction. The decompressor predicts the codeword length before the codeword has been fully decoded. The predicted codeword length is used to enhance parallel decoding.
Approximations of Huffman coding also are known. These approximations run faster than true Huffman coding, at the expense of somewhat less efficient compression. The key idea behind approximate coding is to partition symbols into groups such that all the symbols in the same group are assigned codes with the same length. These groups have been termed sets, packages or classes by different investigators. An approximate Huffman-style coding method (T. M. Kemp et al., “A decompression core for PowerPC”, IBM Journal of Research and Development, vol. 42, no. 6, pp. 807-812, November 1998) has been implemented in IBM's PowerPC 405. A high-performance PLA-based decompressor architecture for class-based code has been proposed by S. Weiss and S. Beren in “HW/SW partitioning of an embedded instruction memory decompressor”, Proc. Int'l Symposium on Hardware/Software Codesign, pp. 36-41, Copenhagen, Denmark, April 2001.
Early computer architectures (e.g., IBM 360/370, VAX, Intel x86) were designed with variable-length instructions to minimize program space, because of the expense of program memory. By contrast, many RISC architectures designed during the 1980's (e.g., Alpha, PowerPC, Sparc) have fixed-length 32-bit instructions. At the expense of reduced object code density, the use of fixed-length instructions simplifies instruction-level parallel processing and streamlines pipelined hardware design. In embedded system-on-a-chip devices, however, in which the program memory takes up a substantial portion of the chip resources and cost, the tradeoff between object code density and execution efficiency is closer to the pre-RISC situation, and it is advantageous to save resources by compressing the instructions.
The primary requirements for a compression/decompression method and the associated embedded instruction memory decompressor for a system-on-a-chip device are:
1. Efficient compression
2. Coding that facilitates high-performance decompression hardware
3. A small codebook
Efficient compression depends on the choice of the alphabet. Splitting 32-bit instructions into instruction halves rather than bytes produces a large alphabet of 2^16 symbols, but creates an opportunity for more efficient compression. With a large alphabet, the second and third requirements listed above can be achieved by using a form of approximate prefix coding that simplifies decompression and reduces the codebook size.
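As a rough illustration of this alphabet choice (nothing here is taken from the patent; halfword_histogram and the little-endian assumption are hypothetical), the sketch below splits a block of 32-bit instruction words into 16-bit halves and tallies symbol frequencies:

    import struct
    from collections import Counter

    def halfword_histogram(code_bytes):
        """Split a block of 32-bit instruction words into 16-bit halves and count
        how often each of the up-to-2**16 possible symbols occurs.
        Assumes little-endian words and a length that is a multiple of 4 bytes."""
        words = struct.unpack(f"<{len(code_bytes) // 4}I", code_bytes)
        halves = Counter()
        for w in words:
            halves[w & 0xFFFF] += 1           # low halfword
            halves[(w >> 16) & 0xFFFF] += 1   # high halfword
        return halves

Such a histogram is the input to the frequency ranking described in the summary below.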
The approximations of Huffman coding that have been implemented heretofore use ad hoc methods for class partitioning. There is thus a widely recognized need for, and it would be highly advantageous to have, an approximate Huffman coding method that is based on a systematic way of constructing classes.
SUMMARY OF THE INVENTION
According to the present invention there is provided a method of compressing a dataset that includes a number N of distinct symbols, all of the symbols having a common length, including the steps of: (a) ranking the symbols by frequency, thereby assigning to each symbol a respective rank i; (b) selecting a number Q of classes; (c) selecting Q distinct class codes c_j indexed by an index j such that 1≦j≦Q; and (d) for each symbol: if the rank i of the each symbol is such that 2^(q−1)≦i≦2^q−1 for an integer q≦Q: (i) assigning the class code c_q to the each symbol, (ii) a
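The abstract is cut off at this point, but the rank-to-class rule of step (d) can be sketched as follows. Only the mapping from rank i to class q is taken from the text above; the class codes and the (q − 1)-bit index within a class are my assumptions about how the encoding might continue, not a statement of the patented method:

    def symbol_class(i):
        """Class q of the symbol of rank i, defined by 2**(q - 1) <= i <= 2**q - 1.
        Equivalently, q is the number of bits needed to write i in binary."""
        return i.bit_length()

    def encode_rank(i, class_codes):
        """Hypothetical encoder for a symbol of rank i: emit the class code c_q,
        then a (q - 1)-bit index picking one of the 2**(q - 1) ranks in class q.
        The index step is an assumption; the abstract is truncated before it."""
        q = symbol_class(i)
        offset = i - (1 << (q - 1))                      # 0-based position inside class q
        index_bits = format(offset, f"0{q - 1}b") if q > 1 else ""
        return class_codes[q] + index_bits

    # Example with Q = 3 classes and arbitrary, made-up class codes:
    # ranks 1, 2-3 and 4-7 fall into classes 1, 2 and 3 respectively.
    class_codes = {1: "0", 2: "10", 3: "11"}
    print([encode_rank(i, class_codes) for i in range(1, 8)])
    # ['0', '100', '101', '1100', '1101', '1110', '1111']

Because every symbol in class q shares the class code c_q and a fixed-length index, the decoder only needs the Q class codes rather than a full per-symbol codebook; how symbols of rank greater than 2^Q−1 are handled is not stated in the truncated abstract, so the sketch covers ranks 1 through 2^Q−1 only.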
