Approximate string matching system and process for lossless...

Image analysis – Image compression or coding


Details

U.S. Class: C382S236000
Type: Reexamination Certificate
Status: active
Patent number: 06535642

ABSTRACT:

BACKGROUND
1. Technical Field
The invention relates to a computer-implemented data compression system and process, and more particularly, to a lossless data compression system and process that employs a unique approximate string matching scheme.
2. Background Art
Lossless compression of signals is often of critical importance. Because of storage limitations (e.g. hard drive size) or bandwidth limitations (e.g. modem speed), compression is often necessary. For many applications, the signal of interest must be compressed losslessly. For example, text or computer programs would be worthless if the retrieved data were not identical to the transmitted (or stored) data. Existing methods do well on this type of data. However, there are other types of data on which current methods perform very poorly, but which still need to be compressed losslessly. For example, certain types of image signals are both very large (and therefore in need of compression) and contain sensitive details. Examples of such signals include medical (e.g. MRI, X-ray, CT) or Synthetic Aperture Radar (SAR) images, where distortion due to lossy compression could result in missing critical features (e.g. tumors or enemy tanks), and images whose usage requires the maximum possible quality (e.g. images for magazine covers, frames for movies). Other such data signals include sound sources (e.g. music or speech) and echo recordings (e.g. from geological surveys for earthquake monitoring or oil exploration).
Simple lossless compression is typically performed by entropy coding (examples of such techniques are Huffman and arithmetic coding). In this case, a special code is designed that maps symbols in the source to the symbols that are transmitted (or stored). An entropy code uses a small number of bits for common symbols in the source, at the expense of using long codes for infrequent symbols.
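As a concrete illustration, here is a minimal Huffman-coding sketch in Python (the example string and the helper name huffman_code are illustrative assumptions, not part of the patent):

    import heapq
    from collections import Counter

    def huffman_code(data):
        """Build a Huffman code: frequent symbols get short bit strings."""
        freq = Counter(data)
        # Each heap entry: (frequency, tiebreaker, {symbol: bitstring}).
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + b for s, b in c1.items()}  # prefix with 0 / 1
            merged.update({s: "1" + b for s, b in c2.items()})
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    text = "this is the theme of the thesis"
    code = huffman_code(text)
    bits = "".join(code[s] for s in text)   # common symbols cost fewer bits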
Current lossless compression schemes take advantage of repeating strings of symbols within a source. For example, the combination of symbols “t”, “h” and “e” occurs very frequently in English text. These schemes exploit this redundancy by decreasing the cost of transmitting or storing such common sequences. Examples of such schemes are the Lempel-Ziv family of source codes, including LZ77, LZ78, LZW and derivatives [LZ76, ZL77, Wel84], as well as the Prediction by Partial Matching (PPM) algorithm [CW84, Mof90]. LZ77 is used as the foundation for the PkZip, gzip and WinZip products, while LZW is used in the CompuServe Graphics Interchange Format (GIF). The PPM algorithm (specifically the PPM* algorithm) is generally considered to be at or near the state of the art in the compression of text sources. Both LZ and PPM are so-called universal compression schemes, which means that in the limit they will converge to the maximum possible compression. However, this is only true in the limit, that is, with an infinite amount of data. In practice we are always dealing with finite amounts of data, so the rate of convergence is of critical importance.
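To make the exact-repeat mechanism concrete, here is a simplified, greedy LZ77-style sketch in Python (the window size, minimum match length and output framing are illustrative choices, not the published LZ77 specification):

    def lz77_encode(data, window=4096, min_match=3):
        """Greedy LZ77-style pass: emit (offset, length) pairs for exact
        repeats within the window, and literal symbols otherwise."""
        out, i = [], 0
        while i < len(data):
            best_len, best_off = 0, 0
            for j in range(max(0, i - window), i):
                k = 0
                while i + k < len(data) and data[j + k] == data[i + k]:
                    k += 1
                if k > best_len:
                    best_len, best_off = k, i - j
            if best_len >= min_match:
                out.append(("match", best_off, best_len))
                i += best_len
            else:
                out.append(("literal", data[i]))
                i += 1
        return out

    # "the t" repeats exactly, so it becomes a single (offset, length) token:
    print(lz77_encode("the theme of the thesis"))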
However, these existing lossless data compression schemes do not work well when the source alphabet is large, as in the case of sound, images, and other natural signals. They fail because they cannot exploit the redundancy in these signals. This is because in such signals, strings will rarely repeat exactly. As a result, the convergence of these algorithms to optimal compression is very slow. For realistic sources, even large sources on the order of terabytes, these algorithms will fail to reach compression levels near the theoretical optimum.
In natural signals, while data strings will rarely repeat exactly, they will often repeat approximately. For example, in images, a row is often very similar to the row above it. However, it is rarely exactly the same, due to lighting and structural variations. Even in the case of a uniform object with uniform lighting, long repeating strings of pixel values are extremely uncommon due to the noise in the image capturing process. Even a slight variation in the sequence of pixels will prevent the LZ and PPM algorithms from finding and exploiting the underlying redundancy. For example, consider two rows of pixels from a black and white image shown in the following table:
TABLE 1

Row 1:    1   12    4  123  243  253  143   10    3    2
Row 2:    6    3    1  130  252  250  126    7    7    2
Such an array of pixels might arise from an image of a white stripe on a dark background. Using the LZ algorithm or the PPM algorithm on these pixels will result in no compression at all.
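The redundancy is still there, however: subtracting Row 1 from Row 2 element-wise (an illustrative computation, not part of the patent text) yields differences that cluster near zero, exactly the kind of data entropy coders handle well:

    row1 = [1, 12, 4, 123, 243, 253, 143, 10, 3, 2]
    row2 = [6, 3, 1, 130, 252, 250, 126, 7, 7, 2]

    # The residual is small everywhere, so an entropy coder (Huffman,
    # arithmetic) can represent the second row far more cheaply than
    # its raw pixel values.
    residual = [b - a for a, b in zip(row1, row2)]
    print(residual)   # [5, -9, -3, 7, 9, -3, -17, -3, 4, 0]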
Thus, while current lossless data compression methods work well for data characterized by strings that repeat exactly, efforts continue to find a method that can effectively compress data that lacks these identically matching data strings.
It is noted that in the preceding paragraphs the description refers to various publications identified by alphanumeric designators contained within a pair of brackets, e.g. [LZ76, ZL77, Wel84]. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.
SUMMARY
The present invention is directed to a lossless data compression system and process that employs a unique approximate string matching scheme to overcome the problems associated with current compression methods. This system and process includes both an encoder and a decoder.
Generally, the encoding portion of this system and process entails identifying mutually exclusive blocks of data values within the source data being encoded whose lengths, when represented with the pointer-residual approach described below, exhibit the lowest overall cost for transmitting or storing the source data. These data blocks are referred to as optimal blocks. A block of earlier occurring source data is also identified for each optimal block. These earlier occurring blocks are chosen so that, when compared to the optimal block, their difference exhibits the lowest possible entropy. The optimal blocks and their associated blocks of earlier occurring data values are used to generate residual data. This residual data is made up of blocks of data values formed by taking the difference between the values in each corresponding data location of each optimal block and its associated block of earlier occurring source data, starting with the first data location of both blocks. The choice of a block of earlier occurring source data for use in forming a residual data block is based on a cost analysis designed to minimize, at least to a desired degree, the entropy of the differences between the previous block and the new block of source data. The purpose of this difference minimization is to create blocks of residual data having a highly kurtotic distribution, so that most of the data is centered around zero. Such a distribution makes the residual data amenable to commonly used entropy-based compression techniques, such as Huffman or Shannon-Fano coding, or even higher-order techniques such as LZ77, LZW and their derivatives, or PPM.

In addition to the residual data, a series of pointers is generated. Each pointer is initially affiliated with a separate one of the optimal blocks and identifies the location and length of the block of earlier occurring source data associated with that optimal block. Preferably, prior to compressing the encoded data, each pointer is assigned to the residual data location that corresponds to the location of the first data value of the optimal block with which the pointer was affiliated. The residual data is then compressed using any appropriate entropy-based compression technique, such as those mentioned above. If desired, the pointers can also be compressed using similar techniques at this point in the process. Finally, the compressed residual data and pointers are combined to form a compressed code capable of being transmitted or stored.
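As a rough sketch of the pointer-residual encoding just described (a fixed block length and an exhaustive entropy-minimizing search stand in for the optimal block selection and cost analysis of the actual process; all names are illustrative):

    from collections import Counter
    from math import log2

    def entropy(values):
        """Empirical entropy, in bits per symbol, of a residual block."""
        n = len(values)
        return -sum(c / n * log2(c / n) for c in Counter(values).values())

    def encode(data, block=8):
        """For each fixed-length block, point at the earlier block whose
        element-wise difference has the lowest empirical entropy, and keep
        that difference as the residual.  Blocks with no earlier candidate
        are stored raw; a tail shorter than one block is ignored here."""
        pointers, residuals = [], []
        for i in range(0, len(data) - block + 1, block):
            cur = data[i:i + block]
            best = None
            for j in range(0, i - block + 1):     # earlier occurrences only
                prev = data[j:j + block]
                res = [c - p for c, p in zip(cur, prev)]
                h = entropy(res)
                if best is None or h < best[0]:
                    best = (h, j, res)
            if best is None:
                pointers.append(None)              # raw block, no pointer
                residuals.append(cur)
            else:
                pointers.append((best[1], block))  # (location, length)
                residuals.append(best[2])
        return pointers, residuals

The residuals and pointers produced this way would then be passed to an entropy coder such as the Huffman sketch above.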
The decoding portion of the system and process generally entails first decompressing the residual data and the pointers (if applicable) using an appropriate decoder for the particular compression technique that was employed.
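Under the same simplifying assumptions, a matching decoder sketch (pointers and residuals assumed already decompressed) shows the basic reconstruction step:

    def decode(pointers, residuals, block=8):
        """Invert the pointer-residual encoding: rebuild each block by
        adding its residual to the earlier block its pointer references,
        or by copying raw blocks verbatim."""
        data = []
        for ptr, res in zip(pointers, residuals):
            if ptr is None:
                data.extend(res)                   # raw block
            else:
                j, length = ptr
                prev = data[j:j + length]          # already-decoded data
                data.extend(p + r for p, r in zip(prev, res))
        return data

    # Round trip with the hypothetical encoder sketched above:
    # src = [1, 12, 4, 123, 243, 253, 143, 10, 3, 2] * 4
    # assert decode(*encode(src, block=10), block=10) == src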
