Cross-file pattern-matching compression

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000, C707S793000

Reexamination Certificate

active

06226628

ABSTRACT:

TECHNICAL FIELD
This invention relates to systems and methods for compressing different groups of files in response to variable requests by clients for such groups of files.
BACKGROUND OF THE INVENTION
Although the amount of available digital information has mushroomed in recent years, limited data storage capacities and data communications bandwidths sometimes threaten the practicality of distributing this information. To deal with storage and bandwidth limitations, the use of data compression has become almost universal.
Various data compression techniques are available, two of the more popular being known as “zip” and “gzip”. Both of these compression techniques utilize some form of pattern matching compression, in which a string beginning at a current data element is represented by referencing a previous, identical string. A well-known example of pattern matching compression (variations of which are used within both zip and gzip) is referred to as “LZ
77
”.
Pattern matching compression involves sequentially examining data elements (also referred to herein as characters) and strings of data elements from a data input stream, and noting any strings that are repetitions of previously encountered identical strings. When the algorithm encounters an occurring string that matches a previously encountered string, the algorithm records two values in place of the occurring string: a length value and a displacement or distance value. The length value indicates the length of the matching strings. The displacement value indicates the number of elements back in the input stream to the previously occurring and matching string.
When the algorithm encounters a data element that cannot be matched to a previously encountered string, the algorithm records the value of the element itself. Such an element is referred to as a “literal” or “literal element.”
Typically, the compressed data stream comprises literals with interspersed length/displacement pairs. A length element is always followed by a displacement element in the compressed data string.
In implementing a compression engine for pattern matching compression, it is usually desired to avoid repeated exhaustive searches of prior data elements. Instead, there is usually some way to record the locations of different strings as they are encountered, to ease the job of finding such strings when processing subsequent characters and strings. In many implementations, one or more lookup tables or hash tables are created and updated as the compression proceeds. A hash table contains a plurality of entries, each pointing to a linked list of previous input stream locations. As the algorithm advances through an input stream, it references the hash table and the linked lists to find previous matching strings, and also updates the table and lists to account for newly encountered data.
As an example, suppose a hash table such as this is indexed by three characters, and that the compression algorithm is attempting to match the string “bdeefis . . .”. Referencing the hash table yields a linked list that leads to all previous strings that begin with the three characters “bde”. The algorithm performs a string compare at locations of all such previous strings, to determine which yields the longest matching string.
As an improvement to this scheme, multiple hash tables are sometimes maintained, corresponding to different match lengths. For example, one hash table might be indexed with three characters, while another is indexed with four. In this case, the hash table and linked lists with the largest number of index characters are referenced first. Tables and lists with smaller numbers of index characters are referenced only if needed.
The preceding discussion is somewhat simplified, but is sufficient for understanding the characteristics of pattern matching compression that are pertinent to the invention. Further details regarding compression techniques can be found in M. Nelson & J. Gailly,
The Data Compression Book,
(2d ed. 1996), which is hereby incorporated by reference. In addition, specifications for the gzip and zip compression techniques can be found in Internet RFCs 1951 and 1952, which are also incorporated by reference.
The scheme described above works well for individual files. Groups of files can also be compressed by concatenating them so that matches can be found across file boundaries. This is referred to as “cross-file” pattern matching compression. Especially for short files, cross-file compression is much more efficient than independently compressing individual files—the longer input stream makes it more likely that matches will be found among the earlier data elements.
In a server environment, or any other environment where files are to be distributed through limited-bandwidth distribution channels, it is generally desired to store files in their compressed formats. This avoids the need for the server to recompress the files every time they are transmitted. If groups of files are to be transmitted, they can be concatenated, compressed using cross-file compression, and stored in their concatenated and compressed state.
In many situations, however, it is not possible to predict which combinations of files will be requested from a server. In these situations, the files must be compressed and stored individually—thus forgoing the advantages of cross-file compression. Alternatively, the files can be stored uncompressed, and then concatenated and compressed (using cross-file compression) in response to client requests. However, this places a tremendous load on the server, since each request requires fresh compression efforts.
SUMMARY OF THE INVENTION
The invention makes use of ancillary files that are stored along with data files. The ancillary files contain preprocessed data relating to characters, strings, and string matches within the individual files.
More specifically, a lookup table is created and stored along with each data file. The lookup table indicates the position in the data file of the last occurrence of each different data value.
In addition, a plurality of displacement indexes are created and stored along with each data file. Each index corresponds to a specific match length, and has entries corresponding respectively to different characters of the data file. For each character, an index indicates a displacement back to a previous matching string of the given match length.
In response to a request for a specific set of files, a server performs cross-file, pattern-matching compression, taking advantage of the data provided in the ancillary files. To find a match for a string beginning with a given character (referred to as the current element or character), the server first utilizes the indexes associated with the current data file to find the largest match within the same file. The server then references the ancillary files associated with previous data files in order to find any larger matching strings in those data files.


REFERENCES:
patent: 4672539 (1987-06-01), Goertzel
patent: 5357431 (1994-10-01), Nakada et al.
patent: 5426779 (1995-06-01), Chambers, IV
patent: 5506580 (1996-04-01), Whiting et al.
patent: 5659730 (1997-08-01), Kelly et al.
patent: 5822746 (1998-10-01), Williams
patent: 5999949 (1999-12-01), Crandall
patent: 6003043 (1999-12-01), Hatakeyama et al.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Cross-file pattern-matching compression does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Cross-file pattern-matching compression, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Cross-file pattern-matching compression will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2453511

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.