Electrical computers and digital processing systems: multicomput – Computer-to-computer data modifying – Compressing/decompressing
Reexamination Certificate
2000-08-31
2004-12-14
Jean, Frantz B. (Department: 2151)
Electrical computers and digital processing systems: multicomput
Computer-to-computer data modifying
Compressing/decompressing
C709S246000, C709S203000, C707S793000
Reexamination Certificate
active
06832264
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to methods and systems for data coding, and specifically to methods of data compression.
BACKGROUND OF THE INVENTION
Compression algorithms based on textual substitution are widely used in storage and transmission of data files. The classic method in this class is the Lempel-Ziv algorithm, described by Ziv and Lempel in “A Universal Algorithm for Sequential Data Compression,” published in
IEEE Transactions on Information Theory
23(3), pages 337-343 (1977), which is incorporated herein by reference. These algorithms are based on finding textual resemblance between different segments in a file. The later occurrence of a given segment is then replaced by a pointer to the earlier occurrence.
Textual substitution algorithms are universally applicable, in the sense that they require no a priori knowledge of the contents of the file. In the asymptotic limit, such algorithms are capable of attaining the best compression possible for a given file. By their nature, however, algorithms known in the art start to gain compression only after an initial portion of the file (sometimes of substantial length) has been processed, and the degree of compression that is achieved grows slowly.
Wyner and Ziv studied the effectiveness of compression using a fixed, predetermined reference string, external to the file that is to be compressed, in a later article, entitled “Fixed Data Base Version of the Lempel-Ziv Data Compression Algorithm,” published in
IEEE Transactions on Information Theory
37(3), pages 878-880 (1991), which is incorporated herein by reference. They showed that asymptotic results similar to those of the classic algorithm can be achieved, provided that the reference string is produced by the same source that generates the file to be compressed. The authors make no suggestion, however, as to how the reference string should be chosen.
SUMMARY OF THE INVENTION
In preferred embodiments of the present invention, a target file is compressed by matching segments in the file to corresponding segments in a reference file or set of files. The present invention exploits the fact that typically, many computers have the same set of common files already resident on their disks, such as help files, program files and other resources supplied by a given vendor or other source. New files supplied from the same source or based on the same set of resources frequently reuse substantial, identical segments of the earlier files. By using these pre-existing, common resources as reference files, it becomes possible to match very long substrings from the target file to appropriate segments in the reference files and thus to achieve compression superior to methods known in the art.
In some preferred embodiments of the present invention, a server compresses a target file to be conveyed to a client based on a common set of reference files shared by the server and the client. The server is typically aware in advance of the reference files held by the client, which typically include the client's operating system files (even if different from the server's operating system) and other software platform components. Alternatively or additionally, the server may derive this information from a preliminary communication with the client. The server codes the target file as a list of specifiers, or pointers, to segments in the client's reference files that match successive substrings in the target file. Each specifier preferably includes a reference file identifier and an offset and length of the segment in the reference file. Substrings in the target file that do not have a match of sufficient length in any of the reference files are preferably added to the pointer list as is, most preferably with a flag to indicate that they are uncoded.
Preferably, the server compresses the coded list by encoding the specifiers in the list. The list of specifiers typically draws on a small subset of the total corpus of reference files. Therefore, the reference file identifiers can be encoded efficiently based on the frequency of their occurrence, using Huffman coding, for example. The server preferably adds a header to the resultant compressed file, identifying the reference files that it has used and their respective codes.
In some preferred embodiments of the present invention, the server maintains multiple sets of reference files, typically corresponding to different client platforms, in order to generate different compressed versions of the target file depending on the client to which the compressed file is to be sent. Alternatively or additionally, the computer prepares and caches the different versions in advance. It will be understood that while preferred embodiments are described herein with reference to a client/server architecture, peer computers may also compress, store and/or transmit to one another efficiently-compressed files based on the principles of the present invention.
In preferred embodiments of the present invention, in order to decompress the file, the client reads the header and opens the appropriate reference files on its own disk. It then processes the list of pointers, retrieving the successive segments from the reference files that are indicated by the specifiers. These segments are concatenated in order to reconstruct the target file, along with any uncoded substrings as appropriate.
Although preferred embodiments are described herein with reference to compression and transmission of files, it will be understood that the principles of the present invention are equally applicable to compression of other bodies of data that can be characterized as strings of characters or other symbols. Therefore, in the context of the present patent application and in the claims, the terms “file,” “target file” and “reference files” should be taken to refer generally to substantially any suitable bodies of computer-readable data to which the principles of the present invention may be applied. Similarly, the terms “substrings” and “segments” are used herein for convenience and clarity to denote strings of symbols within the target file and reference files, respectively, and should be understood to refer to strings of substantially any length and suitable type within the bodies of data in question.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for compressing a target string of symbols, including:
identifying a set of reference strings stored by a computer;
matching a plurality of successive substrings in the target string to respective segments found in one or more of the reference strings;
assigning respective segment specifiers to the substrings that identify the respective segments to which they are matched; and
outputting an ordered list of the specifiers.
Preferably, identifying the set of reference strings includes identifying a set of files used by the computer. Preferably, the files are associated with an operating platform of the computer. Alternatively or additionally, the files include at least one of a program file and a help file.
Preferably, matching the substrings includes, for each of the substrings, finding the respective one of the segments in the reference strings so as to maximize a length of the substring. Further preferably, matching the substrings includes encoding the reference strings in a tree structure having nodes connected by edges corresponding to the symbols in the segments, and finding the respective one of the segments includes traversing in succession the edges of the tree that correspond to the symbols in the substring. Most preferably, traversing the edges of the tree includes traversing the tree up to one of the nodes reached by a last one of the edges in the succession, and assigning the respective segment specifiers includes returning a node specifier associated with the one of the nodes.
Additionally or alternatively, assigning the respective segment specifiers includes assigning the specifiers only to the substrings that are matched by segments of
Factor Michael
Sheinwald Dafna
Darby & Darby
Jean Frantz B.
LandOfFree
Compression in the presence of shared data does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Compression in the presence of shared data, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Compression in the presence of shared data will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3303012