Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1997-02-03
2002-04-16
Corrielus, Jean M. (Department: 2172)
C707S793000, C341S050000, C341S067000, C341S095000, C711S202000, C708S203000
active
06374250
FIELD OF THE INVENTION
The invention generally relates to the field of data compression. More specifically, the invention relates to techniques, applicable to data which occurs in different versions, for finding differences between the versions.
BACKGROUND OF THE INVENTION
DIFFERENCING ALGORITHMS AND DELTA FILES
Differencing algorithms compress data by taking advantage of statistical correlations between different versions of the same data sets. Strictly speaking, differencing algorithms achieve compression by finding common sequences between two versions of the same data that can be encoded using a copy reference.
The term “file” will be used to indicate a linear data set to be addressed by a differencing algorithm. Typically, a file is modified one or more times, each modification producing a successive “version” of the file.
While this terminology is conventional, differencing applies more generally to any versioned data and need not be limited to files.
A differencing algorithm is defined as an algorithm that finds and outputs the changes made between two versions of the same file by locating common sequences to be copied, and by finding unique sequences to be added explicitly.
A delta file (&Dgr;) is the encoding of the output of a differencing algorithm. An algorithm that creates a delta file takes as input two versions of a file, a base file and a version file to be encoded, and outputs a delta file representing the incremental changes made between versions.
F_base + F_version → Δ_(base, version)
Reconstruction, the inverse operation, requires the base file and a delta file to rebuild a version.
F_base + Δ_(base, version) → F_version
FIG. 1 is an illustration of the process of creating a delta file from a base file and a version file. A base file 2 and a version file 4 are shown schematically, in a linear “memory map” format. They are lined up parallel to each other for illustrative purposes.
Different versions of a file may be characterized as having sequences of data or content. Some of the sequences are unchanged between the versions, and may be paired up with each other. See, for instance, unchanged sequences 6 and 8. By contrast, a sequence of one version (e.g., a sequence 10 in the base file) may have been changed to a different sequence in the version file (e.g., 12).
One possible encoding of a delta file, shown as 14, consists of a linear array of editing directives. These directives include copy commands, such as 16, which are references to a location in the base file 2 where the same data as that in the version file 4 exists; and further include add commands, such as 18, which are instructions to add data into the version file 4, the add data instruction 18 being followed by the data (e.g., 20) to be added.
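As a purely illustrative sketch in Python, the copy/add encoding described above might be represented as a list of directives and replayed against the base file to reconstruct a version; the tuple layout and the name apply_delta are assumptions made for illustration, not the encoding of the invention.

    # Hypothetical delta representation: a linear array of directives, where
    #   ('copy', offset, length) copies `length` bytes starting at `offset` of the base file
    #   ('add', data)            supplies literal data that is unique to the version file
    def apply_delta(base: bytes, delta: list) -> bytes:
        """Rebuild the version file from the base file and a delta (reconstruction)."""
        out = bytearray()
        for directive in delta:
            if directive[0] == 'copy':
                _, offset, length = directive
                out += base[offset:offset + length]
            elif directive[0] == 'add':
                _, data = directive
                out += data
            else:
                raise ValueError("unknown directive: %r" % (directive[0],))
        return bytes(out)

    # Example: the base file contains "abcdef"; the version keeps "abc",
    # replaces "de" with "XY", and keeps "f".
    base = b"abcdef"
    delta = [('copy', 0, 3), ('add', b"XY"), ('copy', 5, 1)]
    assert apply_delta(base, delta) == b"abcXYf"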
Whatever the representation scheme, a differencing algorithm must first have found the copies and adds to be encoded; other encoding techniques are likewise compatible with the methods to be presented in accordance with the invention.
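To make the search for copies and adds concrete, the following deliberately naive Python sketch scans the version file and, at each position, looks for a sufficiently long match anywhere in the base file, emitting copy directives for matches and add directives for the remaining literal bytes. It is only an illustration of the copy/add idea, paired with the hypothetical apply_delta sketch above, and not the differencing algorithm of the invention; its quadratic search would be far too slow for real files.

    def create_delta(base: bytes, version: bytes, min_match: int = 4) -> list:
        """Naive differencing: emit ('copy', offset, length) and ('add', data) directives."""
        delta = []
        pending = bytearray()  # literal bytes not yet flushed as an 'add'
        i = 0
        while i < len(version):
            # Find the longest match in the base file for the data starting at version[i].
            best_off, best_len = -1, 0
            for off in range(len(base)):
                length = 0
                while (off + length < len(base) and i + length < len(version)
                       and base[off + length] == version[i + length]):
                    length += 1
                if length > best_len:
                    best_off, best_len = off, length
            if best_len >= min_match:
                if pending:
                    delta.append(('add', bytes(pending)))
                    pending.clear()
                delta.append(('copy', best_off, best_len))
                i += best_len
            else:
                pending.append(version[i])
                i += 1
        if pending:
            delta.append(('add', bytes(pending)))
        return delta

    # Round trip with the apply_delta sketch above:
    #     apply_delta(base, create_delta(base, version)) == version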
DIFFERENTIAL ALGORITHMS APPLIED
Several potential applications of version differencing motivate the need for a compact and efficient differencing algorithm. Such an algorithm can be used to distribute software over a low bandwidth network such as a point-to-point modem link or the Internet. Upon releasing a new version of software, the new version is differenced with respect to the previous version. With a compact delta, a low bandwidth channel can effectively distribute a new release of dynamically self-updating software in the form of a binary patch. This technology has the potential to greatly reduce time to market for a new version, and to ease the distribution of software customizations. For replication in distributed file systems, differencing can greatly reduce the amount of information that must be transferred, by transmitting deltas for all of the modified files in the replicated file set.
In distributed file system backup and restore, differential compression would reduce the time to perform file system backup, decrease network traffic during backup and restore, and lessen the storage required to maintain a backup image. See U.S. Pat. No. 5,574,906, issued to Robert Morris, titled “System and Method for Reducing Storage Requirement in Backup Subsystems Utilizing Segmented Compression and Differencing”.
The '906 patent describes that backup and restore can be limited both by network bandwidth, often 10 MB/s, and by poor throughput to secondary and tertiary storage devices, often 500 KB/s to tape storage. Since resource limitations frequently make backing up just the changes to a file system infeasible over a single night or even a weekend, differential file compression has great potential to alleviate bandwidth problems by using available processor cycles to reduce the amount of data transferred. This technology can be used to provide backup and restore services on a subscription basis over any network, including the Internet.
PREVIOUS WORK IN DIFFERENCING
Differencing has its origins in longest common subsequence (LCS) algorithms, and in the string-to-string correction problem. For examples of the former, see A. Apostolico, S. Browne, and C. Guerra, “Fast linear-space computations of longest common subsequences”, Theoretical Computer Science, 92(1):3-17, 1992, and Claus Rick, “A new flexible algorithm for the longest common subsequence problem”, Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching, Espoo, Finland, Jul. 5-7, 1995. For an example of the latter, see R. A. Wagner and M. J. Fischer, “The string-to-string correction problem”, Journal of the ACM, 21(1):168-173, January 1974.
Some of the first applications of differencing updated the screens of slow terminals by sending a set of edits to be applied locally rather than retransmitting a screen full of data. Another early application was the UNIX “diff” utility, which used the LCS method to find and output the changes to a text file. diff was useful for source code development and for primitive document control.
LCS algorithms find the longest common subsequence of two strings by optimally removing symbols from both strings so that the symbols remaining in each are identical and appear in the same order. (A string/substring contains all consecutive symbols between and including its first and last symbol, whereas a sequence/subsequence may omit symbols with respect to the corresponding string.)
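For illustration, the length of a longest common subsequence can be computed with the classic quadratic-time dynamic program sketched below in Python; this is the textbook recurrence, not the linear-space or flexible algorithms of the papers cited above.

    def lcs_length(a: str, b: str) -> int:
        """Length of a longest common subsequence of a and b, in O(len(a) * len(b)) time."""
        prev = [0] * (len(b) + 1)          # LCS lengths for the previous row, a[:i-1]
        for i in range(1, len(a) + 1):
            curr = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    curr[j] = prev[j - 1] + 1              # extend a common subsequence
                else:
                    curr[j] = max(prev[j], curr[j - 1])    # drop a symbol from a or from b
            prev = curr
        return prev[len(b)]

    assert lcs_length("xyz", "xzy") == 2   # e.g. "xy" or "xz"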
While the LCS indicates the sequential commonality between strings, it does not necessarily detect the minimum set of changes. More generally, it has been asserted that string metrics that examine symbols sequentially fail to emphasize the global similarity of two strings. See A. Ehrenfeucht and D. Haussler, “A new distance metric on strings computable in linear time”, Discrete Applied Mathematics, 20:191-203, 1988.
In Webb Miller and Eugene W. Myers, “A file comparison program”, Software—Practice and Experience, 15(11):1025-1040, November 1985, the limitations of LCS are demonstrated by means of a new file compare program that executes at four times the speed of the diff program while producing significantly smaller deltas.
In Walter F. Tichy, “The string-to-string correction problem with block moves”, ACM Transactions on Computer Systems, 2(4), November 1984, the edit distance is shown to be a better metric for the difference of files, and techniques based on this method enhanced the utility and speed of file differencing. The edit distance assigns a cost to edit operations such as “delete a symbol”, “insert a symbol”, and “copy a symbol”. For example, one longest common subsequence between the strings xyz and xzy is xy, which neglects the common symbol z. Under the edit distance metric, z may be copied between the two strings, producing a smaller change cost than LCS.
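The xyz/xzy example can be made concrete with a small Python sketch, reusing the lcs_length function from the sketch above. A multiset intersection is used here only as a crude stand-in for copy-based reuse (it ignores Tichy's actual block-move machinery): the LCS reuses two of the three common symbols, whereas copying lets all three be reused, so fewer symbols must be added explicitly.

    from collections import Counter

    source, target = "xyz", "xzy"

    # Sequential model: symbols are reusable only in LCS order.
    reused_by_lcs = lcs_length(source, target)                          # 2, e.g. "xy" or "xz"

    # Copy-based model (crude stand-in for block moves): any common symbol
    # can be reused regardless of its position in the source string.
    reused_by_copy = sum((Counter(source) & Counter(target)).values())  # 3

    added_by_lcs = len(target) - reused_by_lcs    # 1 symbol must still be added explicitly
    added_by_copy = len(target) - reused_by_copy  # 0 symbols need to be added
    assert (added_by_lcs, added_by_copy) == (1, 0)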
In the string-to-string correction problem given in Wagner et al. (supra), an algorithm minimizes the edit distance to minimize the cost of a given string transformation.
In Tichy (supra), the string-to-string correction problem is adapted to file differencing using the concept of block moves.
Ajtai Miklos
Burns Randal Chilton
Fagin Ronald
Stockmeyer Larry Joseph
Brodie R. Bruce
Corrielus Jean M.
Pintner James C.