Method for identifying longest common substrings

Coded data generation or conversion – Digital code to digital code converters

Reexamination Certificate

Rate now

[ 0.00 ] – not rated yet Voters 0 Comments 0

Details Method for identifying longest common substrings Method for identifying longest common substrings

: 2001-03-30
: 2002-03-19
: Young, Brian (Department: 2819)
: Coded data generation or conversion
: Digital code to digital code converters

: C341S051000
: Reexamination Certificate
: active
: 06359574
: ABSTRACT:

FIELD OF INVENTION
The present invention relates to differential data compression techniques in general and in particular to methods for identifying the longest common substring of two strings.
BACKGROUND OF THE INVENTION
Compression of data has long been used for two distinct purposes, to reduce the amount of storage space required to hold data on a storage medium and to reduce the number of bits that must be sent over a communications link to transmit the data One well-known method for compression is to represent the data to be compared in teens of its differences from some reference set of data. Multiple occurrences of a given pattern wit the data to be compressed are replaced with a shorter sequence of data acting as a placeholder for the pattern. A special case of this approach is run-length encoding, which entails abbreviating repeated consecutive occurrences of a given bit pattern by a single occurrence of that pattern plus a count of the number of times the pattern is repeated
Compression encoding techniques that find and encode differences between file versions are known as “delta compression” or “differential compression” techniques. Well known differential compression techniques include the Tichy Block-Move algorithm and the VDELTA algorithm, which may be though of as a combination of the Block-Move and Lempel-Ziv algorithms.
The effectiveness of a compression technique, typically expressed as a factor by which compression reduces the length of the dam often depends on the nature of the data to be compressed, and a method designed for one kind data is not, in general, as effective when applied to other kinds of data. For example, some compression methods that are quite effective when applied to text files are often far less effective when applied to video images, and vice versa. In one technique well-suited for text compression, a target string or symbol sequence on a first computer is compared to a reference string or symbol sequence to identify the longest common substring (LCS) of symbols which are common to both symbol sequences. A sting v is a substring of a string u if u=u′vu″ for some prefix u′ and suffix u″. The target symbol sequence may be reconstructed at a second computer by transmitting a representation of the target symbol sequence including indices identifying the LCS and other substrings unique to the target symbol sequence, rather than transmitting the target symbol sequence itself provided that the second computer also has a copy of the same reference symbol sequence.
The two main challenges facing LCS-based differential compression techniques have been finding an efficient method of identifying LCSs common to target and reference symbol sequences and finding an efficient method of representing the target symbol sequence in terms of LCSs and unique substrings. Unfortunately, LCS techniques known in the art generally apply a generic, byte-by-byte comparison in a single processing thread and with little or no regard to certain file characteristics that might otherwise reduce the number of comparison required
The following U.S. Patents are believed to be representative of the current state of the art of differential data compression methods and apparatus: U.S. Pat. Nos. 5,850,565, 5,977,889, 6,012,063, and 6,104,323.
SUMMARY OF THE INVENTION
The present invention provides a novel method for identifying the LCS of a target symbol sequence and a reference symbol sequence such as may be used in support of differential data compression.
In one aspect of the present invention a method for identifying a longest common substring for a string T and a string R is provided including selecting a registration symbol that appears in both strings R ad T, constructing a first relative distance vector R′ from the appearance of the registration symbol in the string R, constructing a second relative distance vector T′ from the appearance of the registration symbol in the string T, deriving a substring pair R
CS
and T
CS
in the strings R and T respectively for each common substring pair R′
CS
and T′
CS
in the vectors R′ and T′ respectively, and identifying the longest matching of the R
CS
and T
CS
substring pairs as the longest common substring for the string T and the string R.
In another aspect of the present invention the selecting step includes selecting the registration symbol from a group of symbols that appear in either of the strings R and T an average number of times as compared with all other symbols that appear in either of the strings R and T.
In another aspect of the present invention the selecting step includes selecting the most randomly distributed symbol among the group of symbols as the registration symbol.
In another aspect of the present invention the selecting step includes selecting the registration symbol from a group of symbols that appear in either of the strings R and T within a predefined tolerance of an average number of times as compared with all other symbols that appear in either of the strings R and T.
In another aspect of the present invention the selecting step includes selecting the most randomly distributed symbol among the group of symbols as the registration symbol.
In another aspect of the present invention the constructing steps include constructing the first vector R′ from the position value of the first appearance of the registration symbol in the string R, and thereafter from the relative distance values between subsequent appearances of the registration symbol in the string R, and constructing the second vector T′ from the position value of the first appearance of the registration symbol in the string T, and thereafter from the relative distance values between subsequent appearances of the registration symbol in the string T.
In another aspect of the present invention the deriving step includes identifying the common substring pair R′
CS
and T′
CS
in the vectors R′ and T′ respectively using an LCS technique.
In another aspect of the present invention the deriving step includes calculating the offset of R
CS
in R by summing the values in R′ that precede R′
CS
, calculating the offset of T
CS
in T by summing the values in T′ that precede T′
CS
, and calculating the length of R
CS
in R and T
CS
in T by summing all the values in either of R′
CS
and T′
CS
.
In another aspect of the present invention the identifying step includes correspondingly adjusting the length of any of the substring pairs R
CS
and T
CS
until a longest possible common superstring is determined for the substring pair R
CS
and T
CS
.
In another aspect of the present invention the selecting step includes constructing a hash function h
k
j
=(d
k
j
mod z) for a string s where s is either of the strings R and T, k is a character appearing in the string s, k being a member of a character set having z possible members, d
k
is a vector of the relative distances between consecutive appearances of k in s, and j is an index of a single element in vector d
k
, for each value of k in s, generating a histogram h
k
of the hash functions of the relative distances d
k
j
between appearances of k in s, where d
k
j
is the relative distance between consecutive appearances j and j+1 of k in s, computing the randomness f
k
of the distribution of each k in s as
f
k
=
σ
k
N
k
where N is the number of appearances of k in s and &sgr; is the standard deviation of the histogram h
k
, and selecting the character k for which f
k
is minimal as the registration symbol.
It is appreciated throughout the specification and claims that the terms “string,” “substring,” and “symbol sequence” refer to any sequence of bits, bytes, words, or any other sequence of information coding units.
The disclosures of all patents, patent applications, and other publications mentioned in this specification and of the patents, patent applications, and other publications cited therein are hereby incorporated by reference in their entirety

Affiliated with

Yariv Shalom

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Also associated with

AlphaPatent Associates Ltd.

Law Firm

[ 0.00 ] – not rated yet Voters 0 Comments 0

Nguyen John

Examiner

[ 0.00 ] – not rated yet Voters 0 Comments 0

Proxell Systems Ltd.

Corporate Assignee

[ 0.00 ] – not rated yet Voters 0 Comments 0

Swirsky Daniel J.

Attorney

[ 0.00 ] – not rated yet Voters 0 Comments 0

Young Brian

Examiner

[ 0.00 ] – not rated yet Voters 0 Comments 0

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method for identifying longest common substrings does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method for identifying longest common substrings, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method for identifying longest common substrings will most certainly appreciate the feedback.

Rate now

Comments { 0 }

Profile ID: LFUS-PAI-O-2840215

All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.

Canada

Charities
Companies
MP Candidates
Patents
Employee Salary Disclosure

World

Places of the World
Scientific Papers

United States

Banks
Companies
Counties
Patents
Employee Salary Disclosure