Data processing: artificial intelligence – Machine learning – Genetic algorithm and genetic programming system
Reexamination Certificate
1998-12-21
2001-07-10
Powell, Mark R. (Department: 2122)
Data processing: artificial intelligence
Machine learning
Genetic algorithm and genetic programming system
C706S045000, C382S243000, C382S232000
Reexamination Certificate
active
06260031
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to the field of data storage, and in particular to the compaction of data and programming code for efficient storage.
2. Description of Related Art
In this “information age”, the amount of information that is being stored or communicated continually increases. Despite advances in storage and communications technologies, however, a continuing need exists for optimizing the efficiency of information storage and communication by compacting the information itself. For example, physical space in a portable computing device is a premium resource, and has a practical limit. The minimization of storage requirements for application programs and data files will particularly benefit devices and systems having limited storage resources, such as portable computing devices, communications devices, electronic appliances, and the like. In a similar manner, a limited bandwidth communication channel can be more efficiently utilized by reducing the number of data elements required to communicate a given amount of information.
One method for compacting information within a data set is a dictionary lookup scheme. Symbols are defined in a dictionary to represent sequences that are repeated within the data set. Each occurrence of the sequence in the data set is replaced by the symbol representing the sequence. If the substituted symbol and the overhead associated with the dictionary consumes less storage space than the original occurrences of the sequences, then a savings in storage space is achieved.
FIG. 1
illustrates such a compaction. The uncompacted data set
100
is converted to a compacted data set
110
by creating a macro dictionary
120
that associates unique symbols A, B, C, and D to represent sequences of symbols ‘ec’, ‘aab’, ‘cad’, and ‘dab’, respectively, in the uncompacted sequence
100
. Consistent with common terminology, the sequences of symbols that are replaced by another symbol are termed macros; the process of replacing macro sequences by their macro symbol is termed tiling. The macro A, corresponding to sequence ‘ec’
101
, is identified as the first sequence in the data set
100
that is repeated, at
101
′. The macro B, corresponding to sequence ‘aab’
102
, is identified as the next sequence in the data set
100
that is repeated, at
102
′. The macro C, corresponding to sequence ‘cad’
103
, is identified as the next sequence in the data set
100
that is repeated, at
103
′. The macro D, corresponding to sequence ‘dab’
104
, is identified as the next sequence in the data set
100
that is repeated, at
104
′,
104
″. The original data set
100
is converted to the compacted data set
110
by replacing each occurrence
101
,
101
′,
102
,
102
′,
103
,
103
′,
104
,
104
′,
104
″ of the macro sequences ‘ec’, ‘aab’, ‘cad’, and ‘dab’, with their corresponding macro symbols A, B, C, and D, respectively. The original data set
100
is recoverable from the compacted data set
110
and the macro dictionary
120
by replacing each occurrence of the macro symbols A, B, C, D, with the corresponding macro sequences ‘ec’, ‘aab’, ‘cad’, and ‘dab’.
The original data set
100
contains 28 symbols. The compacted data set
110
contains 12 symbols and the dictionary
120
contains 11 symbols. Assuming that the storage requirements for the data sets
100
,
110
and dictionary
120
are directly proportional to the number of symbols being stored, the compaction of
FIG. 1
results in an 18% storage requirement reduction ((28−(12+11))/28). In like manner, if the original data set
100
is intended for communication via a limited bandwidth communications channel, the transmission of the compacted data set
100
and dictionary
120
will result in an 18% bandwidth reduction.
As is evident to one of ordinary skill in the art, the aforementioned reduction in storage and communication requirements depends upon the selection of sequences that are replaced by macro symbols. For example, if macro sequence ‘ec’
101
,
101
′ had not been selected for replacement by the macro symbol A, a macro sequence ‘caab’ could have been formed by including the symbol c in the macro sequences
102
,
102
′, resulting in a reduction in the number of symbols in the dictionary
120
and compacted data set
110
. Other macro sequence groupings may result in an increase or decrease in the number of symbols in the dictionary
120
or compacted data set
110
.
A common method of determining which sequences to select as macros is based on a “greedy” algorithm. A worth figure is associated with each possible macro sequence, and the greedy algorithm, as its name implies, selects the sequence having the greatest worth. Thereafter, the next macro sequence having the greatest marginal worth is selected, and so on.
FIG. 2
illustrates an example determination of worth for repeated sequences in the data set
100
. Each repeated sequence of up to four symbols in the data set
100
is evaluated for worth, independent of all other sequences. Consider, for example, the sequence ‘ec’, which occurs twice
101
,
101
′ in the data set
100
. Placing this sequence in the dictionary
120
requires two symbols; replacing the occurrences
101
,
101
′ of the macro sequence ‘ec’ with the macro symbol A saves one symbol per occurrence
101
,
101
′. Thus, the worth of using the sequence ‘ec’ as a macro is zero (2 occurrences *1 symbol savings per occurrence-2 symbols in the dictionary), as illustrated in the line
201
of FIG.
2
. The sequence ‘ca’, on the other hand, occurs four times in the data set
100
, and has a worth of two (4 occurrences *1 symbol savings per occurrence-2 symbols in the dictionary), as illustrated in the line
202
of FIG.
2
. In like manner, illustrated in the line
205
of
FIG. 2
, the sequence ‘abca’ occurs three times in the data set
100
, and has a worth of five (3 occurrences *3 symbol savings per occurrence-4 symbols in the dictionary). As would be evident to one of ordinary skill in the art, if the macro symbol is smaller or larger in size than the original symbols in the data set, or the storage of the original symbols in the dictionary requires more or less storage resources, the determination of the worth of each sequence will be affected correspondingly.
The “worth” of each sequence can be defined in a number of ways. In the above examples, the worth is defined solely as the reduction in storage requirements. The worth of a macro may also be based, for example, upon the difficulty or time penalty associated with replacing certain sequences, and so on. That is, for example, short sequences may be easier or quicker to replace than long sequences, and thus the worth of longer sequences may be attenuated, as compared to a worth based solely on storage requirements. However the worth is defined, the greedy algorithm is used to attempt to maximize the defined worth of the compaction process by selecting the sequences having the maximum defined worth. As is known to one of ordinary skill in the art, however, selecting sequences having maximum worth does not necessarily guarantee that the final solution will produce a maximum worth.
FIG. 3
illustrates the results of applying the greedy algorithm to the data set
100
using the defined worth determinations illustrated in FIG.
2
. The sequence ‘abca’ has the maximum worth, as defined above, with a symbol savings of
5
. Replacing each occurrence
301
,
301
′,
301
″ of ‘abca’ with macro symbol A
311
,
311
′,
311
″ in the compacted data set
310
results in a 5 symbol reduction in the total number of symbols (19 symbols in the compacted data set
310
, plus 4 symbols to store the sequence ‘abca’ in a dictionary (not shown)). In accordance with the greedy algorithm, the repeated sequences in the compacted data set
310
are then identified and evaluated. Note that the worth of a given sequence may be different when applied to a compacted data set t
Eshelman Larry J.
Mathias Keith E.
Schaffer J. David
Philips Electronics North America Corp.
Powell Mark R.
Starks Wilbert L.
Thorne Gregory L.
LandOfFree
Code compaction by evolutionary algorithm does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Code compaction by evolutionary algorithm, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Code compaction by evolutionary algorithm will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2546617