Coded data generation or conversion – Digital code to digital code converters – Unnecessary data suppression
Reexamination Certificate
2003-02-07
2004-06-15
Jeanglaude, Jean (Department: 2819)
Coded data generation or conversion
Digital code to digital code converters
Unnecessary data suppression
C341S051000
Reexamination Certificate
active
06750791
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to compression of text-based data or protocol data and, more specifically, to compression of data to be transmitted over networks having bandwidth limited communication links.
BACKGROUND OF THE INVENTION
In general, data compression converts data defined in a given format to another format containing fewer data bits than the original format. When the original data is needed, the compressed data is decompressed to restore the data to the original format. Data compression can be classified as lossy and lossless. As the term suggests, data is preserved during compression and decompression in a lossless method. Unlike lossless compression, lossy compression refers to methods in which the decompressed data is not exactly the same as the original data. Lossless data compression algorithms can be classified into dictionary coding and statistical coding types. The present invention is related to the dictionary coding type. The most widely used dictionary coding algorithms are the Lempel-Ziv algorithms and their variants. In particular, the LZ77 algorithm refers to the compression method as disclosed in Ziv et al. (“A Universal Algorithm for Sequential Data Compression”,IEEE Transactions on Information Theory, Vol. IT-23, No.3, May 1977, pp.337-343) and the LZ78 refers to the method as disclosed in Ziv. et al. (“Compression of Individual Sequences via Variable Rate Coding”,IEEE Transactions on Information Theory, Vol. IT-24, No.5, September 1978, pp.530-535). LZ77 is based on the principle of replacing a repeated sequence of characters by a reference to an earlier occurrence of the sequences by a pointer. LZ78 parses a stream of input data characters into coded values based on an adaptively growing reference source, such as a look-up table or dictionary, for string matching.
In many text-based applications and application protocols, some or all of the text data are case insensitive. In case-insensitive text data, the semantics of the text data are the same regardless of the texts being represented in lowercase or uppercase. For example, most HTTP (Hypertext Transfer Protocol) message header fields, including HTTP keywords, are case-insensitive. Thus, the URI (Uniform Resource Identifier) http://Nokia.com is equivalent to the URI HTTP://NOKIA.COM or http:/
okia.com. In addition, if an HTTP message body contains an HTML (HyperText Markup Language) document, all element names and attribute names in the HTML documents are case insensitive. The Lempel-Ziv algorithms treat all data as pure bytes in their string match logic and, therefore, do not address the issue relating to case sensitivity. As a consequence, a string of texts in the input data will not be compressed even though the dictionary already contains the same string with a different case configuration. This not only limits the compression ratio, but also causes waste in the use of the dictionary and its memory storage. In particular, the problem affects compression performance in two scenarios: 1) a pre-populated dictionary with protocol or application specific data, such as keywords; and 2) application data generated by one source as dictionary to compress same application data generated by another source.
It is thus advantageous and desirable to provide a method and device for improving the compression ratio in text-based data compression.
SUMMARY OF THE INVENTION
It is a primary objective of the present invention to provide a method and device for efficiently compressing text-based data or protocol data.
According to the first aspect of the present invention, there is provided a method of coding communication data in a form of data segments for providing encoded data. The method is characterized by
finding a match for a data segment in a reference source; and
compressing the data segment for providing a compressed data segment in the encoded data if the match is found, wherein said finding is carried out in a manner based on case sensitivity of the data segment.
A data segment comprises at least one data unit and is either case sensitive or case insensitive. The data segment is case insensitive if said at least one data unit is case insensitive.
Advantageously, the method is further characterized by
modifying the data segment if the data segment is case insensitive for providing a modified data segment so as to allow said finding to be based on a match for the modified data segment in the reference source.
When the data segment comprises a text string and each of said at least one data unit comprises a character, the character is either in a first case configuration or in a second case configuration, said method characterized in that said finding is carried out in a manner as if the data segment is case insensitive, and the method is further characterized by
providing in the encoded data information indicative of the case configuration of the character in each of said at least one data unit so as to allow the compressed data segment to be decompressed based on the provided information.
When the text string comprises a number of characters, and the provided information comprises a number of data bits, each corresponding to one of said number of characters, each data bit is assigned a value indicative of whether the corresponding character is in the first case configuration or in the second case configuration.
Alternatively, the provided information comprises a code having a value indicative of the character in each of said at least one data unit being in a first case configuration. The first case configuration is lowercase and the second case configuration is uppercase. It is also possible that the first case configuration is uppercase and the second case configuration is lowercase.
Advantageously, the text string comprises a plurality of characters including a leading character and at least one following character, and the provided information comprises a code indicative of only the leading character in the text string being in a first case configuration.
According to the second aspect of the present invention, there is provided a compressor for encoding communication data in a form of data segments for providing encoded data. The compressor is characterized by
a reference source;
a comparison means for finding a match of a data segment in the reference source; and
an encoding module for compressing the data segment if the match is found for providing a compressed data segment in the encoded data, wherein the comparison means has a matching algorithm for finding the match based on case sensitivity of the data segment.
The compressor is further characterized by
a parser, responsive to the data segment, for determining the case sensitivity of the data segment and for providing information indicative of the case sensitivity of the data segment to the comparison means, and the provided information is contained in a data flag conveyed to the comparison means.
Advantageously, the data segment comprises at least one character and each of said at least one character is either in a first case configuration or in a second case configuration. The compressor is further characterized by
a transformer module, responsive to the data segment, for converting each of said at least one character in the data segment to a first case configuration for providing a case-transformed data segment to the comparison means so as to allow the comparison means to find the match in the reference source based on the case-transformed data segment.
Alternatively, when the data segment comprises at least one character and each of said at least one character is either in a first case configuration or in a second case configuration, the compressor is characterized in that
the compressor finds the match of the data segment in the reference source as if the data segment is case insensitive, and that
information indicative of the case configuration of each of said at least one character is provided in the encoded data so as to allow the compressed data segment from the encoded data to be decompressed bas
Jeanglaude Jean
Nokia Corporation
Ware Fressola Van Der Sluys & Adolphson LLP
LandOfFree
Method and device for text data compression does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and device for text data compression, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and device for text data compression will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3335756