Data compression apparatus

Pulse or digital communications – Bandwidth reduction or expansion

Reexamination Certificate

Details

C341S051000, C341S087000, C341S106000, C707S793000

active

06304601

ABSTRACT:

FIELD OF THE INVENTION
The present invention concerns the handling of data and in particular the handling and compression of text data.
BACKGROUND OF THE INVENTION
Every year the amount of data electronically stored and then accessed by users grows substantially. One example of this has been driven by the increased viability of high-density optical storage discs. There are now many organisations which send out data in the form of optical discs at regular intervals. The data can either be data which has only recently become available to the general public, such as newly published or granted patent specifications, or already existing publications which have been collated.
It will be appreciated that whilst modern technology allows a user to scan rapidly through the contents of an optical disc, problems arise when it is required to scan through the contents of a large number of such discs. An acute example of how such a problem can arise is when users have optical disc databases of patent information which are updated on a monthly basis in response to the publication of pending or granted patent specifications. Whilst it may well be advantageous for a user to scan through one or more discs for relevant information, any attempt to extract data covering a greater period of time becomes labour intensive. A solution to this problem is to down-load the text stored on a recently received optical disc and combine this text with the previously received text in a single large database.
It will be appreciated that when a single optical disc can hold the contents of 10,000 substantial documents such as patent specifications, a single database holding all this information must have a very substantial capacity. It has accordingly become quite common to store data, and in particular textual information, in compressed form. Normally text is stored on the optical disc in the widely accessible ASCII format. Text compression algorithms are known and can reduce the storage requirements for large quantities of text originally stored in ASCII format by as much as 70%.
Some compression algorithms are known as “lossy” because the actual format and layout of the text are lost on decompression. Other algorithms maintain the format of the text. However, both types of known compression algorithm have the drawback that it is very difficult to index nested sections within the complete text and to decompress these sections alone. It can thus be seen that any user of a database has to meet two requirements which at present conflict. Either the data can be stored in uncompressed form, so that it can be readily indexed but takes up a substantial amount of expensive storage capacity, or the data can be compressed so as to reduce the required storage capacity, with the attendant problem that the data is then difficult to access and extract.
SUMMARY OF THE INVENTION
Accordingly, the present invention is concerned with providing a method of compressing textual data which both provides substantial compression and allows the compressed data to be indexed in such a manner that sections of data can be readily accessed and decompressed by a user. The present invention is also concerned with providing a signal format which enables textual data to be rapidly and efficiently transmitted from one location to another.
In accordance with a first aspect of the invention there is provided apparatus for compressing text comprising:
means for splitting a main character string into component strings wherein the splitting means in operation splits the main character string in two stages; a first stage in which the main character string is split into strings of multiple spaces which represent part of the final component strings and strings which include single spaces, words and punctuation, and a second stage in which the non-multiple space strings are split in accordance with a splitting algorithm into words, punctuation and single spaces which represent the remainder of the component strings;
means for counting the frequency of occurrence of each component string in the main character string and ordering the component strings in their frequency of occurrence;
means for allocating to each component string apart from single spaces a token value representative of the component string and determined by the frequency of occurrence of the component string;
means for storing the token values so allocated as a token table;
means for allocating to each component string in the main character string the token value for that component string from the token table to generate a sequence of token values representing the main character string in a compressed format; and
means for storing the sequence of token values, and wherein said splitting algorithm enables the original document to be reconstituted faithfully including the single spaces which have effectively been discarded.
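By way of illustration only, the steps recited above might be sketched in Python as follows. The sketch is not taken from the specification: the function names, the regular expressions used for the two splitting stages, the use of frequency rank as the token value, and the reserved token value 0 (a hypothetical "join" marker recording that two components touch with no space between them) are all assumptions made for the example. A single space lying between two non-space components is given no token; as a further assumption, a single space at the very start or end of the text is kept as an ordinary component so that nothing is lost.

```python
import re
from collections import Counter

JOIN = 0  # hypothetical reserved token value: "no space between neighbours"


def split_components(text):
    """Two-stage split of the main character string into component strings."""
    components = []
    # Stage 1: separate runs of two or more spaces from everything else.
    # The capturing group makes re.split keep the runs in its result.
    for chunk in re.split(r"( {2,})", text):
        if not chunk:
            continue
        if chunk.startswith("  "):
            components.append(chunk)  # a multiple-space run is a final component
        else:
            # Stage 2: words, single spaces, and any other single character.
            components.extend(re.findall(r"\w+| |[^\w ]", chunk))
    return components


def is_spaces(component):
    """True for components made up only of space characters."""
    return component.strip(" ") == ""


def elide_single_spaces(components):
    """Drop interior single spaces; mark direct contact with None (JOIN)."""
    stream = []
    last = len(components) - 1
    for i, comp in enumerate(components):
        if comp == " " and 0 < i < last:
            continue  # implied by the absence of a JOIN marker
        if i > 0 and not is_spaces(comp) and not is_spaces(components[i - 1]):
            stream.append(None)  # the two components touch: emit JOIN
        stream.append(comp)
    return stream


def compress(text):
    """Return (token_table, token_sequence) for the given text."""
    stream = elide_single_spaces(split_components(text))
    counts = Counter(item for item in stream if item is not None)
    # Most frequent component first; index 0 is reserved for JOIN.
    token_table = [None] + [comp for comp, _ in counts.most_common()]
    value_of = {comp: v for v, comp in enumerate(token_table) if comp is not None}
    tokens = [JOIN if item is None else value_of[item] for item in stream]
    return token_table, tokens


table, tokens = compress("the cat, the dog  and the cat")
print(table)   # [None, 'the', 'cat', ',', 'dog', '  ', 'and']
print(tokens)  # [1, 2, 0, 3, 1, 4, 5, 6, 1, 2]
```

In this sketch the most frequent component ("the") receives the smallest token value, and the five single spaces in the input produce no tokens at all.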
In accordance with a second aspect of the invention there is provided a method of compressing text comprising:
splitting a main character string into component strings wherein the splitting operation splits the main character string in two stages; a first stage in which the main character string is split into strings of multiple spaces which represent part of the final component strings and strings which include single spaces, words and punctuation, and a second stage in which the non-multiple space strings are split in accordance with a splitting algorithm into words, punctuation and single spaces which represent the remainder of the component strings;
counting the frequency of occurrence of each component string in the main character string and ordering the component strings in their frequency of occurrence;
allocating to each component string apart from single spaces a token value representative of the component string and determined by the frequency of occurrence of the component string;
allocating to each component string in the main character string the token value for that component string from the token table to generate a sequence of token values representing the main character string in a compressed format; and
storing the sequence of token values, and wherein said splitting algorithm enables the original document to be reconstituted faithfully including the single spaces which have effectively been discarded.
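One possible reason for tying token values to frequency of occurrence is that small values can be stored in fewer bytes. The specification as reproduced here does not say how the sequence of token values is stored, so the following Python sketch is purely an assumption: it uses a base-128 variable-length integer encoding, in which token values below 128 (the most frequent component strings, if values are allocated by frequency rank) occupy one byte each. The function names are hypothetical.

```python
def encode_tokens(values):
    """Pack non-negative token values as base-128 variable-length integers."""
    out = bytearray()
    for value in values:
        while value >= 0x80:
            out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            value >>= 7
        out.append(value)  # final byte has the continuation bit clear
    return bytes(out)


def decode_tokens(data):
    """Inverse of encode_tokens."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes follow for this value
        else:
            values.append(current)
            current, shift = 0, 0
    return values


tokens = [1, 2, 0, 3, 300]
packed = encode_tokens(tokens)
print(len(packed))            # 6: four one-byte values plus one two-byte value
print(decode_tokens(packed))  # [1, 2, 0, 3, 300]
```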
Other aspects of the invention include apparatus and a method for decompressing text; apparatus for both compressing and decompressing text; compressed text in the form of a signal which can be either optical or electronic; and a storage medium on which is stored text compressed in accordance with the present invention.
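Decompression can be illustrated with a similarly hedged Python sketch. It assumes a hypothetical convention that is not stated in the text above: the token table is a list indexed by token value, token value 0 is a reserved "join" marker meaning that the neighbouring components touch directly, and a single space is otherwise reinstated between any two adjacent components unless one of them consists solely of spaces. The table and token sequence below are made-up example data.

```python
JOIN = 0  # hypothetical reserved token value: "no space between neighbours"


def is_spaces(component):
    """True for components made up only of space characters."""
    return component.strip(" ") == ""


def decompress(token_table, tokens):
    """Rebuild the text, reinstating the single spaces that were not stored."""
    parts = []
    previous = None   # last component written out
    joined = False    # True when a JOIN marker precedes the next component
    for value in tokens:
        if value == JOIN:
            joined = True
            continue
        component = token_table[value]
        if (previous is not None and not joined
                and not is_spaces(previous) and not is_spaces(component)):
            parts.append(" ")  # the implied single space
        parts.append(component)
        previous, joined = component, False
    return "".join(parts)


token_table = [None, "the", "cat", ",", "dog", "  ", "and"]
tokens = [1, 2, 0, 3, 1, 4, 5, 6, 1, 2]
print(repr(decompress(token_table, tokens)))
# 'the cat, the dog  and the cat'
```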
It will, of course, be understood that optical discs are only one storage medium and that many other storage media are available for storing both uncompressed and compressed data. Additionally, compression techniques can be advantageous when data has to be transmitted either over fixed lines, which may be optical fibres, or via radio.


REFERENCES:
patent: 4511758 (1985-04-01), Konishi et al.
patent: 4955066 (1990-09-01), Notenboom
patent: 5023610 (1991-06-01), Rubow et al.
patent: 5111398 (1992-05-01), Nunberg et al.
patent: 5151697 (1992-09-01), Bunton
patent: 5224038 (1993-06-01), Bespalko
patent: 5353024 (1994-10-01), Graybill
patent: 5410671 (1995-04-01), Elgamal et al.
patent: 5561421 (1996-10-01), Smith et al.
patent: 5771010 (1998-06-01), Masenas
patent: 5890103 (1999-03-01), Carus
patent: 5933104 (1999-08-01), Kimura
patent: 0199035 (1986-10-01), None
patent: WO 88/09586 (1988-12-01), None
White: “Printed English Compression by Dictionary Encoding”, IEEE, vol. 55, No. 3, Mar. 1967, pp. 390-396.
