OCR-based image compression

Image analysis – Pattern recognition – Classification

Reexamination Certificate

Details

C382S218000, C382S317000, C382S321000, C382S233000, C358S426160

active

06487311

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to data compression in general, and in particular to lossy data compression systems.
BACKGROUND OF THE INVENTION
Data compression systems are well-known in the art. One well-known way of classifying data compression systems includes the categories of “lossy” and “lossless” compression systems. In lossy compression systems, when an original image is compressed and then decompressed, the resulting image has lost some information and is not identical to the original image. In lossless compression systems, by contrast, compression and decompression restore the original image without any loss. Today, many data compression systems are used for compressing images, although compression of non-image data is also well-known in the art.
Compression ratio is a well-known measure of the efficiency of a data compression system. Generally, compression ratio is taken as equal to the ratio of the size of the uncompressed image to the size of the compressed image; thus, a higher compression ratio is, all other things being equal, taken to be better than a lower compression ratio. The compression ratio of general-purpose lossless algorithms is believed to be asymptotically bounded by the ratios achieved by Ziv-Lempel type algorithms. For binary images, that is, images having 2-valued pixels usually representing “black” and “white”, compression ratio is believed to be bounded by the ratios achieved by G4 type algorithms.
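The ratio defined above can be illustrated with a minimal sketch. It uses Python's standard `zlib` module, whose DEFLATE format is derived from Ziv-Lempel (LZ77) coding, purely as an example of a general-purpose lossless compressor; the helper name `compression_ratio` and the sample data are assumptions made for illustration and do not come from the specification.

```python
import zlib

def compression_ratio(uncompressed: bytes, compressed: bytes) -> float:
    """Size of the uncompressed data divided by the size of the compressed data."""
    if not compressed:
        raise ValueError("compressed data must be non-empty")
    return len(uncompressed) / len(compressed)

# Illustrative data only: a highly repetitive byte pattern.
raw = bytes([0, 255] * 2000)
packed = zlib.compress(raw)
ratio = compression_ratio(raw, packed)

print(f"{len(raw)} bytes -> {len(packed)} bytes, ratio {ratio:.1f}")
```

Because the data is lossless-compressed, `zlib.decompress(packed)` returns exactly `raw`; the ratio only measures size, not fidelity.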
Prior art image compression methods are described in A. N. Netravali and J. O. Limb, “Picture coding: A review”, Proc. IEEE, vol. 68, pp. 366-406, March 1980.
The disclosures of all references mentioned above and throughout the present specification are hereby incorporated herein by reference.
SUMMARY OF THE INVENTION
The present invention seeks to provide an improved system of data compression, particularly suited to the compression of images including text data.
The apparatus and method of the present invention use content specific information. Specifically, in the present invention a digital document, also known herein as a “digital image” or “digitized image”, is passed through an OCR (optical character recognition) process. Classes of similar characters in the document are identified and, if the characters actually have sufficiently similar shapes, all of the characters in each class are replaced with one character template and an indication of the location of each character, with the template and the location information being stored. Characters thus stored are removed from the image, resulting in a residual image, which is also stored. Typically, both the template and the residual image on the one hand, and the location information on the other hand, are separately compressed, typically using conventional techniques, before storage.
There is thus provided in accordance with a preferred embodiment of the present invention a method for compressing a digitized image of a document, the method including performing optical character recognition (OCR) on the digitized image, identifying, based, at least in part, on a result of the performing step, a plurality of classes of characters included in the image, each class of characters having an associated character value and including at least one character, pruning each class of characters, thereby producing information describing the plurality of classes of characters, and a residual image, and utilizing the information describing the plurality of classes of characters and the residual image as a compressed digitized image in further processing.
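The identifying step can be pictured as grouping OCR output by character value. The sketch below is one possible reading, not the method of the specification: the `OcrHit` record and its bounding-box fields are hypothetical, since the text does not say what form the OCR result takes.

```python
from collections import defaultdict
from typing import NamedTuple

class OcrHit(NamedTuple):
    """One recognized character (hypothetical OCR output record)."""
    value: str  # character value reported by OCR
    x: int      # left edge of the bounding box, in pixels
    y: int      # top edge of the bounding box, in pixels
    w: int      # bounding-box width
    h: int      # bounding-box height

def group_into_classes(hits):
    """Group OCR hits so that every class shares one character value."""
    classes = defaultdict(list)
    for hit in hits:
        classes[hit.value].append(hit)
    return dict(classes)

hits = [
    OcrHit("e", 10, 5, 8, 10),
    OcrHit("t", 20, 5, 6, 12),
    OcrHit("e", 40, 5, 8, 10),
]
classes = group_into_classes(hits)
```

Each resulting class has an associated character value (the dictionary key) and at least one member, which is the shape the later pruning step operates on.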
Further in accordance with a preferred embodiment of the present invention the digitized image includes a binary image.
Still further in accordance with a preferred embodiment of the present invention the utilizing step includes storing the information describing the plurality of classes of characters and the residual image.
Additionally in accordance with a preferred embodiment of the present invention the utilizing step includes transmitting the information describing the plurality of classes of characters and the residual image.
Moreover in accordance with a preferred embodiment of the present invention the utilizing step includes compressing the residual image.
Further in accordance with a preferred embodiment of the present invention the utilizing step includes compressing the information describing the plurality of classes of characters.
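Compressing the residual image and the class information as separate streams might look like the following sketch. The fixed-width record layout, the helper names, and the use of `zlib` for both streams are assumptions for illustration; `zlib` simply stands in for whichever conventional technique is chosen.

```python
import struct
import zlib

RECORD = "<HHH"  # assumed layout: code, x, y as unsigned 16-bit little-endian
RECORD_SIZE = struct.calcsize(RECORD)

def pack_locations(locations):
    """Serialize (code, x, y) triples as fixed-width records."""
    return b"".join(struct.pack(RECORD, code, x, y) for code, x, y in locations)

def unpack_locations(blob):
    """Inverse of pack_locations."""
    return [struct.unpack_from(RECORD, blob, off)
            for off in range(0, len(blob), RECORD_SIZE)]

locations = [(101, 10, 5), (116, 20, 5), (101, 40, 5)]
residual_bits = bytes(64)  # illustrative all-white residual, 8 pixels per byte

# The two streams are compressed independently of each other.
compressed_locations = zlib.compress(pack_locations(locations))
compressed_residual = zlib.compress(residual_bits)
```

Keeping the streams separate lets each one use a coder suited to its statistics, since coordinate records and bitmap data have very different redundancy.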
Still further in accordance with a preferred embodiment of the present invention the pruning step includes producing and storing template-location information describing each class of characters, and erasing each character included in each class of characters from the scanned image, thereby producing a residual image.
Additionally in accordance with a preferred embodiment of the present invention the producing and storing step includes identifying a template image for the class of characters, storing the template image, and storing image-location information for each character included in the class of characters.
Moreover in accordance with a preferred embodiment of the present invention the step of storing image-location information includes storing an identifying code for each character included in the class of characters, and storing location information for each character included in the class of characters.
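The producing, storing, and erasing steps described above can be sketched as follows. The binary image is modeled as a list of rows of 0/1 pixels (1 being black), and the first member of a class is taken as its template; both choices, along with the function names and the record layout, are assumptions, as the text does not specify how the template is chosen.

```python
def crop(image, x, y, w, h):
    """Copy the w-by-h region whose top-left corner is (x, y)."""
    return [row[x:x + w] for row in image[y:y + h]]

def erase(image, x, y, w, h):
    """Set the region to white (0) in place."""
    for row in image[y:y + h]:
        row[x:x + w] = [0] * len(row[x:x + w])

def store_class(image, code, boxes):
    """Build template-location information for one class, then erase its members.

    boxes is a list of (x, y, w, h) bounding boxes, one per character.
    """
    x, y, w, h = boxes[0]
    record = {
        "code": code,                          # identifying code for the class
        "template": crop(image, x, y, w, h),   # first member as template (assumption)
        "locations": [(bx, by) for bx, by, _, _ in boxes],
    }
    for box in boxes:
        erase(image, *box)
    return record

image = [
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
record = store_class(image, ord("e"), [(0, 0, 2, 2), (4, 0, 2, 2)])
residual = image  # what remains after erasing is the residual image
```

In this toy case the residual keeps only the stray pixel at the end of the second row, which belongs to no class.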
Further in accordance with a preferred embodiment of the present invention the identifying code includes at least one of the following: a standard character encoding code, the standard character encoding code being based, at least in part, on a result of the performing OCR step, and a customized character code, the customized character code being based, at least in part, on a result of the performing OCR step.
Still further in accordance with a preferred embodiment of the present invention the pruning step includes for each one of the plurality of classes of characters performing a shape-matching comparison test between at least two characters included in the one class, and removing from the one class characters which fail the shape-matching comparison test.
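The text does not say which shape-matching comparison test is used. As a hedged illustration, the sketch below compares every member of a class against the first member and removes those whose fraction of differing pixels exceeds a threshold; the metric, the threshold value, and the function names are all assumptions.

```python
def mismatch_fraction(a, b):
    """Fraction of differing pixels between two bitmaps; 1.0 if sizes differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 1.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total

def prune_class(bitmaps, max_mismatch=0.1):
    """Split a class into members that match the reference and members that fail."""
    reference = bitmaps[0]  # first member as reference (assumption)
    kept, removed = [], []
    for bmp in bitmaps:
        if mismatch_fraction(reference, bmp) <= max_mismatch:
            kept.append(bmp)
        else:
            removed.append(bmp)
    return kept, removed

ring = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
ring_copy = [row[:] for row in ring]
diagonal = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # e.g. an OCR misclassification

kept, removed = prune_class([ring, ring_copy, diagonal])
```

Members that fail the test stay in the image unless some other class picks them up, so a misrecognized character costs compression efficiency rather than correctness.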
Additionally in accordance with a preferred embodiment of the present invention the method also includes aggregating a plurality of characters into at least one additional class of characters, each additional class of characters being associated with a customized character code, wherein the pruning step and the storing step also operate on the at least one additional class of characters.
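One way to picture the aggregating step is to group characters that belong to no existing class by shape and give each group its own code. The sketch below groups only pixel-identical bitmaps, which is a deliberate simplification, and draws customized codes from the Unicode Private Use Area starting at U+E000; the grouping rule, the code range, and the names are illustrative assumptions.

```python
CUSTOM_CODE_BASE = 0xE000  # start of the Unicode Private Use Area (illustrative choice)

def aggregate_leftovers(bitmaps):
    """Group pixel-identical bitmaps into additional classes.

    Returns a dict mapping each customized code to the indices of its members.
    """
    code_for_shape = {}
    classes = {}
    for index, bmp in enumerate(bitmaps):
        key = tuple(tuple(row) for row in bmp)  # hashable form of the bitmap
        if key not in code_for_shape:
            code_for_shape[key] = CUSTOM_CODE_BASE + len(code_for_shape)
        classes.setdefault(code_for_shape[key], []).append(index)
    return classes

bar = [[1, 1, 1]]
dot = [[1]]
extra_classes = aggregate_leftovers([bar, dot, [[1, 1, 1]]])
```

A practical system would more plausibly group by a tolerance-based comparison rather than exact equality, so that scanning noise does not split one glyph into many classes.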
Moreover in accordance with a preferred embodiment of the present invention the method also includes scanning the document to produce the digitized image.
Further in accordance with a preferred embodiment of the present invention the compressing the residual image step includes compressing the residual image using a G4 compression method.
There is also provided in accordance with another preferred embodiment of the present invention a compressed digital image including information describing a plurality of classes of characters, and a residual image.
Further in accordance with a preferred embodiment of the present invention the information describing each of the plurality of classes of characters includes template information and image-location information.
Still further in accordance with a preferred embodiment of the present invention the information includes compressed information.
Additionally in accordance with a preferred embodiment of the present invention the residual image includes a compressed residual image.
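Given a compressed digital image of this form, decompression can be sketched as painting each template back onto the residual image at every stored location. The record layout (`template` and `locations` keys) and the pixel model (rows of 0/1 values, 1 being black) are assumptions carried over for illustration only.

```python
def reconstruct(residual, class_records):
    """Paint each class template onto a copy of the residual image."""
    image = [row[:] for row in residual]
    for record in class_records:
        template = record["template"]
        for x, y in record["locations"]:
            for dy, template_row in enumerate(template):
                for dx, pixel in enumerate(template_row):
                    if pixel:
                        image[y + dy][x + dx] = 1
    return image

residual = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
records = [{"code": 101, "template": [[1, 1], [1, 0]], "locations": [(0, 0), (4, 0)]}]
restored = reconstruct(residual, records)
```

Every member of a class is redrawn from the same template, so small differences between the original character shapes are not recovered; this is where the scheme is lossy.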
There is also provided in accordance with another preferred embodiment of the present invention a method for compressing a digitized image of a document, the method including for at least one class of similar characters included in the digitized image, removing all characters included in the class from the image, thus producing a residual image, and producing a template and image-location information describing all the characters included in the class, and compressing at least one of the following: the residual image, the template,
