Image analysis – Pattern recognition – Context analysis or word recognition
Reexamination Certificate
1999-04-08
2003-12-02
Mehta, Bhavesh M. (Department: 2625)
C382S160000, C382S177000, C382S228000, C382S232000, C707S793000
active
06658151
ABSTRACT:
The present invention relates to the field of document image processing, and more particularly to processing document images that have been symbolically compressed.
BACKGROUND OF THE INVENTION
Storage and transmission of electronic document images have become increasingly prevalent, spurring deployment and standardization of new and more efficient document compression techniques. Symbolic compression of document images, for example, is becoming increasingly common with the emergence of the JBIG2 standard and related commercial products. Symbolic compression techniques improve compression efficiency by 50% to 100% in comparison to the commonly used Group 4 compression standard (CCITT Specification T.6). A lossy version of symbolic compression can achieve 4 to 10 times better compression efficiency than Group 4.
In symbolic compression, document images are coded with respect to a library of pattern templates. Templates in the library are typically derived by grouping (clustering) together connected components (e.g., alphabetic characters) in the document that have similar shapes. One template is chosen or generated to represent each cluster of similarly shaped connected components. The connected components in the image are then represented by a sequence of template identifiers and their spatial offsets from the preceding component. In this way, an approximation of the original document is obtained without duplicating storage for similarly shaped connected components. Minor differences between individual components and their representative templates, as well as all other components which are not encoded in this manner, are optionally coded as residuals.
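The coding scheme described above can be illustrated with a minimal sketch. It is not the JBIG2 algorithm: the `Component` type, the `encode` function, the Hamming-distance matching rule and the `max_diff` threshold are all assumptions made for illustration, and every bitmap is assumed to have the same dimensions. The first component that fails to match an existing template simply becomes a new template, and residual coding is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

Bitmap = Tuple[str, ...]  # rows of '0'/'1' characters; equal sizes assumed


@dataclass
class Component:
    """A connected component: top-left position plus its pixel bitmap."""
    x: int
    y: int
    bitmap: Bitmap


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count differing pixels between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(components: List[Component], max_diff: int = 1):
    """Return (templates, symbols).

    Each symbol is (template_id, dx, dy), where dx and dy are offsets from
    the preceding component (the first is measured from the origin).
    max_diff is a hypothetical similarity threshold, not a standard value.
    """
    templates: List[Bitmap] = []
    symbols: List[Tuple[int, int, int]] = []
    prev_x, prev_y = 0, 0
    for comp in components:
        for tid, tmpl in enumerate(templates):
            if hamming(tmpl, comp.bitmap) <= max_diff:
                break  # close enough: reuse this template
        else:
            tid = len(templates)  # no match: start a new cluster
            templates.append(comp.bitmap)
        symbols.append((tid, comp.x - prev_x, comp.y - prev_y))
        prev_x, prev_y = comp.x, comp.y
    return templates, symbols
```

Two slightly different renderings of the same character thus share one stored template, and the page reduces to a short list of identifiers and offsets.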
Many document management activities, such as document classification, duplicate detection and language identification, are based on the semantic content of document images. Consequently, in traditional document management systems, compressed document images are first decompressed and then subjected to optical character recognition (OCR) to recover the semantic information needed for classification, language identification and duplicate detection. In the context of a database of symbolically compressed document images, the need to decompress and perform OCR consumes considerable processing resources. Also, because OCR engines are usually limited in the number and variety of typefaces they recognize, recovery of semantic information through conventional OCR techniques may not be possible for some symbolically compressed documents.
SUMMARY OF THE INVENTION
A method and apparatus for extracting information from symbolically compressed document images are disclosed. An input document image is represented by a sequence of template identifiers to reduce storage consumed by the input document image. The template identifiers are replaced with alphabet characters according to language statistics to generate a text string representative of text in the input document image. In one embodiment, the template identifiers are replaced with alphabet characters according to a hidden Markov model. Also, a conditional n-gram technique may be used to obtain indexing terms for document matching and other applications.
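As a rough illustration of replacing template identifiers with alphabet characters using a hidden Markov model, the sketch below runs standard Viterbi decoding over a toy three-letter alphabet. The hidden states are characters and the observations are template identifiers. The function name `viterbi`, the template identifiers 3, 5, 7 and 9, and every probability in the tables are made-up assumptions; the summary does not say how the model parameters are obtained, so they are simply hand-set here.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely character string for a template-id sequence."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # scores[s]: best log-probability of any path ending in state s
    scores = {s: lg(start_p[s]) + lg(emit_p[s].get(obs[0], 0)) for s in states}
    backptrs = []
    for o in obs[1:]:
        new_scores, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: scores[r] + lg(trans_p[r][s]))
            new_scores[s] = scores[best] + lg(trans_p[best][s]) + lg(emit_p[s].get(o, 0))
            ptr[s] = best
        scores = new_scores
        backptrs.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


# Toy language statistics (illustrative values only).
STATES = ["t", "h", "e"]
START_P = {"t": 0.6, "h": 0.2, "e": 0.2}
TRANS_P = {  # character bigram probabilities P(next | current)
    "t": {"t": 0.1, "h": 0.8, "e": 0.1},
    "h": {"t": 0.1, "h": 0.1, "e": 0.8},
    "e": {"t": 0.8, "h": 0.1, "e": 0.1},
}
EMIT_P = {  # P(template id | character); template 9 is ambiguous
    "t": {7: 0.90, 9: 0.05, 5: 0.05},
    "h": {7: 0.05, 9: 0.50, 3: 0.40, 5: 0.05},
    "e": {7: 0.05, 9: 0.50, 5: 0.45},
}

print(viterbi([7, 9, 5], STATES, START_P, TRANS_P, EMIT_P))  # prints "the"
```

Template 9 is equally likely to be "h" or "e" on shape evidence alone; the bigram statistics favor "h" after "t", which is the sense in which language statistics resolve the identifier sequence into text.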
These and other features and advantages of the invention will be apparent from the accompanying drawings and from the detailed description that follows below.
Hull Jonathan J.
Lee Dar-Shyang
Blakely, Sokoloff, Taylor & Zafman LLP
Desire Gregory
Mehta Bhavesh M.
Ricoh Co. Ltd.