Page analysis system

Image analysis – Image segmentation – Distinguishing text from other regions

Reexamination Certificate


C358S462000

active

06512848

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a page analysis system for analyzing image data of a document page by utilizing a block selection technique, and particularly to such a system in which blocks of image data are classified based on characteristics of the image data. For example, blocks of image data may be classified as text data, titles, half-tone image data, line drawings, tables, vertical lines or horizontal lines.
2. Incorporation by Reference
U.S. patent applications Ser. No. 07/873,012, “Method And Apparatus For Character Recognition”, Ser. No. 08/171,720, “Method And Apparatus For Selecting Text And Or Non-Text Blocks In A Stored Document”, Ser. No. 08/596,716, “Feature Extraction System For Skewed And Multi-Orientation Documents”, and Ser. No. 08/338,781, “Page Analysis System”, which are commonly owned by the assignee of the present invention, are incorporated herein by reference.
3. Description of the Related Art
Recently developed block selection techniques, such as the techniques described in the aforementioned U.S. patent application Ser. Nos. 07/873,012 and 08/171,720, are used in page analysis systems to provide automatic analysis of image data within a document page. In particular, these techniques are used to distinguish between different types of image data within the page. The results of such techniques are then used to choose a type of processing to be subsequently performed on the image data, such as optical character recognition (OCR), data compression, data routing, etc. For example, image data which a block selection technique has designated as text data is subjected to OCR processing, whereas image data which is designated as picture data is subjected to data compression. Due to the foregoing, various types of image data can be input and automatically processed without requiring user intervention.
Block selection techniques are most beneficial when applied to composite documents. FIG. 1 shows an image of composite document page 1 as it appears after being subjected to a block selection technique. Document page 1 includes a logo within block 2, a large font title within blocks 3 to 6, large font decorative text within block 7, text-sized decorative font within blocks 8 to 13, various text-sized symbols within blocks 14 to 27 and a small symbol pattern within blocks 28 to 35.
Block selection techniques use a “blocked” document image such as that shown in FIG. 1 to create a hierarchical tree structure representing the document. FIG. 2 shows a hierarchical tree which represents document page 1. The tree consists of root node 101, which represents document page 1, and various descendent nodes. Descendent nodes 102, 103 to 106, 107, 108 to 113, 114 to 127 and 128 to 135 represent blocked areas 2, 3 to 6, 7, 8 to 13, 14 to 27 and 28 to 35, respectively.
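As an illustration only, the tree of FIG. 2 can be sketched as a root node with one child per blocked area. The `Node` class, the `build_page_tree` function and the rule that a node number equals 100 plus the page or block number are assumptions inferred from the reference numerals used here, not structures taken from the referenced applications. A fuller tree could also nest nodes beneath one another.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int       # reference numeral of the node (hypothetical convention)
    attribute: str     # e.g. "page", "text", "unknown"
    children: list = field(default_factory=list)


def build_page_tree(block_attributes, page_id=1):
    """Build a root node for the page with one child node per blocked area.

    block_attributes maps a block number to its attribute string.
    """
    root = Node(100 + page_id, "page")
    for block_id, attribute in sorted(block_attributes.items()):
        root.children.append(Node(100 + block_id, attribute))
    return root


tree = build_page_tree({2: "unknown", 3: "unknown", 4: "text"})
child_ids = [child.node_id for child in tree.children]
```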
In order to construct such a tree, block selection techniques such as those described in U.S. patent application Ser. Nos. 07/873,012 and 08/171,720 search each area of document page 1 to find “connected components”. As described therein, connected components comprise two or more pixels connected together in any of eight directions surrounding each subject pixel. The dimensions of the connected components are rectangularized to create corresponding “blocked” areas. Next, text connected components are separated from non-text connected components. The separated non-text components are thereafter classified as, e.g., tables, half-tone images, line drawings, etc. In addition, block selection techniques may combine blocks of image data which appear to be related in order to more efficiently process the related data.
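The connected-component search and rectangularizing steps described above can be sketched as follows. This is a minimal sketch, assuming a binary image held as a list of rows in which truthy values are black pixels. The function name `find_blocks` and the breadth-first traversal are illustrative choices and are not taken from the referenced applications.

```python
from collections import deque


def find_blocks(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected components."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    blocks = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over the eight neighbouring directions.
            top, left, bottom, right = r, c, r, c
            size = 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                size += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if ((dy or dx) and 0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # The text describes components of two or more pixels,
            # so isolated single pixels are dropped here.
            if size >= 2:
                blocks.append((top, left, bottom, right))
    return blocks


page = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
boxes = find_blocks(page)  # two components: one diagonal pair, one vertical pair
```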
The separation and classification steps are performed by analyzing characteristics of the connected components such as component size, component dimension, average size of each connected component, average size of internal connected components and classification of adjacent connected components. However, despite using complex algorithms in conjunction with the foregoing factors in order to classify blocks of image data, block selection techniques often mis-identify or are unable to identify blocks of data within a document page.
For example, as shown in FIG. 2, a conventional block selection technique may not be able to distinguish the content of blocks 2, 3 and 7 of page 1. Accordingly, corresponding nodes 102, 103 and 107 are designated “unknown”.
These problems occur because the classification algorithms applied by conventional block selection techniques are premised on many assumptions relating to data size, e.g., any data which falls within a given size threshold is classified as text data. Accordingly, any text data outside of that threshold will most likely not be characterized as text data. Also, text and non-text connected components are separated based on an assumption that text connected components are usually smaller than picture connected components. In addition, the algorithms also assume that text connected components comprise the majority of the connected components in a document page.
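The effect of these size assumptions can be illustrated with a deliberately naive classifier. This is a sketch under stated assumptions: the function `classify_by_size`, the use of the median block height as the typical text size, and the `text_max_ratio` parameter are all hypothetical and are not the algorithms of the referenced applications.

```python
def classify_by_size(block_heights, text_max_ratio=2.0):
    """Label each block "text" or "non-text" from its height alone."""
    if not block_heights:
        return []
    # Assumption: most components are text, so the median height
    # approximates the height of ordinary text.
    ordered = sorted(block_heights)
    typical = ordered[len(ordered) // 2]
    limit = typical * text_max_ratio
    return ["text" if h <= limit else "non-text" for h in block_heights]


# Four body-text blocks and one large-font title block of height 40.
labels = classify_by_size([10, 11, 12, 10, 40])
```

In this sketch the title block of height 40 exceeds the threshold and is labelled "non-text" even though it contains characters, which is the kind of mis-identification described above.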
Accordingly, conventional block selection techniques are inherently inaccurate because they rely on assumptions regarding size-related characteristics of document image data and do not attempt to actually recognize the content of the image data.
Mis-identification of document image data due to these inherent inaccuracies results in significant problems when combining related blocks of image data. For example, the combining algorithm used in the present example requires that blocks which a block selection technique has designated as “unknown” be combined with any adjacent text blocks. Accordingly, because “unknown” blocks 2 and 3 of document page 1 are adjacent to “text” blocks 4 to 6, these blocks are grouped together to form “text” block 36, shown in FIG. 3. Therefore, the logo within original block 2 will be mistakenly processed as text. As also shown in FIG. 3, blocks 7 to 13, 14 to 27 and 28 to 35 are combined into single “text” blocks 38, 39 and 40, respectively.
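The combining rule in this example, under which an “unknown” block is absorbed into an adjacent “text” block, can be sketched as follows. The box layout (top, left, bottom, right), the `gap` tolerance used to decide adjacency and the function names are assumptions made for illustration only.

```python
def adjacent(a, b, gap=1):
    """True if boxes (top, left, bottom, right) overlap or lie within `gap` pixels."""
    return not (a[2] + gap < b[0] or b[2] + gap < a[0] or
                a[3] + gap < b[1] or b[3] + gap < a[1])


def union_box(a, b):
    """Smallest box enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_unknown_into_text(blocks, gap=1):
    """blocks is a list of (label, box); returns a new list after merging."""
    blocks = list(blocks)
    merged = True
    while merged:
        merged = False
        for i, (label_i, box_i) in enumerate(blocks):
            if label_i != "unknown":
                continue
            for j, (label_j, box_j) in enumerate(blocks):
                if label_j == "text" and adjacent(box_i, box_j, gap):
                    # The unknown block is swallowed by the text block.
                    blocks[j] = ("text", union_box(box_i, box_j))
                    del blocks[i]
                    merged = True
                    break
            if merged:
                break  # restart the scan, since the list has changed
    return blocks


page_blocks = [
    ("unknown", (0, 0, 9, 9)),      # e.g. a logo the classifier could not identify
    ("text", (0, 10, 9, 30)),
    ("picture", (50, 0, 90, 40)),
]
combined = merge_unknown_into_text(page_blocks)
```

Here the unidentified block ends up inside a single "text" block, so whatever it actually contains would be handed to text processing.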
Techniques have been developed to address the tendency of existing block selection techniques to mis-identify and/or erroneously combine image data. For example, U.S. patent application Ser. No. 08/361,240 describes a method for reviewing the data classifications resulting from a block selection technique and for editing the classifications in the case that any image data was misidentified by the block selection technique. However, such techniques require operator intervention and are therefore not adequate in cases where automation of the block selection technique is required.
SUMMARY OF THE INVENTION
The present invention relates to a method for classifying blocks of image data within a document page which utilizes optical character recognition processing to address shortcomings in existing block selection techniques.
Thus, according to one aspect of the invention, the present invention is a method for increasing the accuracy of image data classification in a page analysis system for analyzing image data of a document page. The method includes inputting image data of a document page as pixel data, analyzing the pixel data in order to locate all connected pixels, rectangularizing connected pixel data into blocks, analyzing each of the blocks of pixel data in order to determine the type of image data contained in the block, outputting an attribute corresponding to the type of image data determined in the analyzing step, and performing optical character recognition so as to recognize the type of image data in the block of image data in the case that the analyzing step cannot determine the type of image data contained in the block.
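The fallback described in this aspect can be sketched as a small control flow. The `analyze` and `recognize` callables are hypothetical stand-ins for the block analysis step and the optical character recognition step. Labelling a block "non-text" when recognition fails is an assumption made for this sketch, since the passage above does not state what attribute is output in that case.

```python
def classify_with_ocr_fallback(blocks, analyze, recognize):
    """Return (block, attribute) pairs, using OCR only for undetermined blocks.

    analyze(block)   -> attribute string, or None if the type cannot be determined
    recognize(block) -> recognized characters, or None/"" if recognition fails
    """
    results = []
    for block in blocks:
        attribute = analyze(block)
        if attribute is None:
            # Assumption: successful recognition implies text data.
            attribute = "text" if recognize(block) else "non-text"
        results.append((block, attribute))
    return results


# Stub analysis and recognition steps operating on dictionaries.
def analyze_stub(block):
    return block.get("attr")


def recognize_stub(block):
    return block.get("chars")


sample = [{"attr": "table"}, {"chars": "ABC"}, {}]
attributes = [a for _, a in classify_with_ocr_fallback(sample, analyze_stub, recognize_stub)]
```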
In another aspect, the present invention is a method for accurately classifying image data in a page analysis system for analyzing image data of a document page. The method includes inputting image data of a document page as pixel
