Image analysis – Pattern recognition – Feature extraction
Reexamination Certificate
1998-11-24
2003-05-13
Boudreau, Leo (Department: 2621)
C382S164000, C382S173000, C382S177000, C382S178000, C382S180000, C382S181000, C382S284000, C382S295000, C348S584000
active
06563949
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to image recognizing technology for reading documents, drawings, etc., and more specifically to character string extracting technology for extracting a character string such as a headline from an image.
2. Description of the Related Art
When a document is filed electronically, a keyword must be assigned to it. However, assigning keywords is a burdensome task for the user. Automating this task is therefore important for performing the electronic filing process efficiently.
In newspapers and magazines, the most efficient approach is to automatically extract a headline, recognize the characters forming it, and define the result as a keyword, because a headline indicates many characteristics of the contents of a document and can be easily retrieved from the document.
As a result, technology has been developed for shortening the time taken to extract a keyword and for extracting a keyword correctly (for example, Tokkaihei 4-287168, which discloses a method of automatically extracting a keyword from a file).
In this method, it is assumed that the description of a drawing, a photograph, or a table is positioned at the top or foot of the item it describes. Thus, the description can be extracted as a character string or a character string area, and the characters forming the description are recognized and entered as a keyword.
Technology has also been developed for extracting a character string from an image (for example, Tokkaihei 8-293003, which discloses a character string extracting method, a character string extraction apparatus based on the method, a character string recognizing apparatus, and a character string recognizing system).
In this example, all characters in the image are extracted, continuous characters are grouped as a character string, and the feature amount of each group is compared with an entered feature amount model so that the group can be discriminated and extracted as a character string. Continuous characters refers to a character string, and a feature amount refers to the type and size of a character such as a kanji (Chinese character), a numerical character, etc.
Thus, there are various documents and drawings to be treated in an electronic filing process, and there are various image recognition technologies. When a character string is extracted from an image, the most popular method is to process a headline that has its own background, as is often seen in newspapers.
First, it is determined whether an input image contains vertically arranged characters or horizontally arranged characters. Then, a labelling process is performed on the input image and its black/white inverse image to obtain connected elements having a series of picture elements in the same color.
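The labelling step above can be sketched as follows. This is a minimal illustration, not the method of the cited art: it assumes a binary image stored as a list of rows with 1 for black and 0 for white, and it assumes 8-connectivity, which the text does not specify. The names `label_components` and `invert` are hypothetical.

```python
from collections import deque


def label_components(image, target=1):
    """Return enclosing rectangles (x0, y0, x1, y1) of connected `target` pixels.

    Assumption: 8-connectivity; the source text does not state which is used.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != target or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited target pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and not seen[ny][nx]
                                and image[ny][nx] == target):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def invert(image):
    """Black/white inverse of a 0/1 image."""
    return [[1 - pixel for pixel in row] for row in image]
```

Calling `label_components(image)` and `label_components(invert(image))` yields the connected elements of the input image and of its inverse image, respectively.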
Next, a character candidate is found based on the size, thickness, and relative position of each connected element.
The character candidate obtained from the connected element of an input image is referred to as a black character candidate, and the character candidate obtained from the connected element of an inverse image is referred to as a white character candidate. The color of characters is determined from the number of black character candidates and white character candidates. When the character color is black, only the connected elements of an input image are to be processed in the subsequent steps. When the character color is white, only the connected elements of the black/white inverse image are to be processed thereafter.
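The colour decision described above reduces to comparing two counts. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, and the tie-breaking rule (ties count as black) is an assumption, since the text does not say how ties are resolved.

```python
def decide_character_color(black_candidates, white_candidates):
    """Pick the character colour from the candidate counts.

    Assumption: a tie is resolved in favour of black.
    """
    if len(black_candidates) >= len(white_candidates):
        return "black"
    return "white"


def elements_to_process(color, input_elements, inverse_elements):
    """Only one set of connected elements is kept for the later steps."""
    return input_elements if color == "black" else inverse_elements
```

Because the other set of connected elements is discarded at this point, a wrong decision cannot be corrected in the subsequent steps.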
Next, a character string area is obtained after merging the connected elements to be processed. The connected element which is contained in the character string area and is equal to or larger than a threshold in thickness is extracted as a character element. The threshold is a value indicating a constant ratio to the maximum value in thickness of a connected element. Finally, the connected element extracted as a character element is generated as an image, and is defined as a character string in a character recognizing process.
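The thickness test above can be sketched as a simple filter. The function name and the `ratio` default of 0.5 are assumptions chosen for illustration; the text only states that the threshold is a constant ratio of the maximum thickness.

```python
def select_character_elements(thicknesses, ratio=0.5):
    """Return indices of connected elements whose thickness meets the threshold.

    The threshold is `ratio` times the maximum thickness in the area.
    Assumption: `ratio` = 0.5 is a placeholder, not a value from the source.
    """
    if not thicknesses:
        return []
    threshold = ratio * max(thicknesses)
    return [i for i, t in enumerate(thicknesses) if t >= threshold]
```

With thicknesses `[8, 2, 5, 4]` and a ratio of 0.5, the threshold is 4, so the element of thickness 2 is dropped even if it is a genuine thin stroke.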
To correctly extract a headline, precise integration technology for a black picture element area belonging to the same character string is required.
The following conventional method relates to this technology.
After performing a pre-process such as adjusting a tilt, removing a character-box line, etc., the entire image is labelled, and an overlapping nest integrating process is performed on the obtained black picture element connection areas. Then, the character size of the text of the entire document is determined from the obtained black picture element connection area. Based on the value, the attribute of each connection area is determined. When it is determined that the attribute of a rectangle is a character, vertical or horizontal integration is repeated on the rectangle, thereby defining a character string.
However, in the conventional technology, the character color is determined during the character extracting process, and the character line width is fixed to a standard value. Furthermore, a character string area is set in line units (or in column units). Therefore, it has been quite difficult to extract a character string from a complicated image comprising a background pattern containing a combination of white and black portions, various types of fonts, a color document, a plurality of lines, a combination of vertical and horizontal character strings, or a compound of them.
Furthermore, the relative number of black character candidates and white character candidates is not a reliable criterion for determining a character color. When the character color is determined during the character extracting process, a mistaken determination is irrevocable, so the final character recognition fails.
Additionally, when the character line width is fixed to a standard value, a character element printed by a comparatively thin line can be easily lost, thereby failing in the final character recognition.
Furthermore, since the overlapping nest integrating process is performed on black picture element connection areas in the conventional technology, portions that should not be integrated are integrated one after another, until the entire document is incorrectly integrated.
For example, when the tilt of the entire document cannot be adjusted, or the character-box line cannot be completely removed, the entire document can be integrated in the overlapping nest integrating process.
FIG. 1 shows an example of integrating the entire document in the conventional overlapping nest integrating process.
In FIG. 1A, it is assumed that the enclosing rectangles K61 through K65 of the connected elements have been obtained from an input image. When the overlapping nest integrating process is performed on the enclosing rectangles K61 through K65 of the connected elements, the enclosing rectangle K61 and the enclosing rectangle K62 are integrated because they overlap each other. As a result, as shown in FIG. 1B, an enclosing rectangle K66 enclosing the enclosing rectangle K61 and the enclosing rectangle K62 is generated. When the enclosing rectangle K66 is generated, the enclosing rectangle K66 overlaps the enclosing rectangle K63. Therefore, they are integrated, and the enclosing rectangle K67 encompassing the enclosing rectangle K66 and the enclosing rectangle K63 is generated as shown in FIG. 1C. When the enclosing rectangle K67 is generated, the enclosing rectangle K67 overlaps the enclosing rectangle K64. Therefore, they are integrated. Similarly, all enclosing rectangles K61 through K65 shown in FIG. 1A are integrated, and the enclosing rectangle K68 encompassing the enclosing rectangles K61 through K65 is generated as shown in FIG. 1D.
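The cascade described for FIG. 1 can be reproduced with a short sketch. The coordinates below are hypothetical and are not taken from the figure; they are chosen so that only the first two rectangles overlap at the start, yet every merge creates a new overlap. Rectangles are `(x0, y0, x1, y1)` with inclusive corners, and the function names are assumptions.

```python
def overlaps(a, b):
    """True if two enclosing rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(a, b):
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def integrate_overlapping(rects):
    """Merge overlapping rectangles repeatedly until none overlap."""
    rects = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                if overlaps(rects[i], rects[j]):
                    rects[i] = union(rects[i], rects[j])
                    del rects[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan: the grown rectangle may now overlap others
    return rects


# Hypothetical stand-ins for K61 through K65 (not the coordinates of FIG. 1A).
EXAMPLE_RECTS = [
    (0, 0, 4, 4),    # K61
    (3, 3, 6, 10),   # K62, overlaps K61 only
    (5, 0, 8, 2),    # K63, overlaps the K61+K62 union
    (7, 5, 10, 8),   # K64
    (9, 0, 12, 3),   # K65
]
```

Running `integrate_overlapping(EXAMPLE_RECTS)` collapses all five rectangles into a single enclosing rectangle, which mirrors how the conventional process can end up integrating the entire document.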
There is also the problem that the overlapping nest integrating process takes too long when a headline is accompanied by a photograph, a drawing, or a texture.
SUMMARY OF THE INVENTION
The first object of the pr
Boudreau Leo
Mariam Daniel G.
Staas & Halsey , LLP