Image analysis – Pattern recognition – Feature extraction
Reexamination Certificate
1997-07-07
2001-12-04
Boudreau, Leo (Department: 2621)
C382S173000, C382S175000, C382S202000, C382S203000, C382S282000, C382S286000, C382S291000, C707S793000
active
06327387
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a system for converting documents and drawings into image data through an input device such as a scanner, etc., adding management information to the image data, and accumulating resultant data; to an apparatus for identifying the structure of the ruled lines in the image for image recognition; and to a method of performing the above described processes.
2. Description of the Related Art
Recently, the conventional practice of storing information on paper has been giving way to storing data on electronic media. For example, an electronic filing system converts documents stored on paper into document images with an opto-electrical converter such as an image scanner, and stores the converted document images on an optical disk, a hard disk, etc., together with management information such as a keyword for retrieval.
Since documents are stored as image data in the above described method, a larger disk capacity is required than in a method in which all characters in the documents are encoded by character recognition technology before being stored. However, the above described method is simple to carry out and fast, and pictures and tables containing data other than characters can be stored as is. On the other hand, the stored information must be retrieved using additional management information, such as keywords or numbers, stored together with the document images. The conventional systems require much effort and time to assign a keyword and are not user-friendly.
To overcome this awkwardness of the conventional systems, the title of a document can be treated as a keyword: it is automatically extracted, recognized as characters, and encoded for storage with the document image.
At present, characters are recognized at up to several tens of characters per second, so it takes from about 30 seconds to several minutes to process a normal document page (approximately 21 cm×29.5 cm). It is therefore advisable not to recognize all characters of an entire document, but to first extract the necessary title from the document image and then recognize only that title.
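The timing argument above can be checked with simple arithmetic. The sketch below is illustrative only: the character counts per page and per title are assumed values, not figures from the source, and the function name is hypothetical.

```python
def recognition_seconds(char_count, chars_per_second):
    """Time to recognize `char_count` characters at a given recognition speed."""
    return char_count / chars_per_second

# Assumed values for illustration (not taken from the source text).
PAGE_CHARS = 1500    # hypothetical number of characters on a full page
TITLE_CHARS = 25     # hypothetical number of characters in a title
SPEED = 50           # "several tens of characters per second"

full_page = recognition_seconds(PAGE_CHARS, SPEED)    # 30.0 seconds
title_only = recognition_seconds(TITLE_CHARS, SPEED)  # 0.5 seconds
```

Under these assumptions, recognizing only the title is 60 times faster than recognizing the whole page, which is the motivation for extracting the title area first.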
The conventional technology of extracting a part of a document, for example, a title of the document, from a document image obtained by reading the document through an opto-electrical converter is described in “TITLE EXTRACTING APPARATUS FOR EXTRACTING TITLE FROM DOCUMENT IMAGE AND METHOD THEREOF” (U.S. patent application Ser. No. 08/694,503, now U.S. Pat. No. 6,035,061 issued Mar. 7, 2000, and Japanese Patent Application H7-341983) filed by the Applicant of the present invention.
FIG. 1A shows the principle of the title extracting apparatus.
The title extracting apparatus shown in FIG. 1A comprises a character area generation unit 1, a character string area generation unit 2, and a title extraction unit 3. The character area generation unit 1 extracts, by labelling connected components of picture elements, a partial pattern such as a part of a character, etc. from a document image input through a scanner, etc. Then, it extracts (generates) a character area by integrating several partial patterns. The character string area generation unit 2 integrates a plurality of character areas and extracts (generates) a character string area. The title extraction unit 3 extracts as a title area a character string area which is probably a title.
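The first stage, labelling connected components and integrating partial patterns into character areas, can be sketched as follows. This is a minimal illustration, not the apparatus's actual procedure: the 8-connectivity choice, the `gap` merging rule, and the function names `label_components` and `merge_boxes` are all assumptions.

```python
from collections import deque

def label_components(image):
    """Return bounding boxes (top, left, bottom, right) of the 8-connected
    components of black pixels (value 1) in a binary image (list of rows)."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search over one component, tracking its extent.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def merge_boxes(boxes, gap):
    """Integrate partial patterns: repeatedly merge any two boxes whose
    coordinates lie within `gap` of each other (hypothetical rule)."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                close = (a[0] - gap <= b[2] and b[0] - gap <= a[2]
                         and a[1] - gap <= b[3] and b[1] - gap <= a[3])
                if close:
                    merged[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged

# Two nearby strokes (one character) and one distant stroke.
image = [
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0],
]
partial_patterns = label_components(image)
character_areas = merge_boxes(partial_patterns, gap=2)
```

Applying `merge_boxes` again with a larger horizontal `gap` would be one way to approximate the integration of character areas into character string areas.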
At this time, the title extraction unit 3 uses notable features such as a top and center position, a character size larger than that of the body of the document, an underlined representation, etc. as indications of the probability of a title area. The probability is expressed as a score for each of the character string areas, finally yielding a plurality of candidates for the title area ordered from the highest score to the lowest. In the above described process, title areas can be extracted from documents containing no tables.
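The scoring described above can be sketched as follows. The source names the features (top position, centering, larger character size, underline) but not how they are weighted, so the equal weights, the thresholds, and the names `StringArea`, `title_score`, and `rank_title_candidates` are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class StringArea:
    text: str
    top: int           # y coordinate of the upper edge
    left: int          # x coordinate of the left edge
    right: int         # x coordinate of the right edge
    char_height: int   # approximate character size
    underlined: bool = False

def title_score(area, page_width, page_height, body_char_height):
    """Add one point per title-like feature (equal weights are an assumption)."""
    score = 0
    if area.top < page_height / 3:                        # near the top
        score += 1
    center = (area.left + area.right) / 2
    if abs(center - page_width / 2) < page_width * 0.1:   # roughly centered
        score += 1
    if area.char_height > body_char_height * 1.2:         # larger than body text
        score += 1
    if area.underlined:
        score += 1
    return score

def rank_title_candidates(areas, page_width, page_height, body_char_height):
    """Return the areas ordered from the highest score to the lowest."""
    return sorted(
        areas,
        key=lambda a: title_score(a, page_width, page_height, body_char_height),
        reverse=True,
    )

areas = [
    StringArea("Page 3", top=1350, left=850, right=950, char_height=12),
    StringArea("Quarterly Report", top=50, left=300, right=700,
               char_height=30, underlined=True),
    StringArea("body text line", top=600, left=100, right=900, char_height=12),
]
ranked = rank_title_candidates(areas, page_width=1000, page_height=1400,
                               body_char_height=12)
```

In this made-up page, the large, centered, underlined string near the top ranks first.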
On the other hand, when a document contains a table, the title extraction unit 3 extracts a title area by taking the number of characters into account after the character string area generation unit 2 extracts the character string areas in the table. For example, an item name implying the existence of a title, such as ‘Subject’ or ‘Name’, has comparatively few characters, whereas the character string representing the title itself, such as ‘ . . . relating to . . . ’, probably has many. Thus, a character string which is probably a title can be detected from adjacent character strings by utilizing the number of characters in the character strings.
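A minimal sketch of this character-count heuristic follows. It assumes the adjacent strings have already been paired up, and the two length thresholds and the function name `find_title_by_length` are hypothetical values chosen for illustration.

```python
def find_title_by_length(adjacent_pairs, max_item_chars=8, min_title_chars=12):
    """Return the first (item_name, title) pair in which a short string
    is followed by a long one, or None if no pair qualifies.

    adjacent_pairs: list of (first_string, second_string) tuples for
    character strings that are adjacent in the table.
    """
    for item_name, candidate in adjacent_pairs:
        if len(item_name) <= max_item_chars and len(candidate) >= min_title_chars:
            return item_name, candidate
    return None

pairs = [
    ("Name", "J. Smith"),
    ("Subject", "Report relating to scanner procurement"),
]
result = find_title_by_length(pairs)
```

The sketch relies entirely on which strings are treated as adjacent, so it carries no information about whether the title lies beside or below the item name.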
However, a large number of documents, such as slips, are formatted as tables using ruled lines, and the above described conventional technology has the problem that there is little probability that a title can be successfully extracted from a table.
For example, when a title is written at the center or around the bottom of a table, the title may not be correctly extracted simply by giving priority to character strings near the top. Furthermore, as shown in FIG. 1B, an approval column 11 is located at the top of the table. If there are a number of excess character strings such as ‘general manager’, ‘manager’, ‘sub-manager’, ‘person in charge’, etc. in the approval column 11, then these character strings are extracted by priority, and the title is not correctly extracted.
As shown by the combination of an item name 12 and a title 13, a title may be written below the item name 12, not on the right hand side of the item name 12. In this case, the relative positions of the item name and the title cannot be recognized from the number of characters of adjacent character strings alone. Furthermore, item names are written not only horizontally but also vertically in Japanese. Therefore, it is very hard to correctly specify the position of the item name. When a document contains two tables, the title may be located somewhere in the smaller table.
Since a document containing tables can be written in various formats, the probability of a title depends on each document, and the precision of extracting a title in a table is lowered. If the state of an input document image is not good, the extraction precision is lowered further.
In an electronic filing system, an extracted title area is character-recognized by an optical character reader (OCR) to generate a character code and add it to the image as management information. Thus, the image in a database can be retrieved using a character code.
In this case, there is no problem if the character string in a title area is readable by an OCR. However, if the background shows a textured pattern or the characters are in designed fonts, then the current OCR cannot recognize the character string. Therefore, in this case, management information cannot be added to the image.
SUMMARY OF THE INVENTION
The present invention aims at providing an apparatus and method of extracting appropriate management information for use in managing an image in a document in various formats, and an apparatus and method of accumulating images according to the management information.
An image management system having the management information extraction apparatus and the image accumulation apparatus according to the present invention includes a user entry unit, a computation unit, a dictionary unit, a comparison unit, an extraction unit, a storage unit, a group generation unit, and a retrieval unit.
According to the first aspect of the present invention, the computation unit computes the position of the management information contained in an arbitrary input image according to the position information about the position of a ruled line relative to the outline portion of a table area contained in the input image. The extraction unit extracts the management information from the input image based on the position computed by the computation
Katsuyama Yutaka
Naoi Satoshi
Takebe Hiroaki
Boudreau Leo
Fujitsu Limited
Mariam Daniel G.
Staas & Halsey, LLP