Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-08-23
2002-10-22
Rones, Charles L. (Department: 2175)
C707S793000, C707S793000, C717S146000, C382S103000, C382S190000, C382S229000
active
06470336
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a document search device for searching for a keyword based on a recognition result obtained by character recognition of a document image and a recording medium having a document search program stored thereon.
2. Description of the Related Art
In general, in order to accumulate a document in the form of paper in an electronic document data base, the document is read as image data, and character recognition of the data is performed to convert the data into a collection of electronic character codes (character recognition result). Thus, the document is accumulated in the document data base as the collection of character codes. In order to search for a keyword in the document data base, it is determined whether the keyword is included in the character recognition result. With generally used character recognition, some of the characters written in the original document (document in the form of paper) may not be correctly converted into character codes. When such an error occurs in the character recognition, the characters represented by the character codes may be different from the characters in the original document. In this case, when a search for a keyword is performed in the collection of character codes accumulated in the document data base, a search omission may occur. The phrase “search omission” is defined to indicate that a character string is not detected as a result of the search for a keyword even though the original document includes a character string which corresponds to the keyword.
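A search omission of this kind can be illustrated with a minimal sketch. It assumes that the recognition result is a list of character codes and that matching is an exact comparison of codes; the romanized tokens below merely stand in for character codes, and the function name `find_keyword` is hypothetical.

```python
def find_keyword(recognized, keyword):
    """Return the start indices at which keyword occurs in recognized.

    Both arguments are lists of character codes; matching is exact.
    """
    n, m = len(recognized), len(keyword)
    return [i for i in range(n - m + 1) if recognized[i:i + m] == keyword]


# Illustrative data only: romanized tokens stand in for character codes.
original = ["nichi", "hon", "jin", "koh"]    # characters in the paper document
recognized = ["nichi", "ki", "jin", "ku"]    # result containing recognition errors
keyword = ["nichi", "hon"]

print(find_keyword(original, keyword))    # [0]
print(find_keyword(recognized, keyword))  # []  (search omission)
```

The keyword is present in the original document but is not found in the recognition result, because one of its characters was converted into a different character code.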
A known technology for preventing the search omission is described in, for example, Japanese Laid-Open Publication No. 7-152774.
In accordance with the technology described in Japanese Laid-Open Publication No. 7-152774, an expanded character string is developed at the time of search, using a similar character list for a character or characters, among the characters included in the keyword, which are easily mistaken for other character(s). The similar character list includes a plurality of characters which can be mistaken for the above-mentioned character(s). These character(s) are easily mistaken since there are other characters having similar shapes thereto.
The conventional technology described in Japanese Laid-Open Publication No. 7-152774 will be described with reference to FIGS. 24A and 24B.
FIG. 24A shows a case in which characters “(‘hon’)” and “□(‘koh’)” included in an original document are respectively converted into characters “(‘ki’)” and “(‘ku’)” having similar shapes thereto by an error in character recognition. The character recognition result is a collection of character codes, but in FIG. 24A, the character codes are shown by the characters corresponding to the character codes for easier understanding. Although the original document includes keyword “(‘nihon’)”, a search omission occurs when keyword “(‘nihon’)” is searched for using the character recognition result.
FIG. 24B shows an example of a similar character list. Row 99-1 shows that the character “(‘hon’)” is easily mistaken for characters “(‘ki’)”, “(‘dai’)”, “(‘futo’)” and “(‘sai’)”. Row 99-2 shows that the character “(‘koh’)” is easily mistaken for characters “□” (square symbol), “(‘kai’)”, “(‘en’)” and “(‘nado’)”.
In accordance with the conventional technology described in Japanese Laid-Open Publication No. 7-152774, keyword “(‘nihon’)” is searched for in the following manner. Using the similar character list shown in FIG. 24B, developed character strings “(‘nichiki’)”, “(‘nichidai’)”, “(‘nichifuto’)” and “(‘nichisai’)” are created. When keyword “(‘nihon’)” is searched for using the character recognition result, the developed character strings “(‘nichiki’)”, “(‘nichidai’)”, “(‘nichifuto’)” and “(‘nichisai’)” are also used as the keyword. Thus, “(‘nichiki’)” which has been mistakenly converted from “(‘nihon’)” by character recognition can be found.
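The development of a keyword described above can be sketched as follows. This is a minimal illustration, not the implementation of the cited publication: it assumes that one character is replaced at a time, the romanized tokens stand in for character codes, `"square"` stands in for the square symbol, and the names `SIMILAR`, `develop` and `search` are hypothetical.

```python
# Similar character list modeled on rows 99-1 and 99-2 of FIG. 24B.
SIMILAR = {
    "hon": ["ki", "dai", "futo", "sai"],      # row 99-1
    "koh": ["square", "kai", "en", "nado"],   # row 99-2
}


def develop(keyword, similar):
    """Create developed strings by replacing one character at a time
    with each entry of its similar character list (an assumption)."""
    developed = []
    for i, ch in enumerate(keyword):
        for alt in similar.get(ch, []):
            developed.append(keyword[:i] + [alt] + keyword[i + 1:])
    return developed


def search(recognized, keyword, similar):
    """Search for the keyword and for every developed string."""
    hits = []
    for cand in [keyword] + develop(keyword, similar):
        m = len(cand)
        for i in range(len(recognized) - m + 1):
            if recognized[i:i + m] == cand:
                hits.append((i, cand))
    return hits


recognized = ["nichi", "ki", "jin", "ku"]   # illustrative recognition result
print(develop(["nichi", "hon"], SIMILAR))
# [['nichi', 'ki'], ['nichi', 'dai'], ['nichi', 'futo'], ['nichi', 'sai']]
print(search(recognized, ["nichi", "hon"], SIMILAR))
# [(0, ['nichi', 'ki'])]
```

The hit at position 0 corresponds to the developed string that the recognition error produced in place of the keyword.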
With the technology disclosed in Japanese Laid-Open Publication No. 7-152774, however, a search omission cannot be avoided when a character included in the document is mistaken for a character which is not included in the similar character list. For example, it is assumed that keyword “(‘jinkoh’)” is searched for using the character recognition result shown in FIG. 24A. Character “(‘ku’)”, which is mistakenly converted from character “(‘koh’)”, is not included in the similar character list for character “(‘koh’)” shown in row 99-2 of FIG. 24B. Therefore, developed character string “(‘jinku’)” is not searched for, and thus a search omission occurs.
In order to reduce the undesirable possibility of such a search omission, the number of characters included in the similar character list can be increased. However, this increases the number of developed character strings and thus raises the costs (i.e., time and calculation amount) for the search.
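The growth in search cost can be made concrete with a small calculation. The source does not state how substitutions are combined when several characters of a keyword have similar character lists, so the sketch below reports two assumed schemes: replacing one character at a time, and replacing any combination of characters. The function name `developed_counts` is hypothetical.

```python
from math import prod


def developed_counts(list_sizes):
    """Number of developed strings for a keyword whose characters have
    similar character lists of the given sizes.

    Returns (one_at_a_time, all_combinations); both schemes are assumptions.
    """
    one_at_a_time = sum(list_sizes)
    all_combinations = prod(1 + s for s in list_sizes) - 1  # exclude the keyword itself
    return one_at_a_time, all_combinations


# Two-character keyword; each character has a list of the given size.
for size in (4, 8, 16):
    print(size, developed_counts([size, size]))
# 4 (8, 24)
# 8 (16, 80)
# 16 (32, 288)
```

Under either scheme, every added entry in a similar character list adds developed strings that must each be searched for.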
SUMMARY OF THE INVENTION
According to one aspect of the invention, a document search device for searching for a keyword in a recognition result obtained by character recognition performed on a document image is provided. The keyword includes at least one first character, and a character code is assigned to each of the at least one first character. The recognition result includes at least one second character, and a character code and a partial area of the document image are assigned to each of the at least one second character. The document search device includes a first matching portion specification section for determining whether or not the recognition result includes at least one first matching portion which matches the keyword based on a comparison of the character code assigned to the at least one first character with the character code assigned to the at least one second character, and for specifying the at least one first matching portion when the recognition result includes the at least one first matching portion; a first portion specification section for determining whether or not a remaining part of the recognition result other than the at least one first matching portion includes at least one first portion which fulfills a prescribed first condition, and for specifying the at least one first portion when the remaining part includes the at least one first portion; and a second matching portion specification section for determining whether or not the at least one first portion includes at least one second matching portion which matches the keyword based on a comparison of a feature amount of the partial area of the document image associated with the at least one second character included in the at least one first portion with a feature amount of an image of at least one first character included in the keyword, and for specifying the at least one second matching portion when the at least one first portion includes the at least one second matching portion.
The prescribed first condition includes a condition that the at least one first portion is in the vicinity of a specific second character having a width smaller than a prescribed value.
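The three sections described above can be sketched as three functions applied in sequence. This is a simplified illustration under stated assumptions, not the claimed implementation: the feature amount is modeled as a short numeric tuple compared by Euclidean distance, "vicinity" is modeled as a fixed index radius, candidate portions have the same number of characters as the keyword, and all names, widths, features and thresholds are hypothetical.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class RecChar:
    code: str       # character code from character recognition
    width: int      # width of the partial area in the document image
    feature: tuple  # hypothetical feature amount of that partial area


def code_matches(rec, keyword):
    """First matching portions: exact comparison of character codes."""
    m = len(keyword)
    return [i for i in range(len(rec) - m + 1)
            if [c.code for c in rec[i:i + m]] == keyword]


def candidate_portions(rec, m, taken, min_width, radius=1):
    """First portions: windows outside the code matches that lie within
    `radius` positions of a character narrower than min_width."""
    covered = {j for i in taken for j in range(i, i + m)}
    starts = []
    for i in range(len(rec) - m + 1):
        if covered & set(range(i, i + m)):
            continue
        lo, hi = max(0, i - radius), min(len(rec), i + m + radius)
        if any(rec[j].width < min_width for j in range(lo, hi)):
            starts.append(i)
    return starts


def feature_matches(rec, starts, keyword_features, max_dist):
    """Second matching portions: compare feature amounts per character."""
    m = len(keyword_features)
    return [i for i in starts
            if all(dist(rec[i + k].feature, keyword_features[k]) <= max_dist
                   for k in range(m))]


# Illustrative data: the second character was misrecognized but its image
# feature stays close to that of the keyword character.
rec = [RecChar("nichi", 20, (1.0, 0.0)), RecChar("ki", 12, (0.0, 1.1)),
       RecChar("jin", 20, (0.5, 0.5)), RecChar("ku", 20, (0.9, 0.9))]
keyword_codes = ["nichi", "hon"]
keyword_features = [(1.0, 0.0), (0.0, 1.0)]

first = code_matches(rec, keyword_codes)
portions = candidate_portions(rec, len(keyword_codes), first, min_width=15)
second = feature_matches(rec, portions, keyword_features, max_dist=0.2)
print(first, portions, second)  # [] [0, 1, 2] [0]
```

The comparison by character code finds nothing, while the comparison by feature amount, restricted to the candidate portions, recovers the occurrence at position 0 without developing additional character strings.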
In one embodiment of the invention, the second matching portion specification section includes a first determination section for determining whether or not the character code of a specific second character included in the at least one first portion matches the character code of a specific first character included in the keyword; a non-matching character specification section for, when the character code of the specific second character included in the at least one first portion does not match the character code of the specific first character included in the keyword, specifying one second character or two or more continuous second characters which include at least the specific second character included in the at least one first portion and have a width closest to a width of the specific first character as a non-matching character; and a second determination section for, when a distance between a feature amount of an image of the spec
Imagawa Taro
Kondo Kenji
Matsukawa Yoshihiko
Mekata Tsuyoshi
Matsushita Electric Industrial Co., Ltd.
Renner Otto Boisselle & Sklar
Rones Charles L.