Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2001-04-04
2003-11-04
Alam, Shahid Al (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C704S254000, C715S252000
Reexamination Certificate
active
06643647
ABSTRACT:
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2000-102370, filed Apr. 4, 2000, the entire contents of which are incorporated herein by reference.
BACKGROUND OF THE INVENTION
This invention relates to a word string collating apparatus and word string collating method for collating a word string such as an address with addresses in an address dictionary when the word string is extracted from a character recognition result which may contain an error in the character recognition field in which a document inputting apparatus or an optical character reading apparatus for reading address information is used, for example, and an address recognition apparatus for recognizing the address.
For example, an apparatus for extracting only a word string of an address from a word string containing a destination address name, honorific title and the like written on an envelop by collating the word string with addresses in an address dictionary is proposed.
As this type of word string collating apparatus, an address collating apparatus for collating words based on a distance between words, for example, an apparatus for effecting a word collating process based on an edition distance (Levenshtein distance, LD, V. Levenstein; Sov. Phys. Dokl. 10, 707) or the like is known in the art. The outline thereof is to derive distances (similarities) between an input word string and words in the address dictionary and effect the address collating process by using the derived distances as the measurement when a word string is extracted from a set of input character recognition results containing errors.
The prior art technique is explained in detail below.
FIG. 1
shows the construction of an address collating apparatus used as the conventional word string collating apparatus. In
FIG. 1
, an input section
1
converts a document image into a form which can be processed by a computer when receiving the document image (for example, it is a photoelectric conversion device such as an image scanner).
A character recognition section
2
performs processes such as the binary coding process, segmentation process and individual character recognition process for understanding the contents of the document image based on the input document image. In this case, a recognition result obtained in the character recognition section
2
is hereinafter referred to as a character recognition result. The character recognition technique has been studied for a long period of time, but a system capable of attaining the character recognition rate of 100% without fail cannot be realized except some restricted cases. Therefore, it is required in practice to provide means for correctly extracting a word string even if the character recognition result contains an error.
A word string forming section
3
forms a word string A based on the character recognition result in the character recognition section
2
and stores the same into a memory M
1
. The word string A is a set of character strings segmented in the unit of word.
For example, the word string A constructed by 15 words of “JOHN”, “WILLIAMS”, “MULTIPLE”, “DLSTRICT”, “C”, “1278”, “SHEIATON”, “STREEI”, “UNLT”, “5”, “RICHRTIONDHILL”, “ONTARLO”, “L4B”, “2N1” and “CANADA” are formed as shown in
FIGS. 3 and 4
based on the address of FIG.
2
.
In an address dictionary M
2
used as the word dictionary, a plurality of address data items (words) B
1
, B
2
, . . . are previously stored and desired data items can be read out at any time.
For example, as shown in
FIG. 3
, the address data B
1
including six word items of the street name “WILLIAMS”, street suffix “STREET”, city name “RICHMONDHILL”, state name “ONTARIO”, zip code (upper three digits) “L4B” and zip code (lower three digits) “2N1” is read out.
Further, as shown in
FIG. 4
, the address data B
2
including six word items of the street name “SHERATON”, street suffix “STREET”, city name “RICHMONDHILL”, state name “ONTARIO”, zip code (upper three digits) “L4B” and zip code (lower three digits) “2N1” is read out.
A distance calculating section
11
calculates a distance C
LD
between words by use of the word string A and address data B
1
and stores the distance in a memory M
4
. The distance C
LD
between the words can be variously defined and an edition distance (which is also called a Levenshtein distance and is hereinafter simply referred to as LD) is given as one example thereof. LD indicates the minimum value of the number of operations of replacement, insertion and deletion of characters required for converting the word string A into the other word string B
1
. The operation is expressed by the following equation.
L
D
(
A,B
1)=
min{pa
(
i
)+
qb
(
i
)+
rc
(
i
)}
where a(i) indicates a certain number of replacing operations, b(i) indicates a certain number of insertion operations, and c(i) indicates a certain number of deletion operations. Further, p, q, r are weighting factors used for the edition operation of replacement, insertion and deletion and depend on appearing characters. Generally, since the number of combinations of a(i), b(i), c(i) is limitless, the minimum value of LD(A,B
1
) is derived by use of the dynamic programming method (D
p
).
An optimum solution deriving section
12
selects one of a plurality of address data items B
1
, B
2
, . . . which has the minimum distance C
LD
with respect to the word string A and provides the selected address data as the optimum solution.
An output section
10
converts the thus acquired optimum solution into a form which the user can understand and outputs the thus converted address data and is a display device, for example.
Conventionally, since only the distance (similarity) C
LD
between the words is used to perform the address collating process, there occurs a possibility that erroneous address data is selected as the optimum solution rather than correct address data. The operation is explained with reference to
FIGS. 2
,
3
,
4
.
FIGS. 3 and 4
show address collating methods based on the conventional method by taking an address (imaginary) in Canada as an example. An input document image is shown in FIG.
2
. In this example, “John Williams/Multiple District C/1278 Sheraton Street Unit 5/Richmondhill ONTARLO L4B 2N1 CANADA” is written. The result obtained by processing the document image by use of the character recognition section
2
and word string forming section
3
is a word string containing a character error. As described before, a character error is contained in the word string.
In this case, the recognized characters are all converted into capital letters (no distinction between capital letters and small letters). As shown in
FIGS. 3 and 4
, the word string containing the character error is “JOHN-WILLIAMS-MULTIPLE-DISTRICT-C-1278-SHEIATON-STREEI-UNLT-5-RICHRTIONDHILL-ONTARLO-L4B-2N1-CANADA”.
In the address dictionary M
2
, a plurality of address data items B
1
, B
2
, . . . are previously stored. In order to simplify the explanation, only two address data items including the first address data B
1
“WILLIAMS-STREET-RICHIMONDHILL-ONTARIO-L4B-2N1” as shown in FIG.
3
and the second address data B
2
“SHERATON-STREET-RICHIMONDHILL-ONTARIO-L4B-2N1” as shown in
FIG. 4
are provided. The items in each of the address data items sequentially and respectively indicate the street name, street suffix, city name, state name, postal code (upper three digits) and postal code (lower three digits) from the head portion.
The distance calculating section
11
compares the word string A with the first address data B
1
and the second address data B
2
. The method is to derive a word having the minimum distance (maximum similarity) for each item in the address data B
1
(B
2
). In the case shown in
FIGS. 3 and 4
, the distance between the words is derived based on LD and the similarity is derived according to the following equation (1).
SIMILARITY
=
1
LD
′
⁡
(
A
,
B
)
+
ε
⁢


⁢
LD
′
⁡
(
A
,
B
)
=
LD
⁡
(
A
Al Alam Shahid
Kabushiki Kaisha Toshiba
Pillsbury & Winthrop LLP
LandOfFree
Word string collating apparatus, word string collating... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Word string collating apparatus, word string collating..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Word string collating apparatus, word string collating... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3127353