Automatic method for developing custom ICR engines

Image analysis – Pattern recognition

Reexamination Certificate


Details

C382S187000
Reexamination Certificate
active
06571013

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to a computer automated method for creating image recognition engines optimized for a given language and/or source; and more particularly to a computationally efficient method for selecting, from a large universe of features, subsets of features that optimize recognition accuracy against a target data set. While not limited thereto, the invention will be explained in its preferred context for use in recognition of hand-printed or machine-printed characters.
2. Description of the Prior Art
As will be appreciated by those skilled in the art, each new language studied (and research in character recognition generally) proposes new character features to address specific problems or attributes of the source data set. However, adding new features is not necessarily sufficient to improve the performance of a character recognition engine. If too many features are used, the rate of successful character discrimination actually drops as more features are added, a phenomenon known in the art as peaking.
Each language (Japanese, Thai, English, etc.) has unique characteristics. Traditionally, image character recognition (ICR) engines have been developed around a feature set hand-crafted to support the language. The quality of the ICR is directly related to the amount of research effort applied to the language. High quality ICR engines exist where large markets support a large investment in research and development. An automated method that selects feature subsets would allow inexpensive development of ICR engines that perform well against languages that economically would not support a large investment.
Then, too, mixed languages can present unique problems as an OCR engine may need different features to distinguish between the two (or more) languages than the feature set used to best recognize either language individually. An automated feature selection tool would generate a feature set that is tailored to handle the particular mix of languages involved.
The following is a definition of certain of the terms used herein:
Class: All samples of a given codepoint (character) within the training data set.
Subclass: The set of samples within a single class which share some common attribute(s).
Codepoint: The representation to a machine of what people would recognize as a character.
Feature Vector: The set of measurements for the feature universe for a single sample ($\vec{v}$).
$S^+$: The class under consideration.
$S^-$: The set of all classes within the problem space other than the codepoint under consideration ($S^+$).
Exclusive: Used to describe a binary vector. Exclusive indicates that the binary vector is distinct from all vectors within $S^-$ (eq. 1).

$$\vec{v} \wedge \vec{s} \neq \vec{v} \quad \forall\, \vec{s} \in S^- \tag{1}$$

where:
$\wedge$ is the "and" operator,
$\forall$ is the "for all" operator, and
$\in$ denotes set membership.
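The exclusivity test of eq. 1 can be sketched directly. This is a minimal illustration, not code from the patent; it assumes binary feature vectors are encoded as Python integers, one bit per feature:

```python
def is_exclusive(v, s_minus):
    """Eq. 1: v is exclusive when v AND s != v for every s in S-,
    i.e. v has at least one set bit that each negative vector lacks."""
    return all((v & s) != v for s in s_minus)

# v = 0b1010 is exclusive against {0b0110, 0b1000}: neither vector
# contains both of v's set bits.
assert is_exclusive(0b1010, [0b0110, 0b1000])
# 0b1110 contains all of v's set bits, so v & 0b1110 == v: not exclusive.
assert not is_exclusive(0b1010, [0b1110])
```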
G: A subset of $S^+$, also known as a "view."
$\alpha(G)$: The binary vector resulting from the logical conjunction of the samples represented by G.
$\Theta(S^+, S^-)$: The collection of all subsets G of $S^+$ such that $\alpha(G)$ is exclusive against all vectors in $S^-$.
$\Omega(S^+, S^-)$: The collection of all maximal subsets in $\Theta(S^+, S^-)$, where maximal is defined as every H in $\Theta(S^+, S^-)$ such that if $\alpha(H)$ is not exclusive against some $\alpha(G)$ in $\Theta(S^+, S^-)$ then H = G.
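The constructions of $\Theta$ and $\Omega$ can be made concrete by brute force. The sketch below is illustrative only (not the patent's algorithm): it assumes samples are distinct integers encoding binary vectors, and it enumerates every subset, which is exponential in $|S^+|$; the data-reduction step described later exists precisely to keep class sizes computationally feasible.

```python
from itertools import combinations

def alpha(G):
    """alpha(G): bitwise AND (logical conjunction) of the vectors in G."""
    out = ~0
    for v in G:
        out &= v
    return out

def is_exclusive(v, s_minus):
    """Eq. 1: v & s != v for every s in S-."""
    return all((v & s) != v for s in s_minus)

def theta(s_plus, s_minus):
    """Theta(S+, S-): every nonempty subset G of S+ whose conjunction
    alpha(G) is exclusive against all vectors in S-."""
    return [frozenset(G)
            for r in range(1, len(s_plus) + 1)
            for G in combinations(s_plus, r)
            if is_exclusive(alpha(G), s_minus)]

def omega(theta_sets):
    """Omega(S+, S-): the maximal members of Theta, i.e. those that are
    proper subsets of no other member."""
    return [G for G in theta_sets if not any(G < H for H in theta_sets)]

# Two positive samples whose conjunction (0b1000) collides with S-:
# each singleton is exclusive and maximal, so Omega has two views.
th = theta([0b1100, 0b1010], [0b1000])
assert len(th) == 2 and len(omega(th)) == 2
```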
MAX: Number of samples within the largest subclass occurring in $\Omega(S^+, S^-)$, as a fraction of $|S^+|$ (eq. 2).

$$\mathrm{Max}(\Omega) = \max_{G \in \Omega} \frac{|G|}{|S^+|} \tag{2}$$

where |G| is the number of elements in set G.
AVE: The average number of samples within the subclasses occurring within $\Omega(S^+, S^-)$, as a fraction of $|S^+|$ (eq. 3).

$$\mathrm{Ave}(\Omega) = \frac{1}{M} \sum_{G \in \Omega} \frac{|G|}{|S^+|}, \quad \text{where } M = |\Omega| \tag{3}$$
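Equations 2 and 3 are straightforward to compute once $\Omega$ is known. A minimal sketch (the function names and example subclasses are hypothetical, not from the patent):

```python
def max_metric(omega_sets, s_plus_size):
    """Eq. 2: size of the largest subclass in Omega as a fraction of |S+|."""
    return max(len(G) for G in omega_sets) / s_plus_size

def ave_metric(omega_sets, s_plus_size):
    """Eq. 3: mean subclass fraction, (1/M) * sum over G of |G|/|S+|,
    with M = |Omega|."""
    return sum(len(G) for G in omega_sets) / (len(omega_sets) * s_plus_size)

# A hypothetical Omega with two views over a class of 4 samples:
subclasses = [{1, 2, 3}, {3, 4}]
assert max_metric(subclasses, 4) == 0.75   # 3/4
assert ave_metric(subclasses, 4) == 0.625  # (3/4 + 2/4) / 2
```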
SUMMARY OF THE INVENTION
An object of this invention is the provision of a computer automated method to select a subset of features in order to limit the number of features used by the character recognition engine, while optimizing its ability to discriminate among characters. That is, a method that narrows the number of features to those that provide the best within-class consistency of recognition and the best cross-class discrimination.
Another object of the invention is the provision of a computer automated method of feature selection that is suitable for use with a large number of classes (i.e., 1,000+ characters) and a large number of features (i.e., 100+ features).
A further object of this invention is the provision of a computer automated feature selection method in which the selection program is executed in a distributed operation on multiple processors.
Briefly, this invention contemplates the provision of a computer automated feature selection method based upon the evaluation of hyper-rectangles and the ability of these rectangles to discriminate between classes. The boundaries of the hyper-rectangles are established upon a binary feature space, where each bit indicates the relationship of a real feature value to a boundary within the minimum and maximum values for the feature across all samples. Data reduction combines the binary vector spaces so that the number of samples within a single class is within a computationally feasible range. Identification of subclasses finds maximal subsets of $S^+$ which are exclusive against $S^-$. Feature evaluation determines, within a single subclass, the contribution of each feature towards the ability to discriminate the subclass from $S^-$. The base algorithm examines each feature, dropping any feature which does not contribute towards discrimination. A pair of statistics is generated for each remaining feature: a measure of how many samples from the class are within the subclass, and a measure of how important each feature is to discriminating the subclass from $S^-$. The values for each subclass are then combined to generate a set of values for the class. These class feature metrics are further merged into metrics evaluating each feature's contribution across the entire set of classes. Feature reduction determines which features contribute the least across the entire set of classes; these features are removed from consideration. The algorithm drops features if they fail to reach a predetermined significance level. If all features are found to be significant, then the feature with the lowest contribution metric is discarded. Finally, peaking determination is used to decide whether the process should be repeated against the reduced feature space, by examining the rate of change within the significance metrics.
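The binary feature space described above (one bit per feature boundary, boundaries lying between each feature's minimum and maximum) can be sketched as follows. This is an illustrative assumption, not the patent's implementation: boundary placement is taken to be evenly spaced, which the text leaves open.

```python
def binarize(samples, n_bounds=3):
    """Map real-valued feature vectors to binary vectors: one bit per
    (feature, boundary) pair, set when the value exceeds that boundary.
    Boundaries are spaced evenly between each feature's observed
    minimum and maximum across all samples (an assumed placement)."""
    n_feat = len(samples[0])
    lo = [min(s[j] for s in samples) for j in range(n_feat)]
    hi = [max(s[j] for s in samples) for j in range(n_feat)]
    vectors = []
    for s in samples:
        bits = []
        for j in range(n_feat):
            for k in range(1, n_bounds + 1):
                bound = lo[j] + (hi[j] - lo[j]) * k / (n_bounds + 1)
                bits.append(1 if s[j] > bound else 0)
        vectors.append(bits)
    return vectors

# Feature 0 ranges over [0, 10], giving boundaries at 2.5, 5.0, 7.5:
vecs = binarize([[0.0, 10.0], [5.0, 0.0], [10.0, 5.0]])
assert vecs[0][:3] == [0, 0, 0]  # 0 exceeds no boundary
assert vecs[1][:3] == [1, 0, 0]  # 5 exceeds only 2.5
assert vecs[2][:3] == [1, 1, 1]  # 10 exceeds all three
```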
The basic algorithm is set forth in two articles by M. Kudo and M. Shimbo: "Feature Selection Based on the Structural Indices of Categories," Pattern Recognition 26 (1993), pp. 891-901, and "Optimal Subclasses with Dichotomous Variables for Feature Selection and Discrimination," IEEE Trans. Syst. Man Cybern. 19 (1989), pp. 1194-1199, which are incorporated herein by reference.


REFERENCES:
patent: 5,054,094 (1991-10-01), Barski
patent: 5,317,652 (1994-05-01), Chatterjee
patent: 5,321,773 (1994-06-01), Kopec et al.
patent: 5,526,444 (1996-06-01), Kopec et al.
patent: 5,594,809 (1997-01-01), Kopec
Mori et al., "Historical Review of OCR Research and Development," Proceedings of the IEEE, vol. 80, no. 7, July 1992, pp. 1029-1057.
Patrenahalli et al., "A Branch and Bound Algorithm for Feature Subset Selection," IEEE Transactions on Computers, vol. C-26, no. 9, Sept. 1977, pp. 917-922.
Okada et al., "An Optimal Orthonormal Syst