Validation of nearest neighbor classifiers

Data processing: artificial intelligence – Machine learning

Reexamination Certificate


Details

Classification: C706S020000, C711S173000
Type: Reexamination Certificate
Status: active
Patent Number: 06732083

ABSTRACT:

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
Not Applicable
REFERENCE TO A MICROFICHE APPENDIX
Not Applicable
BACKGROUND OF THE INVENTION
A classifier is a process or device that categorizes inputs. For example, a quality-control classifier may take measurements of an item as inputs and return a result of “defective” or “not defective.” Another example could be an iceberg-detecting classifier that takes as input a satellite image of an ocean area and produces as output either “iceberg” or “no iceberg.”
As another example, consider an email router system designed to be deployed in a crisis relief situation. The system users can be divided into groups according to interests, including “medical”, “food”, and “transport.” For each group, a classifier could be designed to determine whether an input email is of interest to the group. When an email is received by the system, each classifier can be applied, and the classifier results can determine which groups are sent the email. For example, an email detailing when blood supplies will arrive at an airport should be routed to the members of the “medical” and “transport” groups.
A “nearest neighbor” classifier is based on a set of inputs for which the categories are known. These input-output pairs are called the “in-sample” examples. The nearest neighbor classifier also includes a means of determining some measure of distance between inputs. When an “out-of-sample” input is encountered, the nearest neighbor classifier determines which in-sample example has the closest input and returns the corresponding output.
For the email router example, a nearest neighbor classifier for the medical group could consist of email messages from previous crises, with each message categorized as either “medically interesting” or “not medically interesting.” These are the in-sample examples. The distance measure between pairs of emails could be based on the number of words shared by the emails. To determine whether an out-of-sample email should be routed to medical personnel, the classifier would find the closest in-sample email. If that in-sample email is categorized as “medically interesting”, then the classifier would categorize the out-of-sample email as “medically interesting.”
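The classification rule described above can be sketched in code. This is a minimal illustration of a nearest neighbor classifier for the email example; the word-overlap similarity measure and the toy in-sample data are illustrative assumptions, not taken from the patent itself.

```python
# Minimal sketch of a nearest neighbor classifier for the email example.
# Distance is inverted here: more shared words means a closer neighbor.

def shared_words(a, b):
    """Count the distinct words two texts have in common."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def nearest_neighbor_classify(query, in_sample):
    """Return the category of the in-sample email sharing the most words."""
    _, category = max(in_sample, key=lambda ex: shared_words(query, ex[0]))
    return category

# Toy in-sample examples (illustrative only).
in_sample = [
    ("blood supplies arrive at the airport tomorrow", "medically interesting"),
    ("road closures expected on the northern route", "not medically interesting"),
]

print(nearest_neighbor_classify("blood supplies delayed at the airport", in_sample))
# prints "medically interesting"
```

In practice the distance measure would be more refined (for example, weighting rare words more heavily), but the structure, comparing the out-of-sample input to every in-sample example and returning the closest example's category, is the same.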
Accuracy is a major concern in the development, selection, and decision whether to deploy a classifier. The probability that a classifier will incorrectly classify an input drawn at random is known as the expected error rate. (The expected error rate is also called the error probability.) Methods to estimate the expected error rate are called validation techniques. Effective validation techniques are needed so that classifiers can be selected for low expected error rates and so that classifier results can be used with appropriate levels of trust.
The prior art is now described. One method to validate a nearest neighbor classifier is to remove some examples from the in-sample data to form a validation data set, use the remaining examples to form the classifier, then use the error rate of the classifier over the validation data as an estimate of the expected error rate. The drawback of this method is that the resulting classifier does not include the validation examples. On average, having fewer examples in a classifier results in a higher expected error rate. Hence, the primary goal of forming an accurate classifier suffers in order to accomplish the secondary goal of assessing the accuracy.
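The validation-set method just described can be sketched as follows. The one-dimensional inputs and absolute-difference distance are illustrative assumptions chosen to keep the example short.

```python
import random

# Sketch of simple holdout validation: withhold some in-sample examples,
# build the classifier from the rest, and report its error rate on the
# withheld set as the estimate of the expected error rate.

def nn_label(x, examples):
    """Label of the example whose input is closest to x."""
    return min(examples, key=lambda ex: abs(ex[0] - x))[1]

def holdout_error(examples, n_validation, seed=0):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    validation, remaining = shuffled[:n_validation], shuffled[n_validation:]
    misses = sum(1 for x, y in validation if nn_label(x, remaining) != y)
    return misses / n_validation

data = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.8, "b"), (0.9, "b"), (1.0, "b")]
print(holdout_error(data, n_validation=2))
```

Note that the classifier being validated here is built only from `remaining`; the withheld validation examples never contribute to it, which is precisely the drawback described above.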
There are two prior validation techniques to obtain probabilistic bounds on the expected error rates for classifiers formed using all in-sample examples. Both methods are based on using in-sample examples to validate classifiers based on subsets of the remaining in-sample examples, then bounding the difference in error rates between the classifiers consisting of subsets of the in-sample examples and classifiers consisting of all in-sample examples. One method is based on the “leave-one-out” estimate, which is the error rate of applying the classifier consisting of the remaining in-sample examples to each in-sample example. This method is described in “Distribution-Free Inequalities for the Deleted and Holdout Estimates” by L. Devroye and T. Wagner, published in IEEE Transactions on Information Theory, Volume 25, 1979, pages 202 to 207, which is incorporated by reference in its entirety. The other method is based on the “holdout” estimate, which is the error rate of applying the classifier consisting of a subset of the in-sample examples to the remaining in-sample examples. This method is also described in the paper “Distribution-Free Inequalities for the Deleted and Holdout Estimates.”
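The “leave-one-out” estimate can be sketched as follows. Each in-sample example is classified by the nearest neighbor classifier built from all the other examples, and the fraction misclassified is the estimate. As before, the one-dimensional distance and toy data are illustrative assumptions.

```python
# Sketch of the "leave-one-out" estimate for a nearest neighbor classifier.

def nn_label(x, examples):
    """Label of the example whose input is closest to x."""
    return min(examples, key=lambda ex: abs(ex[0] - x))[1]

def leave_one_out_error(examples):
    errors = 0
    for i, (x, y) in enumerate(examples):
        # Classify example i using all examples except i.
        others = examples[:i] + examples[i + 1:]
        if nn_label(x, others) != y:
            errors += 1
    return errors / len(examples)

data = [(0.1, "a"), (0.2, "a"), (0.8, "b"), (0.9, "b")]
print(leave_one_out_error(data))  # 0.0: each point's nearest other neighbor shares its label
```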
BRIEF SUMMARY OF THE INVENTION
The invention is a computer-based system to obtain probabilistic bounds on the expected error rate for a nearest neighbor classifier. The estimate used in the bound is an extension of the holdout estimate. The difference in error rates between the holdout classifier and the classifier consisting of all in-sample examples is estimated using truncated inclusion and exclusion.
The bounds are computed as follows. Partition the in-sample examples into validation examples and remaining examples. Form a reduced classifier from the remaining examples. Use the error rate of the reduced classifier over the validation examples to estimate the expected error rate of the reduced classifier. Then estimate the difference in expected error rates between the reduced classifier and the full classifier as follows. Partition the validation examples into two “holdout” subsets. Then use each holdout subset to estimate the difference in expected error rate due to including the other holdout subset in the full classifier.
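The partitioning steps above can be sketched in code. This is a simplified illustration that computes point estimates only; the probabilistic bounds of the invention require additional machinery (the truncated inclusion and exclusion mentioned above) that is not shown here, and the one-dimensional distance and data layout are illustrative assumptions.

```python
# Simplified sketch of the partitioning steps: estimate the reduced
# classifier's error rate on the validation examples, then use each
# holdout subset to estimate the change in error rate from adding the
# other holdout subset to the classifier.

def nn_label(x, examples):
    """Label of the example whose input is closest to x."""
    return min(examples, key=lambda ex: abs(ex[0] - x))[1]

def error_rate(test_set, classifier_examples):
    misses = sum(1 for x, y in test_set if nn_label(x, classifier_examples) != y)
    return misses / len(test_set)

def estimate_components(in_sample, n_validation):
    # Partition the in-sample examples into validation and remaining examples.
    validation, remaining = in_sample[:n_validation], in_sample[n_validation:]
    # Error rate of the reduced classifier over the validation examples.
    reduced_error = error_rate(validation, remaining)
    # Partition the validation examples into two holdout subsets; each
    # estimates the error-rate change from including the other subset.
    half = n_validation // 2
    h1, h2 = validation[:half], validation[half:]
    diff1 = error_rate(h1, remaining + h2) - error_rate(h1, remaining)
    diff2 = error_rate(h2, remaining + h1) - error_rate(h2, remaining)
    return reduced_error, diff1, diff2

in_sample = [(0.1, "a"), (0.9, "b"), (0.2, "a"), (0.8, "b"),
             (0.3, "a"), (0.7, "b"), (0.4, "a"), (1.0, "b")]
print(estimate_components(in_sample, n_validation=4))
```

The full-classifier error estimate would combine the reduced-classifier estimate with the estimated differences; turning these point estimates into probabilistic bounds is the subject of the invention.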
The bound is based only on the examples that comprise the classifier. Thus, all available examples can be used in the classifier; no examples need to be withheld to compute error bounds. Also, the probability that the prior “leave-one-out” and “holdout” error bounds are weaker than those produced by the process described here is shown to be large in “Validation of Nearest Neighbor Classifiers” by Eric Bax, published in IEEE Transactions on Information Theory, Volume 46, Number 7, 2000, pages 2746 to 2752, which is incorporated by reference in its entirety.


REFERENCES:
patent: 6298351 (2001-10-01), Castelli et al.
patent: 2002/0054694 (2002-05-01), Vachtsevanos et al.
L. Devroye and T. Wagner, “Distribution-Free Inequalities for the Deleted and Holdout Estimates,” IEEE Transactions on Information Theory, vol. 25, 1979, pp. 202-207.
E. Bax, “Validation of Nearest Neighbor Classifiers,” IEEE Transactions on Information Theory, vol. 46, no. 7, 2000, pp. 2746-2752.
L. P. Devroye, “The Uniform Convergence of Nearest Neighbor Regression Function Estimators and Their Application in Optimization,” IEEE Transactions on Information Theory, vol. IT-24, no. 2, Mar. 1978, pp. 142-151.
A. Antos, L. Devroye, and L. Györfi, “Lower Bounds for Bayes Error Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 7, Jul. 1999, pp. 643-645.
