Reexamination Certificate
1997-04-11
2001-10-02
Black, Thomas (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C370S465000, C706S025000, C382S203000
active
06298351
ABSTRACT:
TECHNICAL FIELD
This invention relates, in general, to classification techniques, and, in particular, to modifying unreliable training sets for use in supervised classification.
BACKGROUND ART
Classification is one of the most important operators used for phenomenal (or similarity) searches in image, video, and data mining applications. In a phenomenal search, a target pattern is usually classified according to a set of predefined classes. The target pattern can include, for instance, the spectral signature of a pixel from an image or video frame; the spatial signature of a block of an image or video frame defined by its texture features; the frequency signature of a time series such as stock index movement; or the spatial signature of 3D seismic data.
To achieve high classification accuracy, it is usually necessary to train a classifier with sufficient training data from each individual class. However, gathering reliable training data is usually difficult, if it is feasible at all. As an example, the current United States land cover/land use maps were developed around the late 1960s by the United States Geological Survey (USGS). These maps are not completely accurate because of errors in the photointerpretation of the images used to create them, their limited resolution, and inaccuracies in geolocation. Additional errors arise when these maps are used as a source of ground truth in conjunction with more recent images to train the classifier, because of various natural and artificial land cover transformations that have occurred since the maps were made. As a result, the accuracy of the classifier suffers.
Similarly, the classification of video, time series, and 3D seismic data can also encounter unreliable training data.
One way of generating more reliable training data typically involves clustering the data using one of the unsupervised classifiers or vector quantization methods. A human expert then labels the clusters manually. This methodology is appropriate, however, only for generating a small set of training data, since it requires human intervention. Furthermore, it does not automatically incorporate preexisting classified data even though those preclassified data may not be completely accurate.
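The cluster-then-label workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not name a specific clustering method, so a minimal one-dimensional k-means stands in for "one of the unsupervised classifiers", and the function name `kmeans_1d`, the feature values, and the class names supplied by the expert are all hypothetical.

```python
def kmeans_1d(values, initial_centers, iterations=20):
    """Minimal 1-D k-means (illustrative stand-in for an unsupervised classifier)."""
    centers = list(initial_centers)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[k] for k, c in enumerate(clusters)]
    assignments = [
        min(range(len(centers)), key=lambda k: abs(v - centers[k])) for v in values
    ]
    return centers, assignments


# Hypothetical spectral feature values for six pixels.
values = [0.10, 0.20, 0.15, 0.90, 1.00, 0.95]
centers, assignments = kmeans_1d(values, initial_centers=[0.0, 1.0])

# The manual step: a human expert names each cluster (hypothetical names).
expert_names = {0: "water", 1: "vegetation"}
training_labels = [expert_names[a] for a in assignments]
```

The dictionary `expert_names` is the part that requires human intervention, which is why this approach scales only to small training sets. The sketch also makes no use of any preexisting labels.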
Other techniques for generating training data include the discarding of outliers. These approaches invariably address samples that appear to be statistical anomalies. However, they cannot handle situations in which the training set is mislabeled or has changed.
Based on the foregoing, a need exists for a training set that is reliable and fully useable. Additionally, a need exists for a technique that allows the modification of an unreliable training set.
SUMMARY OF THE INVENTION
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a method for modifying a training set for use in data classification. The method includes, for example, determining that at least one datum of the training set is incorrect and reconstructing the at least one datum to provide a modified training set.
In one embodiment of the invention, the reconstructing includes modifying a label associated with the at least one datum to provide a correct label.
In a further embodiment of the invention, the training set includes a plurality of data, each with a corresponding label, and the determining includes dividing the plurality of data into a plurality of groups, and applying one or more rules to at least a portion of the data of at least one group to determine if any of the corresponding labels of the at least one portion of the data is incorrect.
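A minimal sketch of this determining step is given below. The summary does not state which rules are applied, so the rule used here (a label is suspect when it disagrees with a clear majority label in its group) is an assumption, as are the function name `flag_suspect_labels`, the `min_majority` threshold, and the sample data.

```python
from collections import Counter, defaultdict


def flag_suspect_labels(samples, group_of, min_majority=0.6):
    """Flag data whose label disagrees with a clear majority label in its group.

    samples:  list of (datum_id, label) pairs
    group_of: dict mapping datum_id to a group identifier
    Returns a list of (datum_id, current_label, majority_label) tuples.
    """
    # Divide the labeled data into groups.
    groups = defaultdict(list)
    for datum_id, label in samples:
        groups[group_of[datum_id]].append((datum_id, label))

    suspects = []
    for members in groups.values():
        counts = Counter(label for _, label in members)
        majority_label, majority_count = counts.most_common(1)[0]
        # Assumed rule: only act when one label clearly dominates the group.
        if majority_count / len(members) < min_majority:
            continue
        suspects.extend(
            (datum_id, label, majority_label)
            for datum_id, label in members
            if label != majority_label
        )
    return suspects


# Hypothetical training set: datum "c" sits in a group dominated by "forest".
samples = [("a", "forest"), ("b", "forest"), ("c", "water"),
           ("d", "water"), ("e", "water")]
group_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
suspects = flag_suspect_labels(samples, group_of)
```

Returning the majority label alongside each suspect keeps detection separate from correction, so a later step can decide whether to relabel the datum.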
In a further embodiment of the invention, the reconstructing includes constructing a contingency table for the data of the plurality of the groups, creating a histogram from the contingency table, identifying any regions of low confidence from the histogram, and modifying labels associated with data identified to be within a region of low confidence.
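The contingency-table step might look like the sketch below. The summary gives no formulas, so several choices here are assumptions: the table counts labels per group, the "histogram" is taken to be each group's label fractions, a "region of low confidence" is any (group, label) cell whose fraction falls below a hypothetical `threshold`, and data in such a cell are relabeled to the group's most frequent label.

```python
from collections import Counter, defaultdict


def contingency_table(labels, groups):
    """Count how often each label occurs within each group."""
    table = defaultdict(Counter)
    for label, group in zip(labels, groups):
        table[group][label] += 1
    return table


def relabel_low_confidence(labels, groups, threshold=0.2):
    """Relabel data whose label is rare within its group (assumed rule)."""
    table = contingency_table(labels, groups)
    new_labels = list(labels)
    for i, (label, group) in enumerate(zip(labels, groups)):
        row = table[group]
        total = sum(row.values())
        # Histogram value: fraction of the group that carries this label.
        if row[label] / total < threshold:
            # Low-confidence region: adopt the group's dominant label.
            new_labels[i] = row.most_common(1)[0][0]
    return new_labels


# Hypothetical data: one "forest" label inside a group that is 90% "urban".
labels = ["urban"] * 9 + ["forest"] + ["forest"] * 5
groups = [0] * 10 + [1] * 5
modified = relabel_low_confidence(labels, groups)
```

Keeping the original `labels` list untouched and returning a new list makes it easy to compare the training set before and after modification.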
In a further aspect of the invention, a system of modifying a training set for use in data classification is provided. The system includes a means for determining at least one datum of the training set is incorrect and a reconstruction unit adapted to reconstruct the at least one datum of the training set to provide a modified training set.
In yet another aspect of the invention, an article of manufacture is provided. The article of manufacture includes a computer useable medium having computer readable program code means embodied therein for causing the modification of a training set for use in data classification. The computer readable program code means in the article of manufacture includes computer readable program code means for causing a computer to effect determining at least one datum of the training set is incorrect, and computer readable program code means for causing a computer to effect reconstructing the at least one datum of the training set to provide a modified training set.
The present invention provides for reliable training sets. Additionally, it improves the performance of classification techniques, such as supervised classification techniques, which utilize the training set for deriving classification rules.
Additional features and advantages of the invention are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.
Castelli Vittorio
Hutchins Sharmila Thadhani
Li Chung-Sheng
Turek John Joseph Edward
Black Thomas
Ellenbogen, Esq. Wayne L.
Heslin & Rothenberg, P.C.
International Business Machines Corporation
Radigan, Esq. Kevin P.