Image analysis – Pattern recognition – Classification
Reexamination Certificate
1999-10-18
2003-05-13
Chang, Jon (Department: 2623)
C382S226000
active
06563952
ABSTRACT:
FIELD
The present invention relates to computer software for classifying data. In particular, the invention flattens the data and adds attributes in order to classify sparse, high-dimensional data, so that a data class can be predicted more accurately from the data attributes.
BACKGROUND
Classification is the process of assigning a data object, based on the data object's attributes, to a specific class from a predetermined set. Classification is a common problem studied in the fields of statistics and machine learning. Some well-known classification methods are decision trees, statistical methods, rule induction, genetic algorithms, and neural networks.
A classification problem has an input dataset called the training set that includes a number of entries each having a number of attributes (or dimensions). A training set with n possible attributes is said to be n-dimensional. The objective is to use the training set to build a model of the class label based on the attributes, such that the model can be used to classify other data not from the training set. The model often takes the form of a decision tree, which is known in the art.
An example of a typical classification problem is that of determining a driver's risk for purposes of calculating the cost of automobile insurance. A single driver (or entry) has many associated attributes (or dimensions), such as age, gender, marital status, home address, make of car, model of car, type of car, etc. Using these attributes, an insurance company determines what degree of risk the driver poses to the insurance company. The degree of risk is the resultant class to which the driver belongs.
Another example of a classification problem is that of classifying patients' diagnostic related groups (DRGs) in a hospital, that is, determining a hospital patient's final DRG based on the services performed on the patient. If each service that could be performed on the patient in the hospital is considered an attribute, the number of attributes (dimensions) is large, but most attributes have a “not present” value for any particular patient because not all possible services are performed on every patient. Such an example results in a high-dimensional, sparse dataset.
One problem is that an artificial ordering induced on the attributes lowers classification accuracy. That is, if two patients each have the same six services performed, but the services are recorded in different orders in the patients' respective files, a classification model would treat the two patients as two different cases, and the two patients may be assigned different DRGs.
Another problem in classifying high-dimensional sparse datasets is that the complexity required to build a decision tree is high. There are often hundreds, or even thousands, of possible attributes for each entry. Thus, there are hundreds, or thousands, of possible attributes on which to base each node's splitting criterion in the decision tree. The large number of attributes directly contributes to the high degree of complexity required to build a decision tree from each training set.
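To illustrate why the attribute count drives the cost of tree building, the following minimal sketch scans every boolean attribute once to choose a single node's split. The record does not name a splitting criterion, so the Gini impurity used here is an assumption, and the names gini and best_split are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a list of class labels (0.0 for a pure or empty list)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(rows: List[List[bool]], labels: List[str]) -> Tuple[int, float]:
    """Return (attribute index, weighted Gini) of the best boolean split.

    Assumption: Gini impurity is the criterion; the source does not name one.
    The loop visits every attribute, so the work for one node grows with
    the number of attributes.
    """
    best_index, best_score = -1, float("inf")
    for j in range(len(rows[0])):
        present = [lab for row, lab in zip(rows, labels) if row[j]]
        absent = [lab for row, lab in zip(rows, labels) if not row[j]]
        score = (len(present) * gini(present) + len(absent) * gini(absent)) / len(labels)
        if score < best_score:
            best_index, best_score = j, score
    return best_index, best_score


# Illustrative data: four entries, three boolean attributes, two classes.
rows = [
    [True, False, True],
    [True, True, False],
    [False, True, True],
    [False, False, True],
]
labels = ["high", "high", "low", "low"]
index, score = best_split(rows, labels)
```

With thousands of attributes the inner scan is repeated for every candidate attribute at every node, which is the cost the paragraph above describes.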
A goal of the invention is to provide a classification system that overcomes the identified problems.
SUMMARY
In one embodiment, the present invention provides a method and apparatus for classifying high-dimensional data. The invention performs classification by storing the data in a computer memory, flattening the data into a boolean representation, and building a classification model based on the flattened data. The classification model can be a decision tree or other decision structure. In one aspect of the invention, large itemsets are used as additional attributes on which to base the decision structure. In another aspect of the invention, clustering is performed to provide additional attributes on which to base the decision structure.
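The flattening step can be sketched as follows, assuming each entry is stored as a list of recorded attribute values. The function name flatten and the sample service names are hypothetical; this is an illustration, not the patented implementation.

```python
from typing import List, Sequence, Tuple


def flatten(records: Sequence[Sequence[str]]) -> Tuple[List[str], List[List[bool]]]:
    """Map each record to a boolean row with one column per distinct value.

    A column is True when the value appears anywhere in the record, so the
    order in which values were recorded has no effect on the row.
    """
    vocabulary = sorted({item for record in records for item in record})
    column = {item: i for i, item in enumerate(vocabulary)}
    rows = []
    for record in records:
        row = [False] * len(vocabulary)
        for item in record:
            row[column[item]] = True
        rows.append(row)
    return vocabulary, rows


# Two patients with the same services recorded in different orders,
# and a third patient with a different set of services.
patient_a = ["x-ray", "blood test", "ecg"]
patient_b = ["ecg", "x-ray", "blood test"]
patient_c = ["x-ray", "mri"]
vocabulary, rows = flatten([patient_a, patient_b, patient_c])
```

Here rows[0] and rows[1] are identical, so a model trained on the flattened rows cannot treat the first two patients as different cases.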
In another embodiment, the invention provides a method and apparatus for classifying high-dimensional data using nearest neighbor techniques. The data is stored in a computer memory, flattened into a boolean representation, and classified based on the m nearest neighbors of an entry.
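A minimal sketch of nearest neighbor classification over flattened data follows. This record does not state the distance measure or the voting rule, so Hamming distance and a majority vote among the m nearest neighbors are assumptions; the names hamming and classify_nearest and the class labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def hamming(a: Sequence[bool], b: Sequence[bool]) -> int:
    """Number of positions at which two boolean rows differ."""
    return sum(x != y for x, y in zip(a, b))


def classify_nearest(query: Sequence[bool],
                     training_rows: List[List[bool]],
                     labels: List[str],
                     m: int) -> str:
    """Return the most common label among the m training rows closest to query.

    Assumptions: Hamming distance and a simple majority vote.
    """
    ranked = sorted(range(len(training_rows)),
                    key=lambda i: hamming(query, training_rows[i]))
    votes = Counter(labels[i] for i in ranked[:m])
    return votes.most_common(1)[0][0]


# Illustrative flattened training set with two classes.
training_rows = [
    [True, True, False, False],
    [True, True, True, False],
    [False, False, True, True],
    [False, True, True, True],
]
labels = ["drg_a", "drg_a", "drg_b", "drg_b"]
predicted = classify_nearest([True, True, False, True], training_rows, labels, m=3)
```

In this example the three nearest rows carry the labels drg_a, drg_a, and drg_b, so the query is assigned drg_a.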
An advantage of the invention is that flattening the data removes any artificial ordering introduced into the data as a result of non-uniform recording procedures, thus yielding more accurate results.
Another advantage of the present invention is that the use of additional attributes based on large itemsets and clustering improves the accuracy of the resulting decision tree on which classification is based. This is achieved by determining which itemsets are large itemsets, and then using large itemsets as additional attributes on which a tree node's splitting criterion might be based. Clustering may also be used to increase accuracy in building a decision structure.
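The use of large itemsets as additional attributes can be sketched as follows. The sketch assumes the usual meaning of a large itemset, namely a set of items that occurs together in at least a minimum fraction of the entries, and for brevity it checks only itemsets of a fixed size by brute force. The function names, the min_support parameter, and the sample data are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def large_itemsets(records: List[Set[str]],
                   min_support: float,
                   size: int = 2) -> List[FrozenSet[str]]:
    """Return itemsets of the given size contained in at least
    min_support * len(records) records (brute-force enumeration)."""
    items = sorted(set().union(*records))
    threshold = min_support * len(records)
    found = []
    for candidate in combinations(items, size):
        itemset = frozenset(candidate)
        count = sum(1 for record in records if itemset <= record)
        if count >= threshold:
            found.append(itemset)
    return found


def itemset_attributes(records: List[Set[str]],
                       itemsets: List[FrozenSet[str]]) -> List[List[bool]]:
    """One extra boolean column per itemset: True when the record contains it."""
    return [[itemset <= record for itemset in itemsets] for record in records]


records = [
    {"x-ray", "cast", "ecg"},
    {"x-ray", "cast"},
    {"ecg", "blood test"},
    {"x-ray", "cast", "blood test"},
]
found = large_itemsets(records, min_support=0.5)
extra_columns = itemset_attributes(records, found)
```

Each extra column can then be appended to the flattened row for its entry, giving the tree builder a single attribute that captures a combination of services that frequently occur together.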
REFERENCES:
patent: 5142593 (1992-08-01), Kasano
patent: 5325445 (1994-06-01), Herbert
patent: 6052483 (2000-04-01), Baird et al.
patent: 6229918 (2001-05-01), Toyama
patent: 6307965 (2001-10-01), Aggarwal et al.
Kim et al. “Hierarchical Classification in High Dimensional, Numerous Class Cases.” 10th Annual Int. Geoscience and Remote Sensing Symposium, May 1990, pp. 2359-2362.*
Benediktsson et al. “Classification of Very High Dimensional Data Using Neural Networks.” 10th Annual Int. Geoscience and Remote Sensing Symposium, May 1990, pp. 1269-1272.*
Assa et al. “Displaying Data in Multidimensional Relevance Space with 2D Visualization Maps.” Proc. Visualization '97, Oct. 1997, pp. 127-134.*
Tu et al. “A Fast Two-Stage Classification Method for High-Dimensional Remote Sensing Data.” IEEE Trans. on Geoscience and Remote Sensing, vol. 36, No. 1, Jan. 1998, pp. 182-191.*
Jimenez et al. “Supervised Classification in High-Dimensional Space: Geometrical, Statistical, and Asymptotical Properties of Multivariate Data.” IEEE Trans. on Systems, Man, and Cybernetics-Part C: Applications and Reviews, vol. 28, No. 1, Feb. 1998, pp. 39.*
Agrawal et al. “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications.” Proc. of 1998 ACM SIGMOD Int. Conf. on Management of Data, 1998, pp. 94-105.
Ramkumar G. D.
Ranka Sanjay
Singh Vineet
Srivastava Anurag
Chang Jon
Dorsey & Whitney LLP
Hitachi America Ltd.