Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-04-25
2003-01-07
Mizrahi, Diane D. (Department: 2175)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000
Reexamination Certificate
active
06505207
ABSTRACT:
FIELD OF THE INVENTION
The present invention is related to data transformation and dimensionality reduction techniques associated with databases and, more particularly, to methods and apparatus for performing data transformation and dimensionality reduction in a supervised application domain in accordance with both a class variable and feature variables.
BACKGROUND OF THE INVENTION
In recent years, data mining applications have increased the development of techniques for processing high dimensional data, since most data mining problems are now posed in the context of very high dimensional data. Data sets which are inherently high dimensional may include, for example, demographic data sets in which the dimensions comprise information such as the name, age, salary, and other features which characterize a person. Typically, such problems have a large number of characteristics or features associated with them which are represented in a particular form. However, it is typically well known in the prior art that high dimensionality is a curse to many database applications and algorithms. This is because, in high dimensional space, traditional ways of defining similarity break down and cannot be effectively ascertained.
For this reason, it is always useful for database applications to be represented in a lower dimensional space using effective dimensionality reduction techniques. It is well known that database applications may be performed in either the “supervised” domain or the “unsupervised” domain. It is to be appreciated that supervised applications are those in which a special variable called the class variable exists, and the intent of the data mining application is to optimize various measures with respect to this special variable. For example, we may have a classification application in which the features variables comprise the different demographic attributes such as age, sex, salary, etc., and the class variable comprises people who have donated to charities in the past year. Then, this database may be used in order to model and determine the demographic behavior of regular donors. Such a software system may be used by a charitable organization to send mailers to all those people who are most likely to be donors. Such a problem is said to be a classification problem, and is considered “supervised” since it is focused around a special variable known as the class variable. On the other hand, there are many problems which are inherently “unsupervised.” Examples include clustering problems in which the demographic data is divided into clusters of similar people. In such cases, the data mining technique is not centered around any special variable and the clusters are found based on the variables listed in the demographic database.
Dimensionality reduction methods are often used for unsupervised applications. Techniques that have been effectively used in order to perform dimensionality reduction in large classes of applications in the unsupervised domain include, for example, singular value decomposition and KL (Karhunen Loeve) transform, see, e.g., C. Faloutsos, K.-I. Lin, “FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets,” Proceedings of the ACM SIGMOD Conference, 1995; and K. V. Ravi Kanth, D. Agrawal, A. Singh, “Dimensionality Reduction for Similarity Searching in Dynamic Databases,” Proceedings of the ACM SIGMOD Conference, 1998, the disclosures of which are incorporated by reference herein.
However, the existing techniques used in unsupervised problems for dimensionality reduction are not as effectively applicable to the supervised domain. This is because the dimensionality reduction in the unsupervised domain is focused only on the creation of a new set of feature variables which are mutually independent. In the supervised domain, however, dimensionality reduction has stronger implications since such applications include the use of the class variable for effective supervision. In supervised problems, which as mentioned above, use the set of feature variables and the class variable, the data is divided into two categories: the training data and the test data. The training data is used in order to develop the models which relate the feature variables to the class variable. For a given test example in which only the feature variables are known, it is desirable to find the class variable using the model which was constructed from the training data set. This problem is referred to as classification and has numerous applications in the literature including customer segmentation, target marketing and target mailing among others. Numerous techniques are known for building classification models in the prior art. These techniques include decision trees, DNF (Disjunctive Normal Form) rules, and neural networks, among others, see, e.g., Agrawal R., Ghosh S., Imielinski T., Iyer B., and Swami A., “An Interval Classifier for Database Mining Applications,” Proceedings of the 18th VLDB Conference, Vancouver, British Columbia, Canada 1992; Apte C, Hong S. J., Lepre J., Prasad S., and Rosen B, “RAMP: Rules Abstraction for Modeling and Prediction,” IBM Research Report RC 20271, June 1995; Quinlan J. R., “Induction of Decision Trees,” Machine Learning, Volume 1, Number 1, 1986; Shafer J., Agrawal R., and Mehta M., “SPRINT: A Scaleable Parallel Classifier for Data Mining,” Proceedings of the 22nd VLDB Conference, Bombay, India, 1996; Mehta M., Agrawal R., and Rissanen J., “SLIQ: A Fast Scaleable Classifier for Data Mining,” Proceedings of the Fifth International Conference on Extending Database Technology, Avignon, France, March 1996, the disclosures of which are incorporated by reference herein. However, all of these techniques are susceptible to the representation of the data used. In general, it is desirable to have a small set of features in order to effectively represent the data. Typical classification models respond more effectively to such sets of features.
Unfortunately, effective techniques for performing dimensionality reduction in the supervised domain do not exist and, as mentioned above, the existing techniques used in unsupervised problems for dimensionality reduction are not as effectively applicable to the supervised domain.
SUMMARY OF THE INVENTION
The present invention provides methods and apparatus for performing effective data transformation and dimensionality reduction in the supervised domain in accordance with both the class variable and the feature variables (also referred to herein as “features”). As mentioned above, existing dimensionality reduction, such as, for example, singular value decomposition, are practiced in the unsupervised domain. However, advantageously, the present invention provides methodologies for performing data transformation and dimensionality reduction in the supervised domain. The invention achieves at least two primary goals in the feature creation process:
(1) There is often considerable interdependence among the different features. For example, in a typical application, a person's age may be highly correlated with salary. Therefore, it may be useful to devise cases in which there is mutual independence in terms of the feature variables. In accordance with the present invention, methodologies are provided for performing transformations, so that there is independence among the feature variables.
(2) Certain features are inherently more discriminatory than others. By evaluating transformations of features, it is possible to devise features which are both discriminatory and non-redundant. The present invention provides methodologies for developing such sets of features.
In general, the present invention performs the separate processes of finding mutually independent features and finding features which have a very high discriminatory power. In accordance with the invention, this is performed by first transforming the data into a space where the features are represented by a set of mutually orthogonal vectors, and then selecting a subset of these vectors in order to represent the da
Aggarwal Charu C.
Yu Philip Shi-Lung
International Business Machines - Corporation
Mizrahi Diane D.
Zarick Gail H.
LandOfFree
Methods and apparatus for performing dimensionality... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Methods and apparatus for performing dimensionality..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Methods and apparatus for performing dimensionality... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3062274