Methods and apparatus for user-centered class supervision

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C706S045000

active

06804669

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to data mining techniques and, more particularly, to techniques for effectively incorporating human interaction in the design of similarity functions and in the class supervision of data.
BACKGROUND OF THE INVENTION
The design of data mining applications has received much attention in recent years. Examples of such applications include similarity determination and classification. In the context of data mining, it is assumed that we are dealing with a data set containing N objects in a dimensionality of d. Thus, in this data space, each object X can be represented by the d coordinates (x(1), . . . , x(d)). These d coordinates are also referred to as the features of the data, and the data space is accordingly also referred to as the feature space, which may reveal interesting characteristics of the data.
The effective design of distance functions used in similarity determination has been viewed as an important task in many data mining applications. The concept of similarity has been widely discussed in the data mining literature. A significant amount of research has been applied to similarity techniques such as, for example, those discussed in the literature: A. Hinneburg et al., “What is the nearest neighbor in High Dimensional Space?,” VLDB Conference, 2000; C. C. Aggarwal, “Re-designing distance functions and distance based applications for high dimensional data,” ACM SIGMOD Record, March 2001; and C. C. Aggarwal et al., “Reversing the dimensionality curse for similarity indexing in high dimensional space,” ACM SIGKDD Conference, 2001, the disclosures of which are incorporated by reference herein.
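By way of illustration only, a conventional distance function over the d coordinates of two objects may be sketched as follows. The Euclidean distance is chosen here merely as a familiar example and is not taken from the cited references; the names euclidean_distance, nearest_neighbor and data are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Distance between two objects given as equal-length coordinate sequences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same dimensionality d")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_neighbor(query, objects):
    """Return the object in the data set closest to the query object."""
    return min(objects, key=lambda obj: euclidean_distance(query, obj))

# Hypothetical data set: N = 3 objects, each with d = 2 features.
data = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
```

A fixed function of this kind weights every feature equally, which is one reason the design of the distance function itself is treated as a task in its own right.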
A different but related problem in data mining is the prediction of particular class labels from the feature attributes. In this problem, there is a set of features, and a special variable called the class variable. The class variable typically draws its value out of a discrete set of classes C(1), . . . , C(k). A test instance is defined to be a data example for which only the feature variables are known, but the class variable is unknown. Training data is used in order to construct a model which relates the features in the training data to the class variable. This model can then be used in order to predict the class behavior of individual test instances, also referred to as class labeling. The problem of classification has been widely studied in the literature, e.g., J. Gehrke et al., “BOAT: Optimistic Decision Tree Construction,” ACM SIGMOD Conference Proceedings, pp. 169-180, 1999; J. Gehrke et al., “RainForest: A Framework for Fast Decision Tree Construction of Large Data Sets,” VLDB Conference Proceedings, 1998; R. Rastogi et al., “PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning,” VLDB Conference, 1998; J. Shafer et al., “SPRINT: A Scalable Parallel Classifier for Data Mining,” VLDB Conference, 1996; and M. Mehta et al., “SLIQ: A Fast Scalable Classifier for Data Mining,” EDBT Conference, 1996, the disclosures of which are incorporated by reference herein.
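The relationship between training data, class variable and test instance may be sketched as follows. This is a minimal illustration that assumes a simple nearest-neighbor majority vote in place of the decision tree methods cited above; the function predict_class, the parameter k and the sample data are hypothetical.

```python
from collections import Counter

def predict_class(test_features, training_data, k=3):
    """Label a test instance by majority vote among its k nearest training records.

    training_data is a list of (features, class_label) pairs.
    """
    by_distance = sorted(
        training_data,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], test_features)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data with two classes, C1 and C2.
training = [
    ((1.0, 1.0), "C1"),
    ((1.2, 0.8), "C1"),
    ((0.9, 1.1), "C1"),
    ((5.0, 5.0), "C2"),
    ((5.2, 4.9), "C2"),
    ((4.8, 5.1), "C2"),
]
```

Here the test instance supplies only feature values, and the class label is inferred entirely from the training records.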
However, as sophisticated and, in some cases, complex as these similarity and classification techniques may be, these conventional automated techniques lack benefits that may be derived from human interaction during their design and application stages. Therefore, techniques are needed that effectively employ human interaction in order to design and/or perform data mining applications such as similarity determination and classification.
SUMMARY OF THE INVENTION
The present invention provides techniques for incorporating human or user interaction in accordance with the design and/or performance of data mining applications such as similarity determination and classification. Such user-centered techniques permit the mining of interesting characteristics of data in a data or feature space. For example, such interesting characteristics that may be determined in accordance with the user-centered mining techniques of the invention may include a determination of similarity among different data objects, as well as the determination of individual class labels. These techniques allow effective data mining applications to be performed in accordance with high dimensional data.
In accordance with a first aspect of the present invention, a computer-based technique of computing a similarity function from a data set of objects comprises the following steps/operations. First, a training set of objects is obtained. The user may preferably provide such training data. Next, the user is presented with one or more subsets of objects based on the training set of objects, wherein each subset comprises at least two objects of the data set. Preferably, the subset is a pair of objects from the data set. The user then provides feedback regarding similarity between the one or more subsets of objects. One or more sets of feature variables are defined based on features in the one or more subsets of objects. Next, one or more class variables are created in accordance with the user-provided feedback. Lastly, a similarity function or model is constructed which relates the one or more sets of feature variables to the one or more class variables.
Thus, advantageously, similarity between objects is represented as some function or algorithm determined by the attributes of the objects. The similarity model is then effectively estimated from the data set and user reactions.
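The steps of the first aspect may be sketched as follows, under stated assumptions. The feature variables for a pair are assumed to be the per-dimension absolute differences, the class variable is assumed to be 1 where the user marked the pair as similar and 0 otherwise, and the model is assumed to be a logistic regression fitted by batch gradient descent. None of these choices is prescribed by the text above, and all names are hypothetical.

```python
import math

def pair_features(x, y):
    """Feature variables for a pair of objects: per-dimension absolute differences."""
    return [abs(a - b) for a, b in zip(x, y)]

def fit_similarity_model(feedback, learning_rate=0.1, epochs=5000):
    """Fit a similarity function from user feedback.

    feedback is a list of (object_x, object_y, label) triples, where label is
    1 if the user judged the pair similar and 0 if dissimilar.
    """
    rows = [(pair_features(x, y), label) for x, y, label in feedback]
    d = len(rows[0][0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for feats, label in rows:
            z = bias + sum(w * f for w, f in zip(weights, feats))
            error = 1.0 / (1.0 + math.exp(-z)) - label  # predicted minus actual
            grad_b += error
            for j, f in enumerate(feats):
                grad_w[j] += error * f
        bias -= learning_rate * grad_b / len(rows)
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]

    def similarity(x, y):
        """Estimated probability that the user would judge x and y similar."""
        z = bias + sum(w * f for w, f in zip(weights, pair_features(x, y)))
        return 1.0 / (1.0 + math.exp(-z))

    return similarity

# Hypothetical feedback: the user treats pairs as similar when they agree
# on the first feature, regardless of the second.
feedback = [
    ((0.0, 0.0), (0.1, 5.0), 1),
    ((1.0, 2.0), (1.2, 2.5), 1),
    ((2.0, 1.0), (2.1, 4.0), 1),
    ((0.0, 0.0), (4.0, 0.2), 0),
    ((1.0, 3.0), (5.0, 8.0), 0),
    ((2.0, 2.0), (6.5, 5.0), 0),
]
```

Calling similarity = fit_similarity_model(feedback) yields a function that scores new pairs; in this sample the fitted model is driven mainly by the first feature, reflecting the user's reactions rather than a fixed distance measure.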
In accordance with a second aspect of the present invention, a computer-based technique of classifying a test instance in accordance with a data set comprises the following steps. First, a test instance is obtained. The user may preferably provide such a test instance. Next, the user is presented with at least one projection representing a distribution of the data set. The user then isolates a portion of the data presented in the at least one projection based on a relationship between the test instance and the data presented in the at least one projection. For instance, the user may isolate a subset of the data in the projection which the user determines to be most closely related to the test instance. Next, the behavior of the isolated portion of data is determined. Then, a class is determined for the test instance based on the isolated portion of data, when the user makes a decision to do so based on the determined behavior of the isolated portion of data. Alternatively, when the user makes a decision not to have a class determined for the test instance based on the isolated portion of data, other portions of the data set or a subset of the isolated portion of the data may be considered.
Further, in a preferred embodiment, the user is presented with two or more projections respectively representing different distributions of the data set such that the user may select one of the projections to be used when isolating a portion of data whose behavior is to be considered.
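One isolation step of the second aspect may be sketched as follows. The sketch assumes an axis-parallel projection onto a chosen pair of dimensions and a rectangular region standing in for the portion isolated by the user, and it takes the behavior of the isolated portion to be its class distribution. These are simplifying assumptions, and the names project, isolate, class_behavior and records are hypothetical.

```python
from collections import Counter

def project(records, dims):
    """Keep only the coordinates listed in dims (an axis-parallel projection)."""
    return [(tuple(feats[i] for i in dims), label) for feats, label in records]

def isolate(projected, lower, upper):
    """Records whose projected point lies inside the user-chosen box [lower, upper]."""
    return [
        (point, label)
        for point, label in projected
        if all(lo <= v <= hi for v, lo, hi in zip(point, lower, upper))
    ]

def class_behavior(isolated):
    """Fraction of the isolated records belonging to each class."""
    counts = Counter(label for _, label in isolated)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

# Hypothetical labeled data set with d = 3 features.
records = [
    ((1.0, 9.0, 1.0), "C1"),
    ((1.2, 2.0, 0.9), "C1"),
    ((0.8, 5.0, 1.1), "C1"),
    ((1.1, 7.0, 1.3), "C2"),
    ((5.0, 1.0, 5.0), "C2"),
    ((5.5, 8.0, 4.5), "C2"),
]

# The user views dimensions 0 and 2 and boxes the region around a test
# instance located near (1.0, 1.0) in that projection.
behavior = class_behavior(isolate(project(records, (0, 2)), (0.5, 0.5), (1.5, 1.5)))
```

With this sample data the isolated portion is three quarters C1 and one quarter C2, so the user may either accept C1 for the test instance or examine a different projection or a subset of the isolated data.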
Thus, advantageously, such a class labeling methodology according to the invention provides a technique of decision path construction, in which the user is provided with the exploratory ability to construct a sequence of hierarchically chosen decision predicates. This technique provides a clear understanding of the classification characteristics of a given test instance. At a given node on the decision path, the user is provided with a visual or textual representation of the data in a small number of sub-spaces. This can be used in order to explore particular branches, backtrack, or zoom in on particular sub-space-specific data localities which are highly indicative of the behavior of that test instance. This process continues until the user is able to construct a path with successive zoom-ins which is sufficiently indicative of a particular class. The process of zooming in is done with the use of visual aids, and can isolate data localities of arbitrary shapes in a given sub-space.
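The construction of a decision path by successive zoom-ins may be sketched as follows. The sketch assumes that each user choice is recorded as a box predicate over a chosen set of dimensions, that a predicate isolating no data is simply skipped (a crude stand-in for backtracking), and that the path is considered sufficiently indicative once one class reaches a purity threshold. The threshold, the box-shaped predicates and all names are hypothetical; the text above allows data localities of arbitrary shape.

```python
from collections import Counter

def follow_decision_path(records, predicates, purity=0.9):
    """Apply user-chosen predicates in sequence until one class dominates.

    predicates is a list of (dims, lower, upper) boxes, one per zoom-in.
    Returns (class_label or None, list of predicates actually applied).
    """
    path = []
    current = records
    for dims, lower, upper in predicates:
        narrowed = [
            (feats, label)
            for feats, label in current
            if all(lo <= feats[i] <= hi for i, lo, hi in zip(dims, lower, upper))
        ]
        if not narrowed:
            continue  # nothing isolated: skip this predicate and try the next
        current = narrowed
        path.append((dims, lower, upper))
        label, count = Counter(l for _, l in current).most_common(1)[0]
        if count / len(current) >= purity:
            return label, path
    return None, path

# Hypothetical labeled data set with d = 3 features.
records = [
    ((1.0, 9.0, 1.0), "C1"),
    ((1.2, 2.0, 0.9), "C1"),
    ((0.8, 5.0, 1.1), "C1"),
    ((1.1, 7.0, 1.3), "C2"),
    ((5.0, 1.0, 5.0), "C2"),
    ((5.5, 8.0, 4.5), "C2"),
]

# First zoom-in uses dimensions 0 and 2; the second refines on dimension 1.
predicates = [
    ((0, 2), (0.5, 0.5), (1.5, 1.5)),
    ((1,), (0.0,), (6.0,)),
]
```

In this sample the first predicate leaves a mixed locality, and only the second zoom-in yields a locality dominated by a single class, so the returned path contains both predicates.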
It is to be appreciated that the classification techniques of the present invention are more powerful than any of the conventional classification methods, since the invention uses
