Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-01-27
2003-01-14
Shah, Sanjiv (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C706S050000
Reexamination Certificate
active
06507843
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to database technology and more particularly to systems of data mining.
BACKGROUND
A number of systems for classifying data and performing data mining have been proposed. A number of these techniques involve decision tees, linear classification trees, and association rules. However, each of these techniques has significant disadvantages.
Conventional techniques do not handle well biased sample data which has a large volume and high dimensions. They tend to ignore weak signals. Thus, a need clearly exists for an improved system of data mining and classifying data.
SUMMARY
The aspects of the invention, Classification by Aggregating Emerging Patterns (CAEP), are directed towards a system of extracting rules in the form of emerging patterns (EP) and constructing a classifier from correctly labelled data (e.g. DNA sequences) to decide which category a sample belongs to and/or making a prediction on the sample. The system discovers features or signals that differentiate one category of data from another and builds a system to classify such data.
The system is able to nicely handle biased data that has a large volume and high dimensions. Also the system does not ignore weak signals. The EPs are associated with supports and the ratios of change in supports. The system is robust in the presence of biased sample data and is scalable in terms of large numbers of samples and in terms of dimensions in practical situations.
An EP is a signal/itemset whose supports increase significantly from one class of data to the next. In other words, it is a differentiating factor between the two classes.
The aggregation of the differentiating strengths, in terms of their supports and ratio of change, of all of some set of discovered EPs (whose cardinality is not bounded before classifier construction) that occur in a new case in a decision step is novel.
The normalization by dividing by a base score chosen at some percentile (such as 50%) across training instance of all classes is novel.
One way to find emerging pattern is based on a border-based representation of very large collections of itemsets, and processes which derive EPs by operating (such as differentials) on some borders. These borders can be first efficiently discovered using the Max-Miner technique which is scalable in terms of large number of tuples and high dimensions in practical situations.
The EPs can be used in the protein translation start-site identification problem. This is an example application of CAEP to datamining in Molecular Biology.
The CAEP classifier (i) extracts emerging patterns (EPs), (ii) uses each of these EPs as a multiple-attribute test, (iii) aggregates the power of individual EPs to get raw scores, and (iv) normalizes the raw scores by dividing them using some base scores chosen from a certain percentile of the scores of the training instances. CAEP has near equal prediction accuracy on all classes. CAEP is based on a novel border-based representation of very large collections of itemsets. It derives EPs by operating on some borders (which can also be efficiently discovered). It is scalable in terms of large number of tuples and high dimensions in practical situations.
In accordance with a first aspect of the invention, there is disclosed a method of classifying data by aggregating emerging patterns in the data using datasets for a plurality of classes using a computer processor. In the method, for each of the classes, an emerging pattern set is mined dependent upon instances of the set and opponent instances dependent upon predetermined growth rate and support thresholds. Aggregate scores of the instances are calculated for all of the classes. Base scores are then determined for each of the classes. For each test instance, the following sub-steps are performed: aggregate and normalized scores of test instance are calculated for each class; and a specified class is assigned to the test instance for which the test instance has a largest normalized score.
The method assumes the preparatory step of partitioning an original dataset into a predetermined number of datasets to form the datasets. The predetermined number of datasets is dependent upon the number of classes,
Preferably, the method further includes the step of reducing the number of emerging patterns dependent upon growth rates and supports of the emerging patterns.
Preferably, the mining step includes the following steps: borders of large itemsets are determined using a large-border discovery technique; and supports and growth rates of emerging patterns are determined for the class. Optionally, the large-border discovery technique is the Max-Miner technique.
Optionally, the mining step includes the following steps: two borders of large itemsets (large borders for short) are determined of instances of the class and of the opponent class; and all emerging pattern borders are found using multiple border pairs.
In accordance with a second aspect of the invention, there is disclosed an apparatus having a computer processor for classifying data by aggregating emerging patterns in the data using datasets for a plurality of classes. The apparatus includes:
a device for, for each of the classes, mining an emerging pattern set dependent upon instances of the class and opponent instances dependent upon predetermined growth rate and support thresholds;
a device for calculating aggregate scores of the instances for all of the classes;
a device for determining base scores for each of the classes; and
a device for, for each test instance, performing specified operations, the performing device including:
a device for calculating aggregate and normalized scores of test instance for each class; and
a device for assigning a specified class to the test instance for which the test instance has a largest normalized score.
In accordance with a third aspect of the invention, there is disclosed a computer program product having a computer readable medium having a computer program recorded therein for classifying data by aggregating emerging patterns in the data using datasets for a plurality of classes. The computer program product includes:
a computer program source code module for, for each of the classes, mining an emerging pattern set dependent upon instances of the class and opponent instances dependent upon predetermined growth rate and support thresholds;
a computer program source code module for calculating aggregate scores of the instances for all of the classes;
a computer program source code module for determining base scores for each of the classes; and
a computer program source code module for, for each test instance, performing specified operations, the computer program source code performing module includes:
a computer program source code module for calculating aggregate and normalized scores of test instance for each class, and
a computer program source code module for assigning a specified class to the test instance for which the test instance has a largest normalized score.
In accordance with a fourth aspect of the invention, there is disclosed a system for extracting emerging patterns from data using a processor. The system includes:
a device for mining emerging patterns for all of a number of categories of the data;
a device for computing aggregate differentiating scores for all samples of the data and the categories; and
a device for computing base scores for the categories.
Preferably, the system further includes a device for extracting the emerging patterns from the mined emerging patterns dependent upon the aggregated differentiating scores and the base scores.
Preferably, the system further includes a device for reducing the number of related emerging patterns. Two emerging patterns are related if one is a sub-pattern or subset of the other. Optionally, the system further includes a device for indicating whether the set of derived emerging patterns is to be reduced, operations of the reducing device being dependent upon the indicating device.
Optionally, the system further includes a device fo
Dong Guozhu
Li Jinyan
Wong Limsoon
Zhang Xiuzhen
Berkowitz Marvin C.
Kent Ridge Digital Labs
Liang Gwen
Nath & Associates PLLC
Novick Harold L.
LandOfFree
Method and apparatus for classification of data by... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and apparatus for classification of data by..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus for classification of data by... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3007415