Decision tree classifier with integrated building and...

Data processing: database and file management or data structures – Database design – Data structure types


Details

Classification: C707S793000
Type: Reexamination Certificate
Status: active
Patent Number: 06247016

BACKGROUND OF THE INVENTION
1. Description of the Prior Art
The proliferation of computerized database management systems has resulted in the accumulation of large amounts of data by the users of such systems. To be able to use these huge storehouses of data to best advantage, a process called “data mining” has emerged. Data mining involves the development of tools that can extract patterns from the data and the use of those patterns to establish trends which can be analyzed and acted upon.
Classification is an important aspect of data mining and involves the grouping of the data to make it easier to work with. U.S. Pat. No. 5,787,274 to Agrawal et al. provides extensive discussion of prior art classification methods used in data mining and is incorporated herein by reference.
Decision tree classification is the classification method of choice when time cost is an issue. As described in more detail below, prior art decision-tree classification methods utilize a two-phase process: a “building” phase and a “pruning” phase. In the building phase, a decision tree is built by recursively partitioning a training data set until all the records in a partition belong to the same class. For every partition, a new node is added to the decision tree. A partition in which all the records have identical class labels is not partitioned further, and the leaf corresponding to it is labeled with the class label. Records that do not fall into the class of a particular node are passed down a new branch until they reach a partition whose records share a single class.
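A minimal sketch of this recursive building phase may help fix ideas. The record layout (Python dicts carrying a special “class” key) and the naive median split on a single hypothetical “salary” attribute are illustrative assumptions, not the patent's method; real classifiers such as SLIQ and SPRINT select splits by evaluating an impurity measure such as the gini index.

```python
from collections import Counter

# Sketch of the building phase: recursively partition the training set
# until every partition is pure (all records share one class label).
def build(records):
    counts = Counter(r["class"] for r in records)
    if len(counts) == 1:                        # pure partition -> leaf
        return {"leaf": next(iter(counts)), "counts": dict(counts)}
    attr, threshold = choose_split(records)     # test for the new node
    yes = [r for r in records if r[attr] < threshold]
    no  = [r for r in records if r[attr] >= threshold]
    if not yes or not no:                       # no useful split -> majority leaf
        return {"leaf": counts.most_common(1)[0][0], "counts": dict(counts)}
    return {"test": (attr, threshold),          # internal node, one branch per outcome
            "yes": build(yes), "no": build(no)}

def choose_split(records):
    # Hypothetical placeholder: split "salary" at its median value. Real
    # algorithms evaluate many candidate splits with an impurity measure.
    vals = sorted(r["salary"] for r in records)
    return "salary", vals[len(vals) // 2]
```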
A number of algorithms for inducing decision trees have been proposed over the years. See, for example, L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984; B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, 1996; Manish Mehta, Rakesh Agrawal, and Jorma Rissanen, SLIQ: A Fast Scalable Classifier for Data Mining, in EDBT 96, Avignon, France, March 1996; J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1:81-106, 1986; J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993; and John Shafer, Rakesh Agrawal, and Manish Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, in Proc. of the VLDB Conference, Bombay, India, September 1996; all of which are incorporated herein by reference.
The building phase constructs a perfect tree (a tree that classifies, with precision, every record from the training set). However, one often achieves greater accuracy in the classification of new objects by using an imperfect, smaller decision tree rather than one which perfectly classifies all known records. The reason is that a decision tree which precisely classifies every record may be overly sensitive to statistical irregularities and idiosyncrasies of the training set. Thus, all of the prior art algorithms known to the applicant perform a pruning phase after the building phase in which nodes are iteratively pruned to prevent “overfitting” and to obtain a tree with higher accuracy.
For pruning, the Minimum Description Length (“MDL”) principle (or another known pruning method) is applied to prune the tree built in the growing phase and make it more general. The MDL principle states that the “best” tree is the one that can be encoded using the fewest bits. Thus, under the MDL principle, the pruning phase identifies the subtree of the built tree that can be encoded with the fewest bits. The cost C (in bits) of communicating classes using a decision tree comprises (1) the bits needed to encode the structure of the tree itself, and (2) the bits needed to encode the classes of the records in each leaf of the tree.
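The cost C can be made concrete with a toy encoding: one structure bit per node, plus entropy-based bits for the class labels in each leaf. This is a deliberate simplification of the encodings used in the MDL literature, and it assumes leaves carry the “counts” field produced by the building sketch above.

```python
import math

def leaf_class_bits(counts):
    # Bits to encode the class labels of a leaf's records using the
    # leaf's empirical class distribution (Shannon code lengths).
    n = sum(counts.values())
    return sum(-c * math.log2(c / n) for c in counts.values() if c)

def mdl_cost(node):
    # Cost C of a (sub)tree: structure bits plus leaf class-label bits.
    if "leaf" in node:
        return 1 + leaf_class_bits(node["counts"])
    return 1 + mdl_cost(node["yes"]) + mdl_cost(node["no"])
```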
MDL pruning (1) leads to accurate trees for a wide range of data sets, (2) produces trees that are significantly smaller in size, and (3) is computationally efficient and does not require a separate data set for pruning. For the above reasons, the pruning algorithms of the present invention employ MDL pruning. A detailed explanation of MDL-based pruning can be found in Manish Mehta, Jorma Rissanen, and Rakesh Agrawal, MDL-based Decision Tree Pruning, in Proc. of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), Montreal, Canada, August 1995, incorporated herein by reference.
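Pruning under this cost is then a bottom-up comparison: a subtree is collapsed into a single majority-class leaf whenever the leaf encoding is no more expensive than the subtree encoding. The sketch below reuses mdl_cost from above and is only in the spirit of MDL pruning, not the patent's exact procedure.

```python
def merged_counts(node):
    # Class counts over all training records that reached this subtree.
    if "leaf" in node:
        return dict(node["counts"])
    total = merged_counts(node["yes"])
    for cls, c in merged_counts(node["no"]).items():
        total[cls] = total.get(cls, 0) + c
    return total

def prune(node):
    # Bottom-up: prune the children first, then collapse this subtree
    # into a leaf if the leaf can be encoded in no more bits.
    if "leaf" in node:
        return node
    node = {**node, "yes": prune(node["yes"]), "no": prune(node["no"])}
    counts = merged_counts(node)
    as_leaf = {"leaf": max(counts, key=counts.get), "counts": counts}
    return as_leaf if mdl_cost(as_leaf) <= mdl_cost(node) else node
```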
2. The Process of Decision-Tree Based Classification
To understand the present invention it is necessary to have a basic understanding of the process involved in building and pruning a decision tree. An initial data set comprises a group of sample records called the “training set” and is used to establish a model or description of each class, which model is then used to classify future records as they are input to the database.
Each sample record has multiple attributes and is identified or “tagged” with a special classifying attribute which indicates the class to which the record belongs. Attributes can be continuous (e.g., numerical values which can be placed in numerical order and grouped by their value) or categorical (values which can only be grouped by their category). For example, as shown in FIG. 1, a training set might comprise sample records identifying the salary level (a continuous attribute) and education level (a categorical attribute) of a group of applicants for loan approval. In this example, each record is tagged with either an “accept” classifying attribute or a “reject” classifying attribute, depending upon the parameters for acceptance or rejection set by the user of the database. The goal of the classification step is to generate a concise and meaningful description for each class that can be used to classify subsequent records.
As shown in FIG. 1, there is a single record corresponding to each loan request, each of which is tagged with either the “accept” label if the loan request is approved or the “reject” label if the loan request is denied. Each record is characterized by two attributes: salary (e.g., $10,000 per year) and education level completed (e.g., high-school, undergraduate, graduate).
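For concreteness, such a training set might be represented as follows; the specific salary and education values are invented for illustration, in the spirit of FIG. 1.

```python
# Hypothetical loan-approval training records: salary is a continuous
# attribute, education is categorical, and "class" is the special
# classifying attribute ("accept" or "reject").
training_set = [
    {"salary": 12_000, "education": "undergraduate", "class": "reject"},
    {"salary": 15_000, "education": "high-school",   "class": "reject"},
    {"salary": 17_000, "education": "graduate",      "class": "accept"},
    {"salary": 35_000, "education": "undergraduate", "class": "accept"},
]
```

Feeding these records to the build sketch above happens to yield a two-leaf tree split on salary, since the invented data is separable on that attribute alone.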
1. Building Phase
FIG. 2 is an example of a decision tree for the training data in FIG. 1. Each internal node of the decision tree (denoted by a circle in FIG. 2) has a “test” involving an attribute, and an outgoing branch for each possible outcome. For example, at the root node 10 the test is “is the salary level of the applicant less than $20,000?” If the answer to this inquiry is “no,” the loan application is automatically accepted, ending the inquiry and establishing a “leaf” 20 for the acceptance (a leaf is the ultimate conclusion of a partition, after which no further inquiry is made, and is denoted by a square in FIG. 2). Thus, in the example, an applicant who has a salary of at least $20,000 is classified in a class for those applicants who qualify for a loan based on their salary alone.
If the answer to the test at root node 10 is “yes” (i.e., the applicant's salary is less than $20,000), further inquiry is made to determine whether the applicant can pass the test at internal node 30, namely “does the applicant possess at least a graduate level of education?” If the answer to this inquiry is “yes,” then the loan is accepted, even though the salary level of the applicant is below the $20,000 threshold, and a leaf 40 is established for the acceptance. This places the applicant in a class comprising applicants who do not qualify for a loan based on salary but who do qualify based on their education level.
If the answer to the inquiry at node 30 is “no,” then the loan is rejected and a leaf 50 is established for the rejection.
The outcome of the test at an internal node determines the branch traversed and thus the next node visited. The class for the record is simply the class of the leaf that is ultimately reached.
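The FIG. 2 tree and this traversal rule can be written out directly. A minimal sketch, with the reference numerals from the figure noted in comments; the record layout matches the hypothetical training records above.

```python
# The FIG. 2 example tree: root node 10 tests salary, internal node 30
# tests education level; leaves 20, 40, and 50 carry the class labels.
fig2_tree = {
    "test": lambda r: r["salary"] < 20_000,              # node 10
    "yes": {
        "test": lambda r: r["education"] == "graduate",  # node 30
        "yes": {"leaf": "accept"},                       # leaf 40
        "no":  {"leaf": "reject"},                       # leaf 50
    },
    "no": {"leaf": "accept"},                            # leaf 20
}

def classify(node, record):
    # Follow the branch chosen by each internal node's test until a
    # leaf is reached; the record's class is the class of that leaf.
    while "leaf" not in node:
        node = node["yes"] if node["test"](record) else node["no"]
    return node["leaf"]

print(classify(fig2_tree, {"salary": 17_000, "education": "graduate"}))  # -> accept
```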
