Data processing: artificial intelligence – Machine learning
Reexamination Certificate
1998-04-16
2001-05-08
Lintz, Paul R. (Department: 2762)
Data processing: artificial intelligence
Machine learning
C706S012000, C707S793000
Reexamination Certificate
active
06230151
ABSTRACT:
FIELD OF THE INVENTION
The invention relates in general to computer data mining, and in particular to a highly scalable method and system for generating a decision tree classifier from data records in a shared-memory multiprocessor system.
BACKGROUND OF THE INVENTION
Data mining is an emerging application of computer databases that involves the development of tools for analyzing large databases to extract useful information from them. As an example of data mining, customer purchasing patterns may be derived from a large customer transaction database by analyzing its transaction records. Such purchasing habits can provide valuable marketing information to retailers in displaying their merchandise or controlling the store inventory. Other applications of data mining include fraud detection, store location search, and medical diagnosis.
Classification of data records according to certain classes of the records is an important part of data mining. In classification, a set of example records, referred to as a training set or input data, is provided from which a record classifier will be built. Each record of the training set consists of several attributes where the attributes can be either numeric or categorical. Numeric (or continuous) attributes are those from an ordered domain, such as employee age or employee salary. Categorical attributes are those from an unordered domain such as marital status or gender. One of these attributes, called the classifying attribute, indicates the class to which the record belongs. The objective of classification is to build a model of the classifying attribute, or classifier, based upon the other attributes. Once the classifier is built, it can be used to determine the classes of future records.
A decision-tree classifier is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from the same class. The tree generally has a root node, interior nodes, and multiple leaf nodes where each leaf node is associated with the records belonging to a record class. Each non-leaf node of the tree contains a split point which is a test on one or more attributes to determine how the data records are partitioned at that node. Decision trees are compact, easy to understand and to be converted to classification rules, or to Structured Query Language (SQL) statements for accessing databases.
For example,
FIG. 1
shows a training set where each record represents a car insurance applicant and includes three attributes: Age, Car Type, and Risk level.
FIG. 2
is a prior art decision-tree classifier created from the training records of FIG.
1
. Nodes
20
and
22
are two split points that partition the records based on the split tests (Age<27) and (Car Type in {Sports}), respectively. The records of applicants whose age is less than 27 years belong to the High Risk class associated with node
24
. The records of those older than 27 years but have a sport car belong to the High Risk class associated with node
26
. Other applicants fall into the Low risk class of node
28
. The decision tree then can be used to screen future applicants by classifying them into the High or Low Risk categories.
Prior to interest in classification for database-centric data mining, it was tacitly assumed that the training sets could fit in memory. Recent work has targeted the massive training sets usual in data mining. Developing classification models using larger training sets can enable the development of higher accuracy models. Examples of recent classification systems that can handle large disk-resident datasets include SLIQ and SPRINT. SLIQ is described in the assignee's U.S. patent application for “Method and System for Generating a Decision-Tree Classifier for Data Records,” Ser. No. 08/564,694. It is a scalable decision-tree classifier in which the branches of the tree are built in parallel, breadth-first. However, SLIQ is limited by the amount of data that can be classified and, in a parallel computer such as a multi-processor system, it does not take advantage of the system's parallelism to build the decision tree across the processors.
A serial version of SPRINT was described in the assignee's U.S. patent application for “Method and System for Generating a Decision-Tree Classifier Independent of System Memory Size,” Ser. No. 08/646,893. Serial SPRINT also constructs the tree in a breadth-first order, but uses a one-time presorting technique to reduce the cost of numeric attribute evaluation. It initially creates a disk-based attribute list for each attribute in the data, where each entry of the list consists of an attribute value, a class label, and a tuple identifier (tid) of the corresponding data tuple. Initial lists for numeric attributes are sorted by attribute value, while those for categorical attributes stay in unsorted order. All attribute lists are initially associated with the root of the classification tree. As the tree is grown and split into subtrees, the attribute lists are also split. By preserving the order of records in the partitioned lists, no resorting is required.
FIG. 3
shows an example of the initial sorted attribute lists associated with the root node
0
of the tree for the training data in FIG.
1
and the resulting partitioned lists for the two children (nodes
1
and
2
).
To evaluate split points for the lists, serial SPRINT uses histograms to tabulate the class distributions of the records in an attribute list. Splits for a numeric attribute A are of the form {value(A)<x} where x is a value in the domain of A. Splits for a categorical attribute A are of the form {value(A) in X} where X is a subset of domain(A). The split test is chosen to “best” divide the training records associated with a node. Having found the winning split test for a node, the node's attribute lists are split into two to form the attribute lists for the child nodes (nodes
1
-
2
in FIG.
3
). The attribute list for the winning attribute (Age in our example at the root node
0
) is partitioned simply by scanning the list and applying the split test to each record. For the remaining “losing” attributes (Car Type in our example) more work is needed. While dividing the winning attribute, a probe structure (bit mask or hash table) on the tids is created, noting the child where a particular record belongs. While splitting other attribute lists, this structure is consulted for each record to determine the child where this record should be placed. If the probe structure is too big to fit in memory, the splitting takes multiple steps. In each step only a portion of the attribute lists are partitioned.
The serial SPRINT method was oriented to a single processor environment and does not take advantage of the resources in a shared-memory multiprocessor system in building the classifier.
A parallel version of SPRINT is described in the assignee's U.S. patent application for “Method and System for Generating a Decision-Tree Classifier In a Multi-Processor System,” Ser. No. 08/641,404 (hereinafter parallel SPRINT application). Like other previous parallel classification methods, parallel SPRINT is primarily oriented to distributed-memory (or shared-nothing) parallel machines. In such a machine, each processor has its own memory and local disks, and communicates with other processors only via passing messages. These methods require a significant amount of communication between the processors in order to evaluate a condition for splitting a tree node, thus limiting the number of possible tests that can be considered.
First, the training examples are distributed equally among all the processors. Each processor generates its own attribute list partitions in parallel by projecting each attribute from training examples it was assigned.
FIG. 4
shows a typical partitioning of two attribute lists (for node
0
in
FIG. 3
) over two processors using parallel SPRINT. Each processor is responsible for splitting its own attribute-list partitions in the split ph
Agrawal Rakesh
Ho Ching-Tien
Zaki Mohammed J.
International Business Machines - Corporation
Khatri Anil
Lintz Paul R.
McSwain Marc D.
Tran Khanh Q.
LandOfFree
Parallel classification for data mining in a shared-memory... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Parallel classification for data mining in a shared-memory..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Parallel classification for data mining in a shared-memory... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2557965