Method of constructing binary decision trees with reduced...

Type: Reexamination Certificate
Classification: Data processing: database and file management or data structures – Database design – Data structure types
Filed: 1999-12-15
Issued: 2002-08-27
Status: active
Patent Number: 06442561
BACKGROUND OF THE INVENTION
The present invention relates to computer techniques for developing binary decision trees from a training database, such decision trees being used to classify records according to probabilities derived from the training database. Specifically, the present invention provides a way of preparing or updating binary decision trees from very large training databases held in slow memory, such as disk drives, while reducing the necessary accesses to that slow memory.
Referring to FIG. 1, a large training database 10 has records 12 including a record identifier 14, record attributes 16, and a classification 18. For example, the record identifier 14 may be the name of a customer and the attributes may be the customer's AGE, INCOME, and number of CHILDREN. The classification 18 may be, for example, whether the customer responded to a promotional coupon for children's toys.
Desirably, the classification 18 could be determined for existing customers in the unclassified data 26 whose attributes 16 are known but who have not yet responded to the promotional coupon and thus cannot be classified. "Data mining" seeks to establish a predictive classification of records based on the record's attributes 16.
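For concreteness, a record 12 of FIG. 1 can be pictured as a small structure pairing an identifier with named attributes and a known classification. The sketch below is illustrative only; the class name TrainingRecord and the example values are hypothetical, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class TrainingRecord:
    """One record 12: an identifier 14, named attributes 16,
    and a known classification 18 (hypothetical representation)."""
    identifier: str       # e.g. the customer's name
    attributes: dict      # e.g. {"AGE": 34, "INCOME": 52000, "CHILDREN": 2}
    classification: str   # e.g. whether the customer used the coupon

example = TrainingRecord(
    identifier="J. Smith",
    attributes={"AGE": 34, "INCOME": 52000, "CHILDREN": 2},
    classification="responded",
)
```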
Referring to FIG. 2, the classification of records from their attributes may be accomplished by preparing a binary decision tree 24 from the training database 20 using any of a number of tree constructors 22 executed on an electronic computer, as are well known in the art. The binary decision tree 24 is then used to sort the unclassified data 26 to produce as results 32 the appropriate classification.
Referring to FIG. 3, the binary decision tree 24 follows general tree topology including a root node 28a (shown at the top of FIG. 3), a number of intermediate nodes 28, and leaf nodes 30 (shown at the bottom of FIG. 3). Each intermediate node 28 is assigned to a particular attribute 16 and a split point in the domain of the attribute 16 which defines how records are to be sorted or passed to the nodes below. Each leaf node 30 is assigned to a particular classification.
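This topology can be sketched with a single node type in which intermediate nodes 28 carry an attribute and split point and leaf nodes 30 carry only a classification. The Node class, split values, and leaf labels below are hypothetical illustrations, not taken from FIG. 3:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """Intermediate node 28 (attribute + split point) or, when
    classification is set, a leaf node 30."""
    attribute: Optional[str] = None       # e.g. "AGE"; None for a leaf
    split_point: Optional[float] = None   # e.g. 30; None for a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    classification: Optional[str] = None  # set only on leaf nodes 30

# A small tree in the spirit of FIG. 3: AGE at the root node 28a,
# AGE and INCOME at the nodes below it (values are made up).
tree = Node(attribute="AGE", split_point=30,
            left=Node(attribute="AGE", split_point=50,
                      left=Node(classification="no"),
                      right=Node(classification="yes")),
            right=Node(attribute="INCOME", split_point=40000,
                       left=Node(classification="no"),
                       right=Node(classification="yes")))
```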
The unclassified data 26 are sorted by comparing their attributes and the values of those attributes against the attributes and split points of each node, starting at root node 28a and then passing the record according to that split point to the next lower node 28b. Thus, for example, the root node 28a may relate to the AGE attribute and have a split point of 30 (and a "splitting predicate" that AGE must be less than or equal to 30). The records 12 of FIG. 1 are thus sorted at the root node 28a so that if their AGE attribute 16 has a value of less than or equal to 30, the record 12 proceeds down the right branch of the tree from root node 28a, but if the AGE attribute has a value greater than 30, the record 12 proceeds down the left branch of the tree from root node 28a. The branches from node 28a lead to additional nodes 28b and 28c, each also having an attribute and a splitting predicate, and this process is repeated until the records arrive at a leaf node 30 where a category may be assigned. Note that the attributes for nodes 28b and 28c need not be the same and in this case are AGE and INCOME, respectively.
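The sorting just described is a short recursive walk from the root node 28a to a leaf node 30. A minimal sketch, reusing the hypothetical Node class above and the text's convention that a value satisfying the predicate (less than or equal to the split point) goes right:

```python
def classify(node, record_attributes):
    """Pass one record's attributes 16 down the tree until a leaf
    node 30 supplies the classification."""
    if node.classification is not None:   # reached a leaf node 30
        return node.classification
    value = record_attributes[node.attribute]
    if value <= node.split_point:         # e.g. AGE <= 30: right branch
        return classify(node.right, record_attributes)
    return classify(node.left, record_attributes)  # e.g. AGE > 30

# e.g. classify(tree, {"AGE": 25, "INCOME": 52000, "CHILDREN": 1})
```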
The attributes 16 need not be numerical but may be categorical, for example, male or female, in which case the splitting predicate is a subset of the attribute's domain.
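As a minimal illustration, such a categorical predicate is simply a set-membership test (the subset shown is hypothetical):

```python
# A categorical splitting predicate: a chosen subset of the attribute's
# domain rather than a numeric split point (illustrative subset only).
split_subset = {"female"}

def satisfies_predicate(value):
    """True if a record with this attribute value follows the branch
    assigned to the subset."""
    return value in split_subset
```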
Referring to FIG. 4, the tree constructor 22 which creates the binary decision tree 24 from the training database 20 may operate according to a number of well known algorithms to determine the attributes, their order within the binary decision tree 24, and the appropriate splitting predicates. A general model of a tree constructor 22 includes a sorter 35 receiving the records 12 and at each node 28 dividing them into left and right groups 38 and 40 according to a trial splitting predicate 36. The left and right groups 38 and 40 are provided to a goodness evaluator 42 which determines how effective the trial splitting predicate 36 is according to some predetermined criteria related to the classifications of the records of the left and right groups 38 and 40, for example, an impurity function.
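A common impurity function is the Gini index. The sketch below shows a plausible goodness evaluator 42 for one trial splitting predicate 36; the function names are hypothetical and the patent does not mandate this particular measure:

```python
from collections import Counter

def gini(records):
    """Gini impurity of a group of (attributes, classification) pairs:
    1 minus the sum of squared class proportions; zero when pure."""
    n = len(records)
    if n == 0:
        return 0.0
    counts = Counter(label for _, label in records)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_goodness(records, attribute, split_point):
    """Divide a non-empty group of records into left and right groups
    38 and 40 under a trial splitting predicate 36 and return their
    weighted impurity (lower is better)."""
    right = [(a, y) for a, y in records if a[attribute] <= split_point]
    left = [(a, y) for a, y in records if a[attribute] > split_point]
    n = len(records)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```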
The trial splitting predicate 36 is adjusted appropriately based on this determination and the records 12 are reviewed again for evaluation. Ultimately, after possibly many reviews of the records, a final splitting predicate 45 is produced (being an attribute, split point, and relationship) for the node 28, and the process is repeated for other nodes 28. A goodness value 43 may be derived for each splitting predicate 45.
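The adjust-and-re-evaluate cycle then amounts to a search over candidate split points for the one with the best goodness value 43. A sketch using split_goodness above (hypothetical code; real constructors differ in how they enumerate candidates):

```python
def best_split(records, attribute):
    """Search the attribute's observed values for the final splitting
    predicate 45: the split point with the lowest weighted impurity.
    Each candidate forces another review of the records 12, which is
    what makes this loop expensive when the records live in low-access
    memory."""
    candidates = sorted({a[attribute] for a, _ in records})
    best_point, best_score = None, float("inf")
    for point in candidates:
        score = split_goodness(records, attribute, point)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score
```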
While particular tree construction algorithms vary, it can be seen that this process of determining splitting predicates 45 requires repeated access of the records 12. For large databases where the records 12 are held in relatively slow electronic memories, such as magnetic disk drives, constructing the binary decision tree 24 may be prohibitively time consuming. Even in cases where this investment in time is warranted for an initial generation of a binary decision tree 24, the time investment may discourage frequent updating of the binary decision tree 24 as additional data comes in.
One solution to the problem of slow memory access is to place the training database 20 in a high-speed memory, such as those principally constructed of solid-state transistors, also known as random access memory (RAM). Such memories will be termed herein "high-access" memories, distinguishing them from disk drives and other similar mass storage devices ("low-access"), both in the speed of memory access and in the flexibility of that access (random vs. sequential), which may affect the time required to access the necessary data of the training database 20. These categories are not absolute but reflect the inevitable differences between accessibility and capacity of current and foreseeable memory systems.
Unfortunately, the solution of using high-access memory exclusively is not available for many commercially valuable training databases 20, which are too large for this to be practical. What is needed is a method of constructing and updating binary decision trees from large training databases 20 that overcomes the time limitation inherent in the use of low-access memory.
BRIEF SUMMARY OF THE INVENTION
The present inventors have recognized that a binary decision tree constructed from a small subset of the training database (sized to fit entirely in high-access memory) will nevertheless be close to the binary decision tree that would have been constructed from the entire training database. This "small-sample" binary decision tree constructed from the subset may then be used to coordinate an efficient review of the entire training database that reduces accesses to the memory in which it is stored.
Specifically, the present invention provides a method of data mining using a computer system having a first, low-access memory holding a training database of a plurality of records having attributes, and a second, high-access memory smaller than the first memory. A subset of the training database is loaded into the second memory, and the computer operates on that subset to prepare an initial binary decision tree having nodes associated with confidence intervals defining ranges of the attributes expected in the final binary decision tree for the entire training database. The entire training database is then read from the first memory and compared against the confidence intervals of the binary decision tree to collect split point statistics related to the location of a split point within each confidence interval. Using the split point statistics, a split point is assigned to each node.
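The summary leaves the construction of the confidence intervals and the form of the split point statistics open. The sketch below is one plausible reading only, not the claimed method: it bootstraps the in-memory subset to bracket a node's split point, then makes a single sequential pass over the full training database, keeping exact information only for attribute values that fall inside the bracket (all function names are hypothetical; best_split is the sketch above):

```python
import random
from collections import Counter

def split_confidence_interval(sample, attribute, num_bootstraps=20):
    """Re-run the split search on bootstrap resamples of the in-memory
    subset and take the range of resulting split points as the node's
    confidence interval."""
    points = []
    for _ in range(num_bootstraps):
        resample = [random.choice(sample) for _ in sample]
        point, _ = best_split(resample, attribute)
        points.append(point)
    return min(points), max(points)

def scan_split_statistics(full_database, attribute, interval):
    """One sequential read of the entire training database from the
    first (low-access) memory, collecting split point statistics:
    class counts on either side of the interval plus the exact values
    falling inside it."""
    low, high = interval
    below, above, inside = Counter(), Counter(), []
    for attrs, label in full_database:
        value = attrs[attribute]
        if value < low:
            below[label] += 1   # satisfies "<= split" for any split in range
        elif value > high:
            above[label] += 1   # fails "<= split" for any split in range
        else:
            inside.append((value, label))
    return below, above, inside

def choose_final_split(below, above, inside):
    """Assign the final split point: try each in-interval value as the
    split and compute the exact weighted Gini impurity from the
    collected statistics, with no further database reads."""
    def gini_of(counts):
        n = sum(counts.values())
        return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

    total = sum(below.values()) + sum(above.values()) + len(inside)
    best_point, best_score = None, float("inf")
    for candidate, _ in sorted(inside):
        # Records below the interval satisfy "<= split" for every
        # candidate; records above it never do (per the text's
        # right/left convention).
        right = below + Counter(lbl for v, lbl in inside if v <= candidate)
        left = above + Counter(lbl for v, lbl in inside if v > candidate)
        score = (sum(right.values()) * gini_of(right) +
                 sum(left.values()) * gini_of(left)) / total
        if score < best_score:
            best_point, best_score = candidate, score
    return best_point
```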
Thus it is one object of the invention to speed the construction or updating of binary decision trees from large training databases. By using a subset of the training database to develop an initial binary decision tree, access to the first memory is substantially reduced.
Inventors: Ganti, Venkatesh; Gehrke, Johannes E.; Ramakrishnan, Raghu
Examiners: Mizrahi, Diane D. (Department: 2175); Mofiz, Apu M.
Attorney: Quarles & Brady LLP
Assignee: Wisconsin Alumni Research Foundation