Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-04-10
2003-03-11
Corrielus, Jean M. (Department: 2172)
C707S793000, C707S793000
active
06532467
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is directed toward the field of data mining. More specifically, the invention provides a method of selecting particular variables from a large data set containing a plurality of variables to be used as nodes in a binary decision tree. The invention is particularly useful with large data sets in which some variables are partly co-linear. An example of this type of large data set is genomic data taken from human chromosomes which can be used to associate genotypic data with disease status.
2. Description of the Related Art
Binary decision trees are known in the field of data mining. Generally, the decision tree utilizes a search method to choose the best variable on which to base a particular decision making process. The best variable is chosen from a set of input variables in an input data set, where the outcome measure is known for each set of input variables. A hierarchy of decisions is built into the decision tree using a “yes/no” structure in order to arrive at an outcome from a set of known possible outcomes. At each node of the decision tree, the input data set is split into two subsets based on the value of the best variable at that point in the binary tree structure. The best variable is thus defined as the “node variable” because it is the variable that the decision tree branches from at that point in the path of the decision making process. The tree continues to branch to the next best available variable until some minimum statistical threshold is met, or until some maximum number of branches are formed. A subsequent set of input data values for each of the variables can then return a predicted outcome.
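The yes/no traversal described above can be illustrated with a minimal Python sketch. The `Node` class, the `predict` function, and the example variables and cut points are all hypothetical and are not taken from the patent; the sketch only shows how a record follows one path through a binary tree to an outcome.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Node:
    """One decision point: a node variable and a yes/no test on its value."""
    variable: str
    test: Callable[[object], bool]  # True follows the "yes" branch
    yes: Union["Node", str]         # subtree, or a final outcome label
    no: Union["Node", str]


def predict(tree, record):
    """Follow yes/no branches until a leaf (an outcome label) is reached."""
    current = tree
    while isinstance(current, Node):
        value = record[current.variable]
        current = current.yes if current.test(value) else current.no
    return current


# Hypothetical two-level tree; variable names and cut points are made up.
tree = Node(
    variable="marker_7",
    test=lambda genotype: genotype == "aa",
    yes="affected",
    no=Node(
        variable="age",
        test=lambda age: age >= 50,
        yes="affected",
        no="not affected",
    ),
)

print(predict(tree, {"marker_7": "Aa", "age": 62}))
```

Each record only touches the variables on its own path, which is why a subject with genotype "aa" at the first node is classified without its age being read.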
Using a binary decision tree is particularly useful in the study of genomic mapping. In such a study, a binary tree is constructed to match genetic markers to a phenotypic trait. One such phenotypic trait could be disease status. In this example, the binary tree categorizes whether or not a subject is likely to have a particular disease by selecting a path through the tree based on the values of the specific markers that form the nodes of the tree. The input data set can then be categorized into one of the disease outcomes, either affected or not affected.
A known method for selecting the node variable that forms a node of the tree branch for the example genomic application is shown in FIG. 1. An input data set 10 includes a plurality of rows 12, each row defining a subject. Each column 14 describes a particular variable. The first variable is typically a patient identifier. Clinical variables 16, such as age and weight, are shown in columns 3 to N+1 where N is the number of clinical variables 16. Clinical variables 16 are variables that can generally be taken by an examiner or through a set of simple questions asked of the patient. In the columns after the clinical variables 16, a plurality of genomic markers (“marker variables”) 18, taken from the DNA of a cell of the patient, are recorded. In this example, twenty-five genetic markers 18 are recorded from each patient. The recording of the markers 18 requires utilizing at least one specialized instrument to take a sample and record the values of each of the twenty-five markers 18. The disease state 20 is the final column in the data set, and it is the outcome measure of the data set 10, i.e. whether the particular patient is affected or not. The disease state 20 is known for each subject in the input data set 10.
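As a rough illustration of this row-and-column layout, the input data set could be held in memory as below. The `SubjectRow` class, its field names, and the sample values are assumptions made for this sketch; the patent does not prescribe any particular representation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SubjectRow:
    """One row of the input data set: a single subject."""
    patient_id: str              # first column: patient identifier
    clinical: Dict[str, object]  # clinical variables, e.g. age, weight, sex
    markers: List[str]           # one genotype per marker: "AA", "Aa" or "aa"
    affected: bool               # last column: known disease state


def marker_column(rows: List[SubjectRow], index: int) -> List[str]:
    """Return the genotypes of one marker across all subjects."""
    return [row.markers[index] for row in rows]


# Hypothetical subjects with made-up values, shortened to three markers each.
rows = [
    SubjectRow("P001", {"age": 54, "sex": "F"}, ["AA", "Aa", "aa"], True),
    SubjectRow("P002", {"age": 37, "sex": "M"}, ["Aa", "Aa", "AA"], False),
]

print(marker_column(rows, 2))
```

Pulling out one column at a time matches how the method works: each variable is scored against the disease state across all subjects.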
For each variable (clinical and marker), the values are binned into two groups. For instance, the clinical variable “sex” is binned into a male group and a female group. Other variables, such as the clinical variable “age” are considered interval variables. An interval variable is a variable that has a continuous distribution over a particular range. The interval variable is initially separated into a user-defined number of bins. These bins are then grouped to form two bins. For example, the clinical variable age might first be reduced to 10 levels of 10 years each. The 10 levels will be grouped into 2 bins, based on the results of a statistical test described below. The process of reducing the variable to two bins will first measure the first level against the second through the tenth levels. The process continues by measuring the first and second levels against the third through the tenth, until eventually the first nine levels are measured against the tenth level. The best statistical result will define the delineation point for the variable.
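The cut-point search for an interval variable can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, levels are numbered from 0, and the statistic is the plain Pearson chi-squared for a 2x2 table without continuity correction, which may differ in detail from the test the patent uses.

```python
from typing import Sequence, Tuple


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared for the table [[a, b], [c, d]], no correction."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def best_interval_split(
    levels: Sequence[int], affected: Sequence[bool], n_levels: int
) -> Tuple[int, float]:
    """Try every cut 'levels 0..cut' vs 'levels cut+1..n_levels-1'.

    Returns (cut, chi_squared) for the cut with the largest statistic.
    """
    best_cut, best_stat = 0, -1.0
    for cut in range(n_levels - 1):
        a = b = c = d = 0
        for level, is_affected in zip(levels, affected):
            low = level <= cut
            if low and is_affected:
                a += 1
            elif low:
                b += 1
            elif is_affected:
                c += 1
            else:
                d += 1
        stat = chi_squared_2x2(a, b, c, d)
        if stat > best_stat:
            best_cut, best_stat = cut, stat
    return best_cut, best_stat


# Hypothetical data: four levels, with all affected subjects in levels 2 and 3.
levels = [0, 0, 1, 1, 2, 2, 3, 3]
affected = [False, False, False, False, True, True, True, True]
print(best_interval_split(levels, affected, n_levels=4))
```

With 10 levels the loop evaluates nine candidate cuts, matching the sequence in the text from "first level against the second through the tenth" to "first nine levels against the tenth".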
The marker variables 18 are categorized by a bi-allelic genotype. Generally, these genotypes are referred to as AA, Aa, or aa. AA is the homozygote genotype for allele A, Aa is the heterozygous genotype, and aa is the homozygote genotype for allele a. Since three bi-allelic genotypes exist, the two bins are separated 30 into a pair of two genotypes and a single genotype for each marker 18. This binning is accomplished by a similar statistical step as the binning of the clinical variables. Once the binning is completed, a statistical measure of correlation is calculated for each marker. An example of such a statistical calculation is the chi squared statistic as referenced in “Principles and Procedures of Statistics a Biometrical Approach”, pages 502-526, which is incorporated by reference herein. A plot 40 of one set of the chi-squared statistic is shown in FIG. 1. A large chi-squared statistic suggests a marker that is highly associated with the disease state. The most highly associated marker is selected for the first node in the binary tree by selecting the largest chi squared statistic.
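The genotype binning for one marker can be sketched by trying each single genotype against the other two and keeping the grouping with the largest chi-squared statistic. The function names and sample data below are hypothetical, and the statistic is computed from expected counts in the usual Pearson form; the cited textbook pages may describe variants that are not reproduced here.

```python
from collections import Counter
from typing import List, Sequence, Tuple

GENOTYPES = ("AA", "Aa", "aa")


def pearson_chi_squared(table: List[List[int]]) -> float:
    """Sum of (observed - expected)^2 / expected over a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        return 0.0
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def best_genotype_split(
    genotypes: Sequence[str], affected: Sequence[bool]
) -> Tuple[str, float]:
    """Return (single_genotype, chi_squared) for the best 1-vs-2 grouping."""
    counts = Counter(zip(genotypes, affected))
    best = (GENOTYPES[0], -1.0)
    for single in GENOTYPES:
        others = [g for g in GENOTYPES if g != single]
        table = [
            [counts[(single, True)], counts[(single, False)]],
            [
                sum(counts[(g, True)] for g in others),
                sum(counts[(g, False)] for g in others),
            ],
        ]
        stat = pearson_chi_squared(table)
        if stat > best[1]:
            best = (single, stat)
    return best


# Hypothetical marker in which only genotype "aa" occurs in affected subjects.
genotypes = ["AA"] * 4 + ["Aa"] * 4 + ["aa"] * 4
affected = [False] * 8 + [True] * 4
print(best_genotype_split(genotypes, affected))
```

Running the same function over every marker and taking the largest returned statistic corresponds to choosing the most highly associated marker for the first node.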
More specifically, the steps of building a binary decision tree for analyzing this type of data set are shown in FIGS. 2 and 3. FIG. 2 shows the method of building the decision tree. FIGS. 3A and B show the steps of creating the two bins for each variable. FIG. 3A shows the steps of creating the two bins for an interval variable, and FIG. 3B shows the steps of forming the two bins for variables other than interval variables.
Turning now to FIG. 2, the input data set 10 is provided to the method in step 50. The method is generally implemented as a software program operating on a general purpose computer system. At step 52, the user enters a number of algorithmic parameters, such as the number of passes the user wishes the tree to branch to, a minimum value for the least significant chi square statistic, and the number of inputs. An input counter, “i”, and a maximum value, “MAXSOFAR”, are initialized at step 52. The first variable is then retrieved from the input data set for all subjects. Step 54 determines if the first variable is an interval variable. If the first variable is an interval variable, then it is passed to step 56 where the steps of FIG. 3A return a TEST value of the best chi square statistic from the two bin structure of the particular variable. If, however, the first variable is not an interval variable, then it is passed to the steps of FIG. 3B in step 58, which also returns a TEST value indicating the best chi square statistic for the particular variable.
Step 60 determines if the TEST value from step 56 or step 58 is greater than the MAXSOFAR value, i.e., is the chi-squared statistic for the current variable larger than the chi-squared values for all the previously analyzed variables. If the TEST value is greater, then the TEST value is stored 62 as the MAXSOFAR value and the input counter is updated 64. If the TEST value is not larger than MAXSOFAR, then the input counter is updated 64 without storing the TEST result. Step 66 determines if the input counter (i) is less than the number of input variables in the data set. If the input counter (i) is less than the number of inputs, control returns to step 54 using the next variable for the determining step 54. Once the input counter (i) is equal to the number of input variables, step 68 determines if MAXSOFAR is less than a user-defined paramet
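The scan over variables in steps 52 through 66 of FIG. 2 can be sketched roughly as follows. This is an illustrative sketch, not the patented implementation: the function and parameter names are hypothetical, the two scoring callables stand in for the procedures of FIGS. 3A and 3B, the starting value of 0.0 for MAXSOFAR is an assumption, and remembering the name of the best variable alongside MAXSOFAR is an addition made so the result is usable. The comparison at step 68 is not modeled because the text describing it is cut off.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Scorer = Callable[[Sequence[object], Sequence[bool]], float]


def select_node_variable(
    columns: Dict[str, Sequence[object]],
    outcome: Sequence[bool],
    is_interval: Callable[[str], bool],
    score_interval: Scorer,  # stands in for FIG. 3A
    score_other: Scorer,     # stands in for FIG. 3B
) -> Tuple[Optional[str], float]:
    """Return the variable with the largest TEST value and that value."""
    names = list(columns)
    i = 0                    # step 52: input counter
    max_so_far = 0.0         # step 52: MAXSOFAR (assumed initial value)
    best_variable = None     # assumption: also track which variable won
    while i < len(names):    # step 66: more input variables left?
        values = columns[names[i]]
        if is_interval(names[i]):                   # step 54
            test = score_interval(values, outcome)  # step 56
        else:
            test = score_other(values, outcome)     # step 58
        if test > max_so_far:                       # step 60
            max_so_far = test                       # step 62
            best_variable = names[i]
        i += 1                                      # step 64
    return best_variable, max_so_far


# Stub scorers with made-up statistics, only to exercise the control flow.
columns = {"age": [41, 63], "sex": [0.4], "marker_1": [9.1]}
result = select_node_variable(
    columns,
    outcome=[True, False],
    is_interval=lambda name: name == "age",
    score_interval=lambda values, outcome: 2.5,
    score_other=lambda values, outcome: values[0],
)
print(result)
```

Passing the scorers in as parameters keeps the scan independent of how each variable is binned, which mirrors the way FIG. 2 delegates to FIG. 3A or FIG. 3B.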
Brocklebank John C.
Czika Wendy
Weir Bruce S.
Corrielus Jean M.
Jones Day Reavis & Pogue
Nguyen Tam
SAS Institute Inc.