Data processing: measuring, calibrating, or testing – Measurement system in a specific environment – Biological or biochemical
Reexamination Certificate
1999-09-24
2002-12-10
Borin, Michael (Department: 1631)
active
06493637
TECHNICAL FIELD
The invention relates to methods, devices and systems for coincidence detection among a multitude of variables. In addition, the invention relates to applying coincidence detection methods to various fields, and to products derived from such application.
BACKGROUND ART
k-tuples of Correlated Attributes
The discovery of correlations among pairs or k-tuples of variables has applications in many areas of science, medicine, industry and commerce. For example, it is of great interest to physicians and public health professionals to know which lifestyle, dietary, and environmental factors correlate with each other and with particular diseases in a database of patient histories. It is potentially profitable for a trader in stocks or commodities to discover a set of financial instruments whose prices covary over time. Sales staff in a supermarket chain or mail-order distributor would be interested in knowing that consumers who buy product A also tend to buy products B and C, and this can be discovered in a database of sales records. Computational molecular biologists and drug discovery researchers would like to infer aspects of 3D molecular structure from correlations between distant sequence elements in aligned sets of RNA or protein sequences.
One formulation of the general problem, which encompasses many diverse applications and facilitates understanding of the principles described herein, is a matrix of discrete features in which the rows correspond to “objects” (such as individual patients, stock prices, consumers, or protein sequences) and the columns correspond to features, attributes, or variables (such as lifestyle factors, stocks, sales items, or amino acid residue positions).
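As a concrete sketch of this matrix formulation, a small patient-history dataset might be encoded as rows of discrete attribute values. All column names and values below are hypothetical, chosen only to illustrate the object-by-attribute layout:

```python
# A tiny hypothetical object-by-attribute matrix: rows are patients
# (objects), columns are discrete lifestyle/health variables.
COLUMNS = ["smoker", "exercise", "diet", "disease_x"]

DATA = [
    # smoker, exercise, diet,      disease_x
    ["yes",   "low",    "high_fat", "present"],
    ["no",    "high",   "low_fat",  "absent"],
    ["yes",   "low",    "high_fat", "present"],
    ["no",    "low",    "low_fat",  "absent"],
]

# M records (objects) and N variables (attributes), as in the text.
M = len(DATA)
N = len(COLUMNS)
print(M, N)  # 4 records, 4 attributes
```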
Mathematical methods for determining a measure of the type, degree, and statistical significance of correlation between any two, or even three or four, particular variables are widespread and well-understood. These methods include linear and nonlinear regression for continuous variables and contingency table analysis techniques for discrete variables. However, great difficulties arise when one tries to estimate correlation—or just estimate joint or conditional probabilities—over much larger sets of variables. This intractability has one main cause—there are too many joint attribute-value probability density terms—and this manifests itself in two serious problems: (1) computing and storing frequency counts over all terms, over the database, requires too much computation and memory; (2) there is usually an insufficient number of database records to support reliable probability estimates based on those frequency counts.
Let us consider some details. For M records (objects), N variables (attributes, fields), and supposing that each variable has the same set of |A| possible values, there are

C(N, k) = N! / ((N − k)! k!)

k-tuples of columns. Summing the number of k-tuples over k = 1, 2, . . . , N results in 2^N − 1 such tuples of all sizes. This exponential complexity has been a major obstacle standing in the way of higher-order probability estimation and correlation detection methodologies.
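The counting argument above can be checked directly; Python's `math.comb` computes the binomial coefficient N!/((N − k)! k!):

```python
from math import comb, factorial

N = 10  # number of column variables (illustrative)

# Number of k-tuples of columns for one particular k.
k = 3
assert comb(N, k) == factorial(N) // (factorial(N - k) * factorial(k))

# Summing over all k = 1, 2, ..., N yields 2^N - 1 tuples of all sizes.
total = sum(comb(N, k) for k in range(1, N + 1))
print(total)  # 1023, i.e. 2**10 - 1
```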
One natural way to think about this complexity is in terms of the power set of the set of column variables. This power set forms a mathematical lattice under the operation ⊂, a “tower” corresponding to a graph whose nodes are subsets of this set of column variables. (Note that if a set has N members, the power set has 2^N members.) From this viewpoint, two nodes representing subsets σ1 and σ2 are connected if and only if either σ1 ⊂ σ2 or σ2 ⊂ σ1. We say that σ2's node is above σ1's if σ1 ⊂ σ2. This gives a natural meaning to the term “higher-order”, as appearing higher up the tower. We call the bottom, the null-set node, the 0th tier; the single-column terms form the first tier, and so on.
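The tower structure can be sketched by enumerating the power set and grouping subsets by size; tier k then contains exactly C(N, k) nodes. A minimal illustration, with made-up column names:

```python
from itertools import combinations
from math import comb

columns = ["c1", "c2", "c3", "c4"]  # N = 4 illustrative column variables
N = len(columns)

# Tier k of the lattice: all size-k subsets of the column set.
tiers = {k: list(combinations(columns, k)) for k in range(N + 1)}

for k, nodes in tiers.items():
    assert len(nodes) == comb(N, k)  # tier k holds C(N, k) nodes

# sigma1's node is below sigma2's iff sigma1 is a proper subset of sigma2.
sigma1, sigma2 = {"c1"}, {"c1", "c3"}
print(sigma1 < sigma2)  # True: sigma2's node lies above sigma1's
```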
Continuing with the tower analogy, we note that each “floor” of this edifice contains C(N, k) “suites”, and each suite contains |A|^k “rooms”. In other words, the kth level of the lattice corresponds to C(N, k) different k-tuples of column variables, and associated with each k-tuple is an (|A| by |A| . . . by |A|) contingency table, each cell of which must store the counted frequency of a particular joint symbol (a_i1, a_i2, . . . , a_ik) were one to use a classical contingency table test for the correlation between those particular k columns. (See FIG. 1a.)
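For a particular k-tuple of columns, such a contingency table is simply a table of joint-symbol frequencies accumulated over the M records. A minimal sketch, with an illustrative data matrix and column indices:

```python
from collections import Counter

# An M x N data matrix over a small alphabet A = {a, b} (illustrative).
data = [
    ["a", "b", "a"],
    ["a", "b", "b"],
    ["b", "a", "a"],
    ["a", "b", "a"],
]

def contingency_table(data, cols):
    """Count the frequency of each joint symbol (a_i1, ..., a_ik)
    over the chosen k-tuple of column indices."""
    return Counter(tuple(row[c] for c in cols) for row in data)

table = contingency_table(data, (0, 1))
print(table[("a", "b")])  # 3 of the 4 records show 'a' in c0 and 'b' in c1
```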
For any k ∈ {1, 2, . . . , N} and any particular k-tuple of columns (c_j1, c_j2, . . . , c_jk), there are |A|^k possible joint values, and the estimation of Kullback divergence or another correlation function using the dataset is at least an Ω(Mk) or Ω(|A|^k) computation, depending upon the relative sizes of M, k and |A|.
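As an illustration of such an estimate for k = 2, the empirical Kullback divergence between the joint distribution of two columns and the product of their marginals (their mutual information) can be computed from one O(M) counting pass followed by a sum over at most |A|^2 table cells. The data below are illustrative:

```python
from collections import Counter
from math import log2

data = [
    ["a", "a"], ["a", "a"], ["b", "b"], ["b", "b"],
    ["a", "a"], ["b", "b"], ["a", "b"], ["b", "a"],
]
M = len(data)

# Frequency counts: one pass over the M records (the Omega(Mk) part).
joint = Counter((row[0], row[1]) for row in data)
m0 = Counter(row[0] for row in data)
m1 = Counter(row[1] for row in data)

# Kullback divergence D(joint || product of marginals), summed over
# the cells with nonzero joint frequency (at most |A|^2 of them).
kl = sum(
    (n / M) * log2((n / M) / ((m0[x] / M) * (m1[y] / M)))
    for (x, y), n in joint.items()
)
print(round(kl, 3))  # 0.189 bits: the two columns are correlated
```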
A comprehensive probabilistic model of the database must be able to specify probability estimates for

Σ_{k=1}^{N} C(N, k) |A|^k

terms. This means, for example in the computational molecular biology domain, that for a tiny heptapeptide sequence family, each sequence having a length of seven amino acid residues, there are 1,801,088,540 terms to specify. For an unrealistically small RNA of fifteen nucleotides in length, over the smaller RNA alphabet of four base symbols, there are 30,517,578,124 terms.
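These counts follow from the binomial-theorem closed form Σ_{k=1}^{N} C(N, k)|A|^k = (|A| + 1)^N − 1, and both figures in the text can be verified directly:

```python
from math import comb

def num_terms(N, A):
    """Total number of joint attribute-value probability terms over
    all k-tuples: sum over k = 1..N of C(N, k) * A**k."""
    return sum(comb(N, k) * A**k for k in range(1, N + 1))

# Closed form (A + 1)**N - 1, by the binomial theorem.
assert num_terms(7, 20) == 21**7 - 1      # heptapeptide, 20 amino acids
assert num_terms(15, 4) == 5**15 - 1      # 15-nt RNA, 4 base symbols

print(num_terms(7, 20))   # 1801088540
print(num_terms(15, 4))   # 30517578124
```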
Clearly the models can become intractably huge. What about the space of possible models through which a modelling/learning procedure must search? Consider a latent-variable model, which seeks to explain correlations between sets of observable variables by positing latent variables whose states influence the observables jointly. Since each model must specify a set of k-tuples of variables, and there are 2^(2^N) (i.e., 2 to the power 2^N) such sets, there are 2^(2^N) possible models in the worst-case search space.
Various methods for determining a measure of higher-order probabilities circumvent the combinatorial explosion through severe prior restrictions on the width k (see FIG. 3a), the locality (see FIG. 2a), the number or the degrees of correlation of the higher-order features sought, and the kinds of models entertained (see FIG. 4a).
Three Goals of Probability Estimation
It is useful, before discussing details of existing methods and of the current invention, to delineate three different possible goals of probability estimation in large datasets, each corresponding to a large body of research and current practice:
1. Estimation of the fully-specified, fully higher-order joint probability distribution: Estimate a probability density q that specifies q(a_i1@c_i1, a_i2@c_i2, . . . , a_ik@c_ik) for all k-tuples of attributes and possible values.
2. Hypothesis testing, for particular hypotheses concerning particular attributes and particular variables: For example, are the data consistent with the hypothesis that columns c_i1, c_i2, . . . , c_ik are independent?
3. Feature detection, or “data mining”: Detect the most suspicious coincidences, for example, joint attribute occurrences that are more probable than would be predicted from lower-order marginals. Related to this, find the most highly correlated k-tuples of columns.
It is the feature detection and data mining applications that are most relevant to the present invention. However, some of the most successful ways to estimate a full higher-order joint probability distribution of a database require specifying exactly those higher-order terms that represent high correlations among sets of k ≥ 2 variables and invoking maximum-entropy assumptions; therefore, the current invention is aimed at those applications as well.
Related Work
Various mathematical and computational methods have been proposed.
Miernicki Steeg Carol
Queen's University at Kingston
Sterne Kessler Goldstein & Fox PLLC
Wilkes Robert H.