Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1999-12-28
2002-01-22
Black, Thomas (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06341284
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to data mining, and, more specifically, to software tools that analyze similarity amongst descriptors of items contained in a database.
2. Description of the Related Art
In a large number of molecular biology tasks descriptors are used to characterize and quantify properties for the item or items of interest. For example in the simplest form of gene expression analysis, a gene's behavior is observed over time and its level of expression tracked as environmental conditions change; many and sometimes all of the genes of a given organism are tracked over time. In the more traditional drug discovery, descriptors (values that characterize structure, charge or other property) for atomic complexes, such as molecules or portions of molecules, form the basis for the selection of potential drug candidates.
Although these are only two examples but they highlight the nature of the problem. Each item of interest (e.g. gene, gene fragment, protein fragment, DNA fragment, atomic complex etc.) is characterized by a set of observable properties (e.g. expression level as a function of a changing environment, shape, mass, inertia, the steric and electrostatics fields it generates, interaction affinity etc.) that are inherently related to its identity or composition. Over the years, various representation schemes have been proposed that attempt to capture and represent some of these observable properties by implicit or explicit calculations carried out on one or more components of the item of interest. As a rule of thumb, there exist no universal representation schemes. Moreover, a given representation scheme typically provides an approximate description of the observable property (-ies) it attempts to capture. For example, and in the general case, the electrostatic field outside of an atomic complex can be described completely with the help of an infinite number of moments (i.e., point charge, dipole, quadrupole, octapole, hexadecapole, and so on); however, low order approximations are typically used that provide a less accurate approximation, but easy to compute.
Given a collection of items of interest (e.g. genes of an organism, gene fragments, protein fragments, DNA fragments, atomic complexes that comprise molecules or molecule fragments) and a representation scheme, the discovery process entails an educated selection among the descriptor set of the items. This selection invariably involves comparisons of and is dependent and based upon the representations that have been chosen for the items under consideration. In the discussion that follows, we will be drawing our examples from the specific problems of gene expression analysis and rational drug design but the described method and system are applicable to any situation where one is given a database of items to be studied with each of the items being represented by a set of descriptors and what is sought is the discovery of collections of items that have values which agree across one or more descriptors.
Software tools have been developed to aid in the discovery process. These tools generally provide the user with the ability to specify a particular descriptor, and then identify the number of items in the set being studied that have the same quantized value for a particular descriptor. In addition, such tools typically provide the user with the ability to specify two or more descriptors, and then identify the number of items in the set being studied that have the same quantized value for the specified descriptors. Alternatively, one can think of the input that such tools process as a two dimensional array of values (the number of rows is equal to the number of items in the database (e.g. genes, gene fragments, protein fragments, DNA fragments, atomic complexes, etc.) while the number of columns is equal to the number of descriptors used by the representation scheme): after one or more columns have been identified, the tool will determine and report to the user those of the database items (e.g. genes, atomic complexes etc.) whose values agree across the entire set of identified columns. Occasionally, the user is also required to specify a minimum support, i.e. the minimum number of atomic complexes that must share the same collection of values across the specified columns. An example of such a tool in the context of rational drug design is DIVA, distributed by Oxford Molecular, additional information regarding this tool may be found at www.oxmol.com. The two-dimensional table relating database items with their attributes is a specific manifestation of what has been called the binary relationship in the context of Galois Concept lattices. A binary relationship consists of explicitly stating the attributes(descriptors) of a group of objects (genes, molecules, etc.). The present work is a unique tool in providing in a most efficient manner the entire hierarchy of descriptor correspondences and the support for each such correspondence. Namely, how many objects (genes, gene fragments, protein fragments, DNA fragments, atomic complexes, etc.) agree with respect to every conceivable subset of descriptor correspondences.
To stress the generality of the problem as we have just defined it, we wish to note that the following database comprising 5 items that are represented using a set of 6 descriptors
s
1
=
rent
R&B
romance
BSc
$30K
lexus
s
2
=
rent
rock
fiction
MSc
$50K
bmw
s
3
=
own
jazz
science-fiction
PhD
$70K
chevy
s
4
=
own
jazz
romance
PhD
$70K
dodge
s
5
=
rent
R&B
fiction
MSc
$30K
ford
can be equivalently represented
s
1
=
1
3
6
9
12
15
s
2
=
1
4
7
10
13
16
s
3
=
2
5
8
11
14
17
s
4
=
2
5
6
11
14
18
s
5
=
1
3
7
10
14
19
s
1
= e
1
e
3
e
6
e
9
e
12
e
15
s
2
= e
1
e
4
e
7
e
10
e
13
e
16
s
3
= e
2
e
5
e
8
e
11
e
14
e
17
s
4
= e
2
e
5
e
6
e
11
e
14
e
18
s
5
= e
1
e
3
e
7
e
10
e
14
e
19
where {e
1
, e
2
, e
3
, e
4
, e
5
, e
6
, e
7
, e
8
, e
9
, e
10
, e
11
, e
12
, e
13
, e
14
, e
15
, e
16
, e
17
, e
18
, e
19
} are user-defined symbols, without any loss of information. Correspondences discovered on input of this type are typically thought of as associations and the problem of discovering these corresponsdences is also referred to as association discovery.
The problem described above is computationally very demanding and although it is tractable for small collections of database items, its computational requirements become prohibitive very quickly independent on whether the user drives the discovery process or the computer performs a brute force exhaustive enumeration. Let us first examine the case where the user drives the discovery. Every time the user identifies a collection of descriptors of interest the user has made a hypothesis about a potential relationship (i.e. association) among the descriptors. The computational tool will then process the input in order to verify whether the user's hypothesis is supported by the input set. Each hypothesis that is corroborated by the computational tool gives rise to a collection V of atomic complexes that is then examined in order to determine its quality: this can be achieved easily by examining the observable properties of each of the V reported complexes to determine an agreement; if successful, a potential causal relationship is formed that leads from a specific choice of values for the identified set of descriptors to the intersection of observable properties of the respective set of reported database items; the descriptors in the specified set that led to a verified hypothesis are then thought of as being “associated” or “correlated.” These formed causal relationships are further analyzed for validity using other means of investigation.
If a hypothesis proves to lead to no such potentially promising results, the user identifies a new collection of descriptors and lets the computational tool repeat the processing. In the absence of any tas
Floratos Aris
Rigoutsos Isidore
Silverman B. David
Black Thomas
Bruzzone Lauren C.
Jung David
LandOfFree
Method and apparatus for exhaustive enumeration of... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and apparatus for exhaustive enumeration of..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus for exhaustive enumeration of... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2850337