Optimal dissimilarity method for choosing distinctive items...

Patent number: 6535819
Type: Reexamination Certificate (active)
Filed: 1998-04-13
Issued: 2003-03-18
Examiner: Brusca, John S. (Department: 1631)
Assignee: Tripos, Inc.
Attorney: Weinberger, Laurence
Classification: Data processing: measuring, calibrating, or testing – Measurement system in a specific environment – Biological or biochemical
US Class: C435S006120
BACKGROUND OF THE INVENTION
1. Field of the Invention
A method is presented which identifies distinctive items of information from a larger body of information on the basis of similarities or dissimilarities among the items. More specifically, the method achieves a significant increase in speed as well as the ability to balance the representativeness and diversity among the identified items by applying selection criteria to randomly chosen subsamples of all the information.
2. Description of Background Art
Most disciplines from economics to chemistry are benefiting from the ability of the modern computer to store and retrieve vast amounts of data. While the chore of actually handling the data has been reduced, the ability to gather, store, and operate with the enormous amount of data has, in itself, created new problems. In particular, in some cases the amount of data has become so vast that it is difficult to comprehend the range of data, much less to understand and derive meaningful information from it. To address this problem, attempts have been made in the prior art to find ways to meaningfully group or abstract some structure from the information.
For instance, in many situations it is useful to use a stratified sampling procedure for getting aggregate information about a population within which different groups cost different amounts to survey, or whose responses have different meanings to the questioner, etc. However, while the stratified sampling approach is very efficient, to use it one must be able to quickly get information about each cluster within the population, and be able to select representative people to poll. The method of this invention permits the rapid evaluation of the demographic profiles in such a situation to see how many people in a random sample are “closest” to each selectee. Small but informationally profitable target groups can then be surveyed in this way.
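To make the “closest to each selectee” tally concrete, here is a minimal Python sketch. It assumes demographic profiles are encoded as numeric vectors and takes “closest” to mean smallest Euclidean distance; those choices, and all of the names below, are illustrative assumptions rather than anything the disclosure specifies.

    import math
    from collections import Counter

    def nearest_selectee_counts(sample, selectees):
        """Count how many sample profiles fall nearest to each selectee.

        Hypothetical sketch: profiles are numeric vectors, and "closest"
        is taken to mean smallest Euclidean distance.
        """
        counts = Counter()
        for person in sample:
            nearest = min(range(len(selectees)),
                          key=lambda i: math.dist(person, selectees[i]))
            counts[nearest] += 1
        return counts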
The approach of this invention could also aid in the design of clinical trials. It often happens that drugs tested in a small random sample fail later because of severe adverse reactions in a small subpopulation. Using the method of this invention, a selection based on medical and family history could produce a more useful early phase test population. The following discussion of the method of this invention is presented in terms useful to medicinal chemists who are concerned with identifying subpopulations of chemical compounds. However, as can be seen from the brief examples above, the method of the invention is general and can equally well be applied to other fields by those skilled in the art. The generality of the method is readily appreciated if for the term “compound” used in this disclosure, the term “data element” is substituted. Brief examples of other specific applications of the methodology of the invention are set out at the end of the disclosure.
The advent of combinatorial chemistry and high-throughput screening has made the ability to identify “good” subsets in large libraries of compounds very important, whether the libraries in question have actually been synthesized or exist as assemblies of virtual molecular representations in a computer. One kind of good subset is one which contains members which represent the chemical diversity inherent in the entire library while at the same time not containing more members than are necessary to sample that diversity. This desire for a good subset is driven by the fact that these libraries can contain an enormous number of compounds which, if every single compound were tested, would be extremely costly in terms of money, time, and resources to individually evaluate. Essentially, what is desired is a subset of a size which is reasonable to test. Clearly, good subsets can be generated based upon other criteria as well, such as ease of synthesis or lowest cost. Traditionally such subsets have been created by expert systems—i.e., by having a medicinal or pesticide chemist select compounds manually based on a series of 2D structures. This approach is labor-intensive and is dependent on the expert used. Moreover, it is neither routinely practical nor very good for more than 300-1000 compounds, and then only when the library in question includes one or more homologous series. In the prior art, currently available alternative approaches for selection include maximum dissimilarity selection, minimum dissimilarity selection, and hierarchical clustering, among others [1]. Each of the available methods can be effective, but each has some intrinsic limitations.
Maximum Dissimilarity: The methods currently most often used for selecting compounds focus on maximizing the diversity of the selected subset with respect to the set as a whole using a descriptor (metric) which characterizes the members of the set and an associated (dis)similarity measure [1-3]. The basic approach is straightforward, and utilizes as parameters a minimum acceptable dissimilarity (redundancy) threshold R and a maximum selected subset size M_max. The approach is essentially as follows (a code sketch appears after the steps):
1. Select a compound at random from the dataset of interest, add it to the selection subset, and create a pool of candidate compounds out of the remainder of the dataset.
2. Examine the pool of candidates, and, using the characterizing measure (metric), identify the candidate which is most dissimilar to those which have already been selected.
3. Determine whether the dissimilarity of the most dissimilar candidate is less than R (redundancy test). If it is less than R, stop. If it is not less than R, add that candidate to the selection set and remove it from the pool of candidates.
4. If the compound to be selected in this step is the third compound being selected, after its selection return the first two selections to the pool of candidate compounds. (The first selection was chosen randomly, and the second selection is strongly biased by the first. Transferring them back into the candidate pool reduces the effect of the initial random selection.)
5. If the desired subset size M_max has been reached, stop.
6. If there are no more candidates in the pool, stop. If there are more candidates in the pool, go back to step 2.
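As an illustration of steps 1-6, the following is a minimal Python sketch of the greedy loop described above. The descriptor format, the dissimilarity function, and every identifier are assumptions made for the example; the disclosure itself does not fix them.

    import random

    def max_dissimilarity_select(pool, dissim, r_threshold, m_max):
        """Greedy maximum-dissimilarity selection per steps 1-6 above.

        pool        -- the dataset of interest (e.g., descriptor vectors)
        dissim      -- function returning the dissimilarity of two items
        r_threshold -- minimum acceptable dissimilarity threshold R
        m_max       -- maximum selected subset size M_max
        """
        candidates = list(pool)
        # Step 1: start from a randomly chosen compound.
        selected = [candidates.pop(random.randrange(len(candidates)))]
        n_picked = 1  # total selections made so far; drives the step-4 swap

        def dissim_to_selected(c):
            # A candidate's dissimilarity to the selection set is its
            # distance to the *nearest* already-selected item.
            return min(dissim(c, s) for s in selected)

        while candidates and len(selected) < m_max:       # steps 5 and 6
            best = max(candidates, key=dissim_to_selected)  # step 2
            if dissim_to_selected(best) < r_threshold:      # step 3: redundancy test
                break
            selected.append(best)
            candidates.remove(best)
            n_picked += 1
            if n_picked == 3:
                # Step 4: return the first two (biased) picks to the pool.
                candidates.extend(selected[:2])
                del selected[:2]
        return selected

With dissim bound to, say, a Euclidean distance or one minus a Tanimoto coefficient over descriptor vectors, the sketch realizes the MaxMin flavor of diversity selection: each new pick is the candidate farthest from its nearest already-selected neighbor.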
A related method developed by Agrafiotis works by comparing evolving subsets to maximize the diversity across the entire set [4]. Maximally diverse subsets are, by definition, biased towards inclusion of outliers; i.e., those candidates most dissimilar from the group as a whole. In some situations this is a very useful property, but medicinal chemists tend to avoid outliers in making their own selections because they may not “look like” drugs. In some cases, outliers in corporate databases are outliers for good reason—difficulty of synthesis or toxicity, for example—which reduces their value as potential leads. Moreover, a maximally diverse subset may not be adequately representative of the biochemical diversity in a dataset.
One justification in drug research for maximizing diversity is based on experimental design considerations commonly employed for analyzing quantitative structure/activity relationships (QSARs) [5], where outliers are important because they have the greatest weight. The libraries from which subsets are to be selected are usually much more diverse and much larger in size than those used for QSAR, however. In such a situation, outliers lose their statistical leverage because it may no longer be possible to use the QSAR approach to adequately approximate biochemical responses as linear or quadratic functions of the descriptors (metrics) being used.
Minimum Dissimilarity: Recently, Robert Pearlman et al. introduced an alternative approach to compound selection (the “elimination method” in DiverseSolutions [6]) which can be characterized as minimum dissimilarity selection. The approach takes the same two parameters as maximum dissimilarity selection—a minimum dissimilarity threshold R and M_max, the maximum number of compounds to select—but applies them differently as follows:
1. Select a compound at random from the dataset of interest, add it to the selection set, and create a pool of candidate compounds out of the remainder of the dataset.
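The source text breaks off after step 1 of this listing. Purely as a hedged illustration of how elimination-style selection is commonly understood to proceed (an assumption about the general technique, not the patent's own listing), each pick might remove its near neighbors from the candidate pool before the next pick is made:

    import random

    def min_dissimilarity_select(pool, dissim, r_threshold, m_max):
        """Hypothetical sketch of elimination-style selection.

        Each randomly drawn pick removes ("eliminates") every candidate
        within r_threshold of it, so later picks come only from regions
        of the dataset not yet covered. Not the patent's own listing.
        """
        candidates = list(pool)
        selected = []
        while candidates and len(selected) < m_max:
            pick = candidates.pop(random.randrange(len(candidates)))
            selected.append(pick)
            # Eliminate everything too similar to the pick.
            candidates = [c for c in candidates
                          if dissim(pick, c) >= r_threshold]
        return selected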