Reexamination Certificate
2000-06-21
2003-05-27
Hoff, Marc S. (Department: 2857)
Data processing: measuring, calibrating, or testing
Measurement system
Statistical measurement
C930S310000
active
06571199
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to database searches and, more particularly, to methods and apparatus for detecting sequence homology between a query sequence and sequences in a database in association with a given application, e.g., genetic research.
BACKGROUND OF THE INVENTION
In the area of genetic research, the first step following the sequencing of a new gene is an effort to identify that gene's function. The most popular and straightforward methods to achieve that goal exploit the following biological fact: if two peptide stretches exhibit sufficient similarity at the sequence level (i.e., one can be obtained from the other by a small number of insertions, deletions and/or amino acid mutations), then they are probably biologically related. Examples of such an approach are described in A. M. Lesk, “Computational Molecular Biology,” Encyclopedia of Computer Science and Technology; A. Kent and J. G. Williams editors, 31:101-165, Marcel Dekker, New York, 1994; R. F. Doolittle, “What we have learned and will learn from sequence databases,” Computers and DNA, G. Bell and T. Marr editors, 21-31, Addison-Wesley, 1990; C. Caskey, R. Eisenberg, E. Lander, and J. Straus, “Hugo statement on patenting of DNA,” Genome Digest, 2:6-9, 1995; and W. R. Pearson, “Protein sequence comparison and protein evolution,” Tutorial of Intelligent Systems in Molecular Biology, Cambridge, England, 1995.
Within this framework, the question of getting clues about the function of a new gene becomes one of identifying homologies in strings of amino acids. Generally, a homology refers to a similarity, likeness, or relation between two or more sequences or strings. Thus, one is given a query sequence Q (e.g., the new gene) and a set D of well characterized proteins and is looking for all regions of Q which are similar to regions of sequences in D.
The first approaches used for realizing this task were based on a technique known as dynamic programming. This approach is described in S. B. Needleman and C. D. Wunsch, “A General Method Applicable To The Search For Similarities In The Amino Acid Sequence Of Two Proteins,” Journal Of Molecular Biology, 48:443-453, 1970; and T. F. Smith and M. S. Waterman, “Identification Of Common Molecular Subsequences,” Journal Of Molecular Biology, 147:195-197, 1981. Unfortunately, the cost of this method grows with the product of the lengths of the two sequences being compared, which quickly renders it impractical, especially when searching large databases, as is the norm today. Generally, the problem is that dynamic programming variants spend a good part of their time computing homologies which eventually turn out to be unimportant.
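As an illustration of the dynamic programming approach, the following is a minimal sketch of the local alignment recurrence in the style of Smith and Waterman. The function name and the match, mismatch and gap values are assumptions chosen for readability; a practical implementation would use a mutation matrix and a more elaborate gap model.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the text.
    """
    # prev holds the previous row of the dynamic programming table.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the
            # alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGTT", "ACTT"))
```

Every cell of the table is filled, so the work is proportional to len(a) * len(b) for each query and database sequence pair, which is the cost that motivates the faster heuristics.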
In an effort to work around this issue, a number of algorithms have been proposed which focus on discovering only extensive local similarities. The best known among these algorithms are referred to as FASTA and BLAST. The FASTA algorithm is described in W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence comparison,” Proc. Natl. Acad. Sci., 85:2444-2448, 1988; and D. J. Lipman and W. R. Pearson, “Rapid and sensitive protein similarity searches,” Science, 227:1435-1441, 1985. The BLAST algorithm is described in S. Altschul, W. Gish, W. Miller, E. W. Myers, and D. Lipman, “A basic local alignment search tool,” J. Mol. Biology, 215:403-410, 1990. In most cases, increased performance is achieved by first looking for ungapped homologies, i.e., similarities due exclusively to mutations and not insertions or deletions. The rationale behind this approach is that in any substantial gapped homology between two peptide strings, chances are that there exists at least one pair of substrings whose match contains no gaps. Locating these substrings (the ungapped homology) can then be used as the first step towards obtaining the entire (gapped) homology.
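The idea of locating ungapped homologies first can be sketched as a seed-and-extend procedure: find short words shared by the two strings, then extend each hit along its diagonal without gaps. This is a simplified illustration, not the FASTA or BLAST algorithm itself; the function names, the exact-word seeding, the scoring values and the drop-off threshold x_drop are all assumptions.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (i, j) pairs where query[i:i+k] == subject[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def _extend(query, subject, i, j, step, match, mismatch, x_drop):
    """Walk one direction along the diagonal; return (best gain, residues kept)."""
    score = best = 0
    kept = steps = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        steps += 1
        if score > best:
            best, kept = score, steps
        elif best - score > x_drop:
            break  # score fell too far below the best seen so far
        i += step
        j += step
    return best, kept


def extend_ungapped(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend the seed at (i, j) in both directions with no gaps.

    Returns (score, q_start, q_end) so that query[q_start:q_end] is the
    extended region of the query.
    """
    right, r_len = _extend(query, subject, i + k, j + k, 1, match, mismatch, x_drop)
    left, l_len = _extend(query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
    return k * match + left + right, i - l_len, i + k + r_len


if __name__ == "__main__":
    q, s = "AAKTAYIGG", "CCKTAYICC"
    for i, j in find_seeds(q, s):
        print((i, j), extend_ungapped(q, s, i, j))
```

Only the diagonals that contain a seed are examined, which is where the saving over filling a full dynamic programming table comes from.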
Identifying the similar regions between the query and the database sequences is, however, only the first part (the computationally most demanding) of the process. The second part (the one that is of interest to biologists) is evaluating these similarities, i.e., deciding if they are substantial enough to sustain the inferred relation (functional, structural or otherwise) between the query and the corresponding database sequence(s). Such evaluations are usually performed by combining biological information and statistical reasoning. Typically, similarity is quantified as a score computed for every pair of related regions. Computation of this score involves the use of gap costs (for gapped alignments) and of appropriate mutation matrices giving the evolutionary probability of any given amino acid changing into another. Examples of these matrices are the PAM matrix (see M. O. Dayhoff, R. M. Schwartz and B. C. Orcutt, “A model of evolutionary change in proteins,” Atlas of Protein Sequence and Structure, 5:345-352, 1978) and the BLOSUM matrix (see S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices from protein blocks,” Proc. Natl. Acad. Sci., 89:10915-10919, 1992). Then, the statistical significance of this score is evaluated by computing the probability (under some statistical model) that such a score could arise purely by chance, e.g., see S. Karlin, A. Dembo and T. Kawabata, “Statistical composition of high-scoring segments from molecular sequences,” The Annals of Statistics, 18:571-581, 1990; and S. Karlin and S. Altschul, “Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes,” Proc. Natl. Acad. Sci., 87:2264-2268, 1990. Depending on the statistical model used, this probability can depend on a number of factors such as the length of the query sequence, the size of the underlying database, etc.
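One widely used statistical model of this kind, due to Karlin and Altschul, estimates the number of ungapped local alignments scoring at least S that are expected by chance as E = K * m * n * exp(-lambda * S), where m is the query length and n the database size. The sketch below shows how the probability depends on those factors. The function names are hypothetical, and the default values of lam and k are placeholders only; in practice both constants are derived from the scoring matrix and the residue frequencies.

```python
import math


def expected_chance_hits(score, query_len, db_len, lam=0.3, k=0.1):
    """Expected number of chance alignments scoring at least `score`.

    lam and k are placeholder values (assumptions), not fitted constants.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def chance_probability(score, query_len, db_len, lam=0.3, k=0.1):
    """Probability of at least one chance alignment with that score or better."""
    e = expected_chance_hits(score, query_len, db_len, lam, k)
    # 1 - exp(-e), computed with expm1 for accuracy when e is very small.
    return -math.expm1(-e)


if __name__ == "__main__":
    for s in (30, 60, 90):
        print(s, chance_probability(s, query_len=300, db_len=10_000_000))
```

Under this model the same score becomes less significant as the database grows, because the expected number of chance hits scales linearly with its size.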
No matter what conventional statistical model one uses, however, there are always so-called “gray areas,” i.e., situations where a statistically unimportant score actually indicates a biologically important similarity. Unfortunate as this might be, it is also inescapable; there is, after all, a limit to how well a statistical model can approximate the biological reality.
An alternative to the inherent difficulty of attaching statistical importance to weak similarities is the use of biological knowledge in deducing sequence descriptors that model evolutionarily distant homologies. BLOCKS (see S. Henikoff and J. Henikoff, “Automatic Assembly of Protein Blocks for Database Searching,” Nucleic Acids Research, 19:6565-6572, 1991) is a system that employs pattern-induced profiles obtained over the protein classification defined in the PROSITE (see S. Henikoff and J. Henikoff, “Protein Family Classification Based on Searching a Database of Blocks,” Genomics, Vol. 19, pp. 97-107, 1994) database in order to functionally annotate new genes. The advantage here is that this classification is compiled by experts working with families of proteins known to be related. As a result, even weak similarities can be recognized and used in the annotation process. On the other hand, there is only so much knowledge about which proteins are indeed related and can consequently be represented by a pattern. Furthermore, there is always the danger that a family of proteins actually contains more members than are currently known. By excluding these other members from consideration, it is possible to get patterns that “over fit” the family, i.e., they are too strict to extrapolate to the unidentified family members.
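To make the notion of a family pattern concrete, the sketch below converts a pattern written in a small subset of the PROSITE pattern notation into a regular expression and scans a sequence with it. It is an illustration only: the helper names, the example pattern and the example sequence are made up, and only single residues, x wildcards, [..] and {..} residue sets and (n) or (n,m) repeat counts are handled.

```python
import re

# One pattern element: a core (x, a residue, [set] or {excluded set})
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_pattern(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(find_pattern("C-x(2,4)-C-x(3)-[LIVMFYWC]-{P}-H", "AACGGCAAALQHAA"))
```

A stricter pattern (fewer wildcards, narrower residue sets) matches fewer sequences, which is the over-fitting risk described above when a pattern is derived from an incomplete family.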
Therefore, it is evident that there exists a need for methods and apparatus for creating improved pattern dictionaries through unique dictionary formation techniques that permit improved sequence homology detection, as well as a need for methods and apparatus for sequence homology detection, itself, which are not limited to searching only annotated sequences.
SUMMARY OF THE INVENTION
The present invention provides solutions to the above and other needs by providing improved pattern dictionary formation techniques and improved sequence homology detection techniques, as will be described in greater detail below.
In a sequence homology detection a
Floratos Aris
Rigoutsos Isidore
August Casey P.
Hoff Marc S.
Raymond Edward
Ryan & Mason & Lewis, LLP