Data processing: measuring – calibrating – or testing – Measurement system in a specific environment – Biological or biochemical
Reexamination Certificate
2000-03-20
2002-09-03
Zeman, Mary K. (Department: 1631)
Data processing: measuring, calibrating, or testing
Measurement system in a specific environment
Biological or biochemical
C702S020000, C700S030000, C707S793000, C704S010000, C704S243000, C435S006120
Reexamination Certificate
active
06446011
ABSTRACT:
FIELD OF THE INVENTION
The present invention generally relates to data sequence processing methodologies and, more particularly, to methods and apparatus for detecting basic repeating subsequences within the data sequences.
BACKGROUND OF THE INVENTION
All living organisms carry the information that allow them to live, in the form of long molecules called DNA (Deoxyribonucleic acid). The structure of these molecules can be idealized as a long chain of four types of chemical components (called nucleotides or bases): adenines, thymines, cytosines and guanines. It is customary to denote these nucleotides by the first letter of its denomination. Thus, we have four types of nucleotides A, T, C, and G. We can say that the set {A,T,C,G} is an alphabet (nothing less than the alphabet of life) in the sense that the order of these letters in the DNA is the genetic text that carries the information needed to build each of our proteins. Proteins are the actual molecules that support and connect the cells of living organisms, mediate communication with the environment and perform most of the chemical reactions needed for living. Proteins are also chain-like molecules which are built out of a set of twenty constituents called amino acids. These twenty amino acids can be conceptualized as an alphabet and the proteins as sentences in this alphabet. An elementary, yet deep, discussion of these alphabets can be found in B. Lewin, “Genes VI,” Oxford University Press, 1997. The set of all ordered DNA letters constituting the DNA of an organism is called its genome. The set of all sequences of ordered amino acids (the protein alphabet) of an organism is called its proteome.
Most genomes contain regions that consist of a basic sequence that repeats itself (albeit with slight variations), head to tail, a number of times. These arrangements are called tandem repeats (TR). Both the length of the basic sequence that forms the TR, hereafter called the period of the repeat, and the number of repetitions, hereafter called the copy number, vary from TR to TR within the same genome. The overall amount of repeated genomic material is different in different genomes, ranging from approximately 10% in mammalian genomes to 50% in some arthropods, see, e.g., B. Lewin, “Genes VI,” Oxford University Press, 1997.
Even though the functional origin of the repeated sequences is still a matter of debate,. see, e.g., B. Charlesworth, P. Sniegowski and W. Stephan, “The Evolutionary dynamics of repetitive DNA in Eukaryotes,” Nature, 371, 215-220, 1994 (the prevailing conjecture being that short repeats might have a structural function), it is clear that their very presence has important consequences during chromosome pairing, allowing for misalignments to occur, and thus causing repeat clusters to be highly polymorphic, see, e.g., B. Lewin, “Genes VI,” Oxford University Press, 1997. This polymorphism is used in the forensic technique known as DNA fingerprinting, see. e.g., K. Inman, and N. Rudin, “An Introduction to Forensic DNA Analysis,” CRC Press, Boca Raton, Fla., 1997. Longer repeats tend to have a very complex and hierarchical internal structure, whose details might give hints about DNA molecular mechanisms such as misaligned recombination and saltatory replication. For a more extensive discussion of different biological aspects of TR, see, e.g., B. Lewin, “Genes VI,” Oxford University Press, 1997; and S. B. Primrose, “Principles of Genome Analysis,” Blackwell Science, Cambridge, Mass., 1995.
Despite the fact that the discovery of these surprising repetitive structures took place about 25 years ago, a detailed characterization of the main parameters of the repeats, such as distribution of lengths, distribution of periods, phylogeny of its hierarchical structure, how often they belong to coding as opposed to non-coding regions, etc., is missing. The reasons for this are twofold. On the one hand, only recently enough data to address these questions became available. On the other, and in part due to the first reason, fast algorithms to interrogate the genomes came into existence only recently.
The first attempts to systematically find repeats in sequences dealt with exact repeats, see, e.g., R. M. Karp, R. E. Miller and A. L. Rosenberg, “Rapid identification of repeated patterns in strings, trees and arrays,” In Proc. 4th Annu. ACM Symp. Theory of Computing, 125-136, 1972; and A. Milosavljevic and J. Jurka, “Discovering Simple DNA sequences by the Algorithmic Significance Method,” Comput. Appl. Biosci., 9, 407-411, 1993. The constraint of exact repetition was relaxed (see, e.g., G. Landau and J. Schmidt, “An algorithm for approximate tandem repeats,” Z. Galil, A. Apostolico, M. Crochemore and U. Manber (editors), Combinatorial Pattern Matching, volume 684 of Lecture Notes in Computer Science, 120-133, Springer-Verlag, 1993; S. K. Kannan and E. W. Myers, “An algorithm for locating non-overlapping regions with maximum alignment score,” Z. Galil, A. Apostolico, M. Crochemore and U. Manber (editors), Combinatorial Pattern Matching, volume 684 of Lecture Notes in Computer Science, 74-86, Springer-Verlag, 1993; and G. Benson, “A space efficient algorithm for finding the best non-overlapping alignment score,” M. Crochemore and D. Gusfield (editors), Combinatorial Pattern Matching, volume 807 of Lecture Notes in Computer Science, 1-14, Springer-Verlag, 1994), where approximate repeats appearing exactly twice either contiguously (the G. Landau and J. Schmidt article) or in a non-overlapping fashion (the S. K. Kannan and E. W. Myers article; and the G. Benson article) were considered. An alternative strategy has also been followed. If the length T of the repeating unit is short enough that all words of that length (see, e.g., E. Rivals and O. Delgrange, “A first step towards chromosome analysis by compression algorithms,” N. G. Bourbakis (editor), First International IEEE Symposium on intelligence in Neural and Biological Systems, 233-239, IEEE Computer Society Press, 1995) or all the possible ordered k elements in a window of size T (k<T) (see, e.g., X. Guan and E. C. Uberbacher, “A Fast Look-up Algorithm for Detecting Repetitive DNA Sequences,” Pacific Symposium on Biocomputing, 718-719, 1996) can be generated and each used as candidate for the repeating unit, then these candidates can be used as probes and “fitted” to the sequence under interrogation. The same approach can be followed if the repeating unit is given, see, e.g., V. Fischetti, G. Landau, J. Schmidt and P. Sellers, “Identifying periodic occurrences of a template with applications to protein structure,” Z. Galil, A. Apostolico, M. Crochemore and U. Manber (editors), Combinatorial Pattern Matching, volume 644 of Lecture Notes in Computer Science, 7486, Springer-Verlag, 1992. In the article. by G. Benson and M. Waterman, “A method for fast database search for all k-nucleotide repeats,” Nucelic Acids Research, 22, 4828-4836, 1994, an algorithm that detects TR of a given size was proposed.
All of the above algorithms are practical for a specific range of input parameters. For example, some ofthem are exhaustive algorithms (e.g., the S. K. Kannan and E. W. Myers algorithm; and the G. Benson algorithm), with time complexity being O(n
2
polylog(n)) where n is the length of the full sequence, and thus n cannot be too large. This is a serious constraint if entire chromosomes are interrogated (for example, n is about 1,500,000 for chromosome 4 of
S. cerevisiae
). For other algorithms (e.g., the R. M. Karp, R. E. Miller and A. L. Rosenberg algorithm; and the A. Milosavljevic s and J. Jurka algorithm), the constraint of exact repeats is untenable for biological sequences of DNA, where mutations, insertions and deletions constitute important ingredients. Tandem repeats appear in variable copy numbers, and therefore the algorithms that limit the search to only two repeats are too restrictive, see, e.g., the G. Landau and J. Schmidt algorithm. Moreover, the size of the repeating unit can be of a few hundreds, and thus the algorithms that are practical only for short
Floratos Aris
Rigoutsos Isidore
Stolovitzky Gustavo A.
August Casey P.
International Business Machines - Corporation
Ryan & Mason & Lewis, LLP
Zeman Mary K.
LandOfFree
Tandem repeat detection using pattern discovery does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Tandem repeat detection using pattern discovery, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Tandem repeat detection using pattern discovery will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2906757