Data processing: measuring – calibrating – or testing – Measurement system in a specific environment – Biological or biochemical
Reexamination Certificate
1999-12-03
2002-08-13
Zeman, Mary K. (Department: 1631)
Data processing: measuring, calibrating, or testing
Measurement system in a specific environment
Biological or biochemical
C702S020000, C708S003000, C708S005000, C708S203000, C708S210000, C708S230000, C708S400000, C712S036000
Reexamination Certificate
active
06434488
ABSTRACT:
FIELD OF THE INVENTION
The present invention generally relates to data comparison methodologies and, more particularly, to methods and apparatus for performing an alignment free and rapid determination of differences between a test data set, e.g., a probe protein sequence, and known data sets, e.g., a given protein sequence data base.
BACKGROUND OF THE INVENTION
It is known that there are various deficiencies in existing methods associated with certain applications in bioinformatics and computational genomics. These deficiencies are of increasing concern as the level of genomic data mounts. Of particular concern are existing algorithms designed to detect similar protein sequences in a known database given a test sequence. For example, many of such existing algorithms implicitly or explicitly require a sequence alignment operation despite the fact that it is desirable to avoid such an operation. Further, none of the existing algorithms are particularly well suited for efficient utilization on a relational data base. Still further, current protein sequence search and sequence-pattern recognition methods depend on very sophisticated mathematical methodologies.
Examples of such existing methodologies include Smith-Waterman (M. S. Waterman, “Introduction to Computational Biology,” Chapman & Hall, London (Smith-Waterman), 1988), BLAST (S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, J. Mol. Biol., 215, 403-410 (BLAST), 1990), FASTA (W. R. Pearson and D. J. Lipman, Proc. Nat. Acad. Sci.uS, 85, 2444-2448 (FASTA), 1988). While the above methodologies are approaches inherently based on, and aiding, protein sequence alignment in order to abstract common features, more recent approaches such as TERESIAS and SPLASH eschew the need for alignment by directly abstracting common patterns, see, e.g., I. Rigoustos, A. Floratos, and C. Ouzounis, IBM RC20803(92166), Apr. 26, 1976; A. Califano and I. Rigoustos, In Proc. Symp. on Intelligent Systems for Mol. Biol., Washington; G. Stolovitsky and A. Califano, “Discrete Applied Mathematics Series,” ed. P. Penver; and I. Rigoustos and A. Floratos, In Proc. 2nd Annual ACM Intnl. Conf. of Comp. Mol. Biol., 1993. By assembling directories of patterns in known protein sequences and pointers to their source proteins, the latter three approaches can also be used to sensitively identify protein sequences which are related. However, these latter approaches require the derivation, continued updating and use of large dictionaries of patterns or “motifs.” Also, none of the approaches are especially well suited to relational databases.
SUMMARY OF THE INVENTION
The present invention provides methods and apparatus for performing an alignment free and rapid determination of differences between a test data set and known data sets. In a broad aspect of the invention, a method for generating data characterizing an item described by an ordered string of characters, comprises the steps of: (i) for a set of separation metrics each representing a unique number of positions of separation between arbitrary characters in a character group in the ordered string of characters, associating first with each separation metric; generating a set of character groups, wherein each character group comprises at least two characters contained within the ordered string of characters; and (ii) for at least one given character group in the set of character groups, for each given separation metric in the set of separation metrics, generating second data representing number of occurrences that the given character group satisfies the given separation metric; generating third data associated with the given character group, wherein the third data is based upon the second data and the first data; and storing the third data in memory for subsequent use.
Preferably, said item is a protein sequence, and each character in the ordered string of characters represents at least one amino acid. Further, each character group may comprise a character pair contained within the ordered string of characters, and wherein each separation metric represents a unique number of positions of separation between arbitrary characters in a character pair in the ordered string of characters. The third data may be stored in an entry associated with the given character pair, wherein the entry is part of table of entries stored in memory, wherein each entry corresponds to a unique character pair. The first data may comprise a set of unique prime numbers corresponding to set of separation metrics. The set of unique prime numbers may be an ordered set beginning with the number 2. The step of generating the third data associated with the given character pair may be generated by: for each given separation metric in the set of separation metrics, transforming the second data using a function based upon the unique prime number associated with the given separation metric; and adding the transformed first data to a running sum associated with the given character pair. The transforming step may comprise multiplying the second data by the natural logarithm of the unique prime number associated with the given separation metric.
Further, the method may provide that the table entries for a given item be compared with the table entries for another item to provide a measure of similarity between the two items. The comparison between table entries for the two items is preferably based upon the numerical difference between corresponding table entries. Still further, the measure of similarity between the two items is preferably computed by: summing the absolute value of the numerical difference between corresponding table entries, and normalizing the resultant sum.
In a specific embodiment in the context of genetic information processing, the present invention provides a methodology of representing protein sequences or parts thereof by a “fingerprint” or “recognition silhouette” which identifies a protein as belonging to a particular class. To achieve this, the present invention enables rapid estimation of degrees of relatedness in a form of “pattern content” sense, which also has more specific routine applications. The particular significance is that a data base searched can consist of fingerprints representing whole sequences or parts of sequences (including segments of sequences sometimes called “blocks,” domains” or “sub-domains,” and such fingerprints may serve as an alternative to storage of specific sequence, or can represent whole families, or recreate parts of proteins such as domains or even relatively small fragments and consensus sequences.
To do this, pattern matching and its application in genomics to searching for homologousprotein sequences, is expressed in some of the concepts of prime number theory. In accordance with the invention, patterns can be seen as recurrent prime factors which one may deem irreducible in string data, or relatively prime in the context of the problem addressed, and their isolation from the embedding string data corresponds to factorization. Though this may be simple analogy rather than a “deep relationship,” comparison leads to some useful preliminary screening tools for showing the relationships between two sequences of symbols.
To do this, patterns in a special representational form are identified. This special representational form is referred to as a “finger matrix.” A simple mapping by pattern data to prime numbers is defined, and is used, in turn, to encode multiple symbol-separation data into the simple scalar elements of a 20×20 residue (amino acid type) array, in one embodiment, this matrix being characteristic of individual protein sequences (assuming a sliding window of 10 residues).
Generally, a homology refers to a similarity, likeness, or relation between two or more sequences or strings. However, as will be explained, the invention provides homology detection through a methodology of identifying differences between a given sample and a known data base. In a specific application, the method is employed to measure the differences between sequences, rather than look
August Casey P.
Ryan & Mason & Lewis, LLP
Zeman Mary K.
LandOfFree
Alignment free methodology for rapid determination of... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Alignment free methodology for rapid determination of..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Alignment free methodology for rapid determination of... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2940388