Data processing: artificial intelligence – Knowledge processing system – Knowledge representation and reasoning technique
Reexamination Certificate
2000-01-06
2003-05-27
Davis, George B. (Department: 2121)
Data processing: artificial intelligence
Knowledge processing system
Knowledge representation and reasoning technique
C341S055000
Reexamination Certificate
active
06571230
ABSTRACT:
FIELD OF THE INVENTION
The present invention generally relates to data sequence processing methodologies and, more particularly, to methods and apparatus for discovering and generating motifs with respect to sequences of data such as, for example, sequences of characters, character sets and real numbers.
BACKGROUND OF THE INVENTION
Given an input sequence of data, a “motif” is a repeating pattern, possibly interspersed with don't-care characters, that occurs in the sequence. The data could be characters or sets of characters or real values. In the first two cases, the number of motifs could potentially be exponential in the size of the input sequence and in the third case there could be uncountably infinite number of motifs. Typically, the higher the self similarity in the sequence, the greater is the number of motifs in the data.
Motif discovery on such data, such as repeating DNA or protein sequences, is a source of concern since such data exhibits a very high degree of self-similarity (repeating patterns). Usually, this problem of an exploding number of motifs is tackled by pre-processing the input, using heuristics, to remove the repeating or self-similar portions of the input. Another way of trimming down the number of motifs is to use a “statistical significance” measure. However, due to the absence of a good understanding of the domain, there is no consensus over the right model to use.
Thus, there is a trend towards model-less motif discovery in different fields. To keep the problem manageable, it is useful to identify a small number of motifs that capture important information about the family of motifs. However, no conventional method exists which is able to satisfactorily identify a small enough number of such motifs to provide applications interested in utilizing the motifs with manageable and practical results.
SUMMARY OF THE INVENTION
The present invention provides methods and apparatus for detecting motifs in sequences of data, such as, for example, sequences of characters, character sets and real numbers, such as to provide applications interested in utilizing the motifs with manageable and practical results. Particularly, the invention provides that, for any sequence, there exists only a linear number of special motifs and every other motif can be simply generated from them. We name these special motifs “irredundant motifs.” The result is meaningful also from an algorithmic viewpoint, since the ideas from the proof can be used to design a polynomial time algorithm to detect these irredundant motifs. This bound on the number of useful motifs gives validation to motif-based approaches, since the total number of irredundant motifs does not explode. This result is of significance to most applications that use pattern discovery as the basic engine such as data mining, clustering and matching. This family of irredundant motifs is also very characteristic of the family of all the motifs. For example, in applications such as multiple sequence alignment, we have shown that the irredundant motifs suffice to obtain the alignment. However, in applications that use the motifs to extract signature motifs of sets of sequences, all the motifs, including the redundant ones, may be of relevance.
In one illustrative aspect of the present invention, a method of detecting repeating patterns (i.e., motifs) in an input data sequence, wherein the data sequence includes elements from an element alphabet, comprises the steps of: (i) obtaining the input data sequence; (ii) constructing a set of patterns from the input data sequence, each pattern being unique and including one or more elements from the input data sequence, and each pattern having a list associated therewith representing the location of the pattern in the input data sequence; (iii) removing a pattern from the set when the location list of the pattern is a union of the location lists of at least two other patterns in the set; (iv) for each pair of compatible patterns in the set, constructing a new pattern which is a concatenation of the pair of compatible patterns, each new pattern having a location list associated therewith; and (v) storing the patterns, and associated location lists, remaining after the removing step and the new pattern constructing step as the detected repeating patterns. The one or more patterns may further include one or more don't care positions, i.e., don't care characters. As mentioned, the elements of the input data sequence may be in various forms, e.g., characters, character sets or real numbers.
In accordance with such a methodology, as will be explained below, no more than 3n repeating patterns are detected given an input data sequence which includes n elements. The no more than 3n repeating patterns are maximal and non-redundant, i.e., they are irredundant patterns or motifs. From these irredundant patterns, patterns that are non-maximal and/or redundant may be generated. In one embodiment, the non-maximal and/or redundant patterns are generated in accordance with one or more annotated tries.
In one exemplary application, as will be explained below, the input data sequence is a protein sequence and the inventive methodologies are used in accordance with protein sequence homology detection.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
REFERENCES:
patent: 5977890 (1999-11-01), Rigoutsos et al.
Wu et al, “Motif. Neural Network Design for Large-Scale Protein Family Identification”, IEEE International Conference on Neural Networks, Jun. 1997.*
Gao et al, “Motif Detection in Protein sequences”, IEEE International Workshop on Groupware String Processing and Information Retrieval Symposium, Sep. 1999.*
L. Parida et al., “Pattern Discovery on Character Sets and Real-valued Data: Linear Bound on Irredundant Motifs and an Efficient Polynomial Time Algorithm,” Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 297-308, 2000.
Yamaguchi et al., “Protein Motif Discovery from Amino Acid Sequence by Sets of Regular Patterns,” Research Report of Information Processing Society, Information Processing Society, vol. 95, No. 76, 33-40, (Jul. 28, 1995). (English Translation).
Rigoutsos et al., “Bioinformatics; Combinational Pattern Discovery in Biological Sequences: the TEIRESIAS Algorithm,” Oxford University Press, vol. 14, No. 1, 55-67 (1998).
L.A. Parida, “Algorithmic Techniques in Computational Genomics,” Ph.D. thesis, Courant Institute of Mathematical Sciences, New York University, (1998).
A. V. Hoe et al., “Elementary Graph Algorithms,” Data Structure and Algorithms, Addison-Wesley Publishing Company, Chapter 23, pp. 465-493, 1983.
R. Agrawal et al., “Fast Discovery of Association Rules,” Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Chapter 12, pp. 307-328, MA, 1995.
D. Gusfield, “Linear-Time Construction of Suffix Trees,” Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, Chapter 6, pp. 94-119, New York, 1997.
Y. Gao et al., “Motif Detection in Protein Sequences,” pp. 1-12, Apr. 16, 1999.
L.P. Parida, “Algorithmic Techniques in Computational Genomics,” Ph.D. thesis, Courant Institute of Mathematcial Sciences, New York University, Part II, Sequence Analysis, Chapter 9, Pattern Discovery, pp. 130-142, Sep. 1998.
L. Parida et al., “An Approximation Algorithm for Alignment of Multiple Sequences Using Morif Discovery,” Journal of Combinatorial Optimization, pp. 1-36, 1999.
L. Parida et al., “MUSCA: An Algorithm for Constrained Alignment of Multiple Data Sequences,” Genome Informatics, pp. 1-15, 1999.
August Casey P.
Davis George B.
International Business Machines - Corporation
Ryan & Mason & Lewis, LLP
LandOfFree
Methods and apparatus for performing pattern discovery and... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Methods and apparatus for performing pattern discovery and..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Methods and apparatus for performing pattern discovery and... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3029635