Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-10-16
2001-03-20
Lintz, Paul R. (Department: 2771)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C341S055000
Reexamination Certificate
active
06205444
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to the field of aligning data sequences, and, more specifically, to a method and system for aligning multiple data sequences to achieve the best commonality of the sequences.
2. Discussion of the Prior Art
The problem of aligning a large number of sequences appears very often and in many different application areas (e.g. biology, data mining in databases, computer security etc.). It has been observed that most traditional methods of aligning multiple sequences do not work well on large number of sequences on the order of N>10.
The multiple sequence alignment problem has been studied for at least the last fifteen years, in the context of Computational Biology, it is well known that this is a rather difficult problem. Contributing to the difficulty in solving the problem is that in its most general form, it is difficult to model to the satisfaction of Evolutionary Biologists, Geneticists and other users. The most popular and successful approach to date has been a dynamic programming technique using different mechanisms of scores that is a function of the edit distance along with gap penalties to evaluate the similarity of the different sequences. Dynamic programming relies on identifying good penalty scores for matches and mis-matches. This is difficult to realize in real-world applications. The method is best suited to small sequence numbers on the order of N<6.
For the situation where there are more than two sequences to be aligned, one of the prior art approaches has been to perform a pairwise alignment where two of the N sequences are analyzed at a time and an N-wise alignment is built from the pairwise alignments. This approach works well for small values of N on the order of N<6, however, for large values of N additional constraints are required to give meaningful alignments.
It is apparent that the problem of aligning multiple sequences where the number of sequences is large is computationally a very demanding one.
SUMMARY OF THE INVENTION
The problems stated above and the related problems of the prior art are solved with the method and system according to the present invention. A method is provided that is specifically suited to aligning a large number of sequences, on the order of N>10, simultaneously, that brings out the best commonality of the N-sequences. The method is built on top of an existing method for identifying or discovering motifs in sequences. A motif can be informally defined as a repeated pattern that appears in two or more input sequences. The method uses some, if not all, of these identified motifs to give a good alignment. The present invention also provides an additional constraint, K, as a user defined control parameter. The additional parameter constrains the alignment of the N sequences to have at least K of the N sequences agree on a character, whenever possible, in the alignment. The alignment number, K, provides a natural constraint for dealing with a large number of sequences in that a commonality across most, if not all sequences is required to be detected.
Given a set of N input character sequences, the present invention aligns the N sequences in a way that brings out the best commonality of the N character sequences. At a motif discovery stage, the method of the present invention first identifies any motifs which may exist in the input sequences. At a pruning and alignment stage, the method detects one or more infeasible sets from the motifs identified at the motif discovery stage. Each infeasible set is comprised of a subset of those identified motifs which preclude an alignment which brings out the best commonality. Once identified, one or more motifs from the infeasible sets are removed (“pruned”), from consideration when aligning the sequences at an alignment stage. The remaining motifs comprise a feasible set which is then used to align or render the sequences to maximize a pre-specified cost function.
The sequence alignment method of the present invention is advantageous for the following reasons; 1) The approach is efficient, particularly on large numbers of long sequences, 2) a direct N-wise alignment is achieved by not composing the alignments from lower order (pairwise) alignments as practiced in the prior art, 3) the approach is adaptable to underlying biological or other models, using the idea of in-exact motif, 4) it allows for easy user intervention and specification in terms of which motif the user must have in the alignments, 5) it gives various alternate solutions, if required, to the user based on different configurations that the user wants to see rather than give a cost value which the user may or may not be able to relate to, 6) the resulting alignment is independent of the order of the input sequences.
The invention and its operation will become more apparent from the following description of an exemplary embodiment and from the accompanying drawings.
REFERENCES:
patent: 5897637 (1999-04-01), Guha
patent: 5930789 (1999-07-01), Agrawal et al.
patent: 5977890 (1999-11-01), Rigoutsos et al.
patent: 5983224 (1999-11-01), Singh et al.
patent: 6061682 (2000-05-01), Agrawall et al.
patent: 6092065 (2000-07-01), Floratos et al.
patent: 6094645 (2000-07-01), Agrawal et al.
Agrawal et al., “Mining Sequential Patterns”, IEEE Computer Society Press, 1063-6382, pp. 25-33, Mar. 1995.
Floratos Aris
Parida Laxmi P.
Rigoutsos Isidore
F. Chau & Associates LLP
International Business Machines - Corporation
Lintz Paul R.
LandOfFree
Multiple sequence alignment system and method does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Multiple sequence alignment system and method, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Multiple sequence alignment system and method will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2528453