Method and apparatus for pattern discovery in protein sequences

Image analysis – Applications – Biomedical applications

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C382S217000, C707S793000

Reexamination Certificate

active

06373971

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Technical Field
The invention relates to the field of pattern discovery, and, more specifically, to pattern discovery in 1-dimensional event streams.
2. Description of the Related Art
An event stream is a sequence of events which are taken from a finite set of possible events. This latter set can be thought of as an alphabet in which case an event stream is a string over that alphabet. The term sequence used below may refer to an event stream or a sequence of characters belonging to an alphabet. A pattern is a specific set of letters with a given spatial arrangement, typically described as a regular expression.
An example of such a pattern is “AF..H..RR” where the dots are used to indicate that the respective positions could be occupied by “any” letter (“don't care” character). An event string is said to match the pattern at a given position i, if and only if the letters of the pattern all match the corresponding letters of the event string, when placed at offset i; a don't care character is assumed to match any letter of the alphabet. For example, “AF..H..RR” matches “HWIRTAFLKHAARRIKWL” at position 6.
The problem of pattern discovery is computationally a very demanding one. Indeed, it can be proven to be NP-hard (unless the type of patterns sought is extremely simple). The problem can be stated as follows:
“Given a set S={s
1
, s
2
, . . . , s
m
} of one ore more sequences s
i
(i.e. strings) over an alphabet &Sgr; of letters and positive integer K, find all the patterns which match K or more of the input sequences in S.”
In this first formulation, what is sought is those patterns that appear in at least K of the sequences of the input set. However, it may happen that a pattern appears in fewer than K of the sequences, but more than once in some of those sequences. In other words, one or more sequences may contain multiple occurrences of a given pattern. Consequently, such a pattern may appear in fewer than K sequences but more than K times when all sequences of S are considered. The definition of the problem can then be modified to capture this case as well:
“Given a set S={s
1
, s
2
, . . . , s
m
} of one or more sequences s
i
over an alphabet &Sgr; of letters and positive integer K, find all the patterns which appear K or more times in the sequences in S.”
The invention that is described below can handle both of these versions of the pattern discovery problem.
The problem of discovering patterns in event streams appears very often and in many different application areas (biology, data mining in databases, computer security etc.). Depending on the particular domain at hand there are many different definitions of the notion of the pattern as well as of what it means for an input to match a particular pattern.
Almost invariably the input is a set S composed of arbitrary strings over a given alphabet &Sgr; and a pattern is any member of a well defined subset C of all the possible regular expression over &Sgr; (e.g. C can be the set of all regular expression with some predefined maximum length). The reasons why the search for patterns is restricted to some subset of all possible regular expressions are both domain-specific and also of a practical nature since (as is explained later) the problem is computationally extremely demanding.
Being a regular expression, every pattern P defines a language L(P) in the natural way: a string belongs to L(P) if it is recognized by the automaton of P. An input string w is said to match a given pattern P if w contains some substring that belongs to L(P). The problem then is to a discover all patterns in C matched by at least a given (user-defined) minimum number of strings in the input set S.
For an illustration, consider the following set of strings over the English alphabet:
S={“LARGE”, “LINGER”, “AGE”}
In this case the pattern “L..GE” has support
2
since it is matched by the first two strings of S (the special character ‘.’ in the pattern indicates a position that can be occupied by any character). The term support is used here to denote the number of input strings matching a given pattern. As another example, the pattern “A*GE” has also support
2
(it is matched by the first and the last strings). Here, the character ‘*’ is used to match substrings of arbitrary length, which can include strings of zero length.
The computational complexity of the pattern discovery problem directly depends on the class C of regular expressions that the patterns belong to. In the most straightforward case a pattern is simply a string over the alphabet &Sgr; (i.e. just a sequence of alphabet symbols—no don't care characters) and the problem of finding all such patterns with a given minimum support can be solved in polynomial time using generalized suffix trees. An example of such a pattern discovery problem is illustrated in Hui, “
Color Set Size Problem with Applications to String Matching
”, Proceedings of the 2nd Sumposium on Combinatorial Pattern Matching, 1992, pp. 230-243.
In almost every other case though, the class C is expressive enough to render the problem NP-hard. The hardness result can be usually shown by a reduction from the longest common subsequence problem, discussed in Garey and Johnson, “
Computers and Intractability: a Guide to the Theory of NP-Completeness
”, 1979, and Maier “
The Complexity of Some Problems on Subsequences and Supersequences
”, Journal of the ACM, 1978, pp. 322-336. What this means in practice is that there can be no algorithm guaranteed to produce, for every possible input, all the required patterns in time polynomial to the input length.
One way to bypass the hardness of the problem is to design approximation algorithms, i.e. algorithms that are guaranteed to work “fast” for every input but do not necessarily find all the existing patterns for every given input. Another approach is to further restrict the patterns that the algorithm can discover so that they can be found efficiently. This is usually left to the discretion of the user by providing him/her with the ability to appropriately set a number of parameters that decide the structure of the patterns to be looked for. A typical example is to allow the user to define the maximum length that a pattern can have. Providing the user with that kind of control over the search space is not unreasonable since an expert user can apply domain knowledge to help a program avoid looking for patterns that are meaningless or impossible for the domain at hand. In fact, this expert knowledge is usually an integral part (in the form of various heuristics) of most of the pattern discovery algorithms that exist in the literature (the disadvantage of this approach is that most of these algorithms are usually inapplicable outside the particular domain which they were designed for). Finally, there are the algorithms that just accept the hardness of the problem and proceed head-on to find all possible patterns. Such algorithms are bound to be inefficient (space and/or time-wise) on some “bad” inputs but, depending on the domain at hand, their overall performance (amortized over all “reasonable” inputs) might be quite satisfactory. In such cases, it becomes very important to introduce a number of heuristics that will speed up the algorithm. The method presented here belongs to this category.
The standard way to assess the “quality” of an algorithm (at least in the context of computer science) is by its time/space complexity. For the pattern discovery problem, though, such a characterization is rather useless. The reason is the NP-hardness of the problem: any worst-case analysis is doomed to give bounds which are super-polynomial (unless the complexity classes P, NP are equal, something extremely unlike). There are, however, other features that are of interest when evaluating a pattern discovery algorithm. Some of these features are:
The subclass C of regular expressions containing the patterns under consideration. In general, it is desirable to have as expressive a class C as possible. Th

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method and apparatus for pattern discovery in protein sequences does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method and apparatus for pattern discovery in protein sequences, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus for pattern discovery in protein sequences will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2901999

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.