System and method for discovering patterns with noise

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C702S019000

Reexamination Certificate

active

06691110

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to discovering significant patterns in long sequences of items, and more particularly for a system and method for identifying significant patterns in sequences which include noise.
2. Description of the Related Art
With the large amounts of data being stored and used, discovering and understanding significant patterns in large data sets has become increasingly important. Significant pattern discovery has taken on greater importance in a plurality of new fields as well as new applications for existing technologies. Support (the number of occurrences) of a pattern has been proposed as the metric of significance in an article by R. Agrawal et al., “Mining association rules between sets of items in large databases.” Proc. ACM SIGMOD Conf. on Management of Data, 207-216, 1993. As discussed in Agrawal et al., an input is a set of transactions, and each transaction includes a set of items. The significance of a set of items is determined by the number of transactions which contain this set of items.
Due to the presence of noise, a symbol may be misrepresented by some other symbols. This substitution may prevent an occurrence of a pattern from being recognized and, in turn, slashes the support of that pattern. As a result, a frequent pattern may be “concealed” by the noise. This phenomenon commonly exists in many applications.
For example, in bio-medical study, mutation of amino acids is a common phenomenon studied in the context of biology. Some mutations are proven to occur with a non-negligible probability under normal circumstances and incur little change to an organism's biological functions. For example, the amino acid N in the human body is likely to mutate to D with little impact on behavior. In this sense, the amino acids should not be considered as a-totally independent.
In the area of performance analysis, many system-monitoring applications involve collecting and analyzing attributes that take continuous numerical values. A common approach to process the data is to quantize the domain into categories. If the true value of an attribute is close to the boundary of the quantization, there is a fair chance that the observed value may fall into the adjacent bin and be represented by a different label. It would be desirable if such kind of distortion can be taken into account during a data mining process. In the area of consumer behavior, for example, in the supermarket, consumers frequently buy a slightly different product or brand from their original intent due to various reasons, such as, the desired product was out of stock or misplaced. Allowing obscurity in item matching may unveil the customer's real purchase intention.
This problem becomes critical when the pattern is substantially long because an occurrence of a long pattern is much more subject to distortion caused by noise. In general, the length of a gene expression can range up to a few hundreds of links, if amino acids are taken as the granularity of the analysis. Some clinical studies show that, the amino acids N, K, and V are relatively more likely to mutate to amino acids D, R, and I, respectively. The corresponding gene expressions after the mutation may differ from the standard one. It is more equitable to treat them as possible (degraded) occurrences of the standard expression than to consider them as totally independent gene expressions.
Therefore, a need exists for a system and method which discovers significant patterns while accounting for noise effects. A further need exists for a new measure that accounts for mutation or naturally occurring changes in data in discovering significant patterns.
SUMMARY OF THE INVENTION
A system and method for determining patterns in a data sequence constructs a compatibility matrix which provides a probability between an actual occurrence of an item and an observed occurrence of that or another item between each item in the data sequence. Candidate patterns are generated. The candidate patterns include items in the data sequence. The candidate patterns are checked against the data sequence to determine a match value based on the compatibility matrix, and significant matches are determined based on candidate patterns having the match value above a threshold.
In alternate systems and methods, the items may include symbols and a compatibility matrix may be constructed which includes constructing a matrix such that a match is determined between any two symbols in the data sequence. The compatibility matrix may include rows and columns and each entry in the compatibility matrix corresponds to a row and a column. The match value between two items may include a number between 0 and 1.
The candidate patterns may be checked against the data sequence to determine a match value based on the compatibility matrix. This may include, for a pattern p and a sequence s of symbols, determining an overall match value of p with respect to s by aggregating p with respect to each subsequence s′ with 1 symbols in s. The determining of an overall match value of p with respect to s may include determining a match value between p and s′ by taking a product of the match value between symbols at each position in the data sequence.
The probability between an actual occurrence of an item may determined by experiment or expert opinion. The items may include symbols and candidate patterns may be generated using a level-wise approach wherein at each level one additional symbol is added to a total number of symbols considered in a candidate pattern. The candidate patterns may include a significant pattern if all sub patterns of the candidate pattern satisfy the threshold. Significant matches may be determined based on candidate patterns having the match value above a threshold such that for each candidate pattern, a match value of that candidate pattern is verified against the input sequence to determine a set of patterns that satisfy the threshold.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.


REFERENCES:
patent: 5129002 (1992-07-01), Tsuboka
patent: 5189709 (1993-02-01), Wang et al.
patent: 5710864 (1998-01-01), Juang et al.
patent: 5794198 (1998-08-01), Takahashi et al.
patent: 5809499 (1998-09-01), Wong et al.
patent: 5832108 (1998-11-01), Fukita et al.
patent: 5920839 (1999-07-01), Iso
patent: 6108666 (2000-08-01), Floratos et al.
patent: 6278799 (2001-08-01), Hoffman
patent: 6373971 (2002-04-01), Floratos et al.
patent: 6400853 (2002-06-01), Shiiyama
patent: 6401043 (2002-06-01), Stanton, Jr. et al.
patent: 6408250 (2002-06-01), Grate et al.
patent: 6466874 (2002-10-01), Eisenberg et al.
patent: 2003/0036854 (2003-02-01), Desjarlais
Park et al “Mining association rules with adjustable accuracy”, ACM 1997, pp. 151-160.*
Guha et al “CURE: an efficient clustering algorithm for large databases”, ACM 1998, pp. 73-84.*
Agrawal et al., “Mining Association Rules Between Sets of Items In Large Databases,” Proc. ACM SIGMOD Conf. on Management of Data. 207-216. 1993.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

System and method for discovering patterns with noise does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with System and method for discovering patterns with noise, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for discovering patterns with noise will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3328552

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.