System and method for mining patterns from a dataset

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C382S225000

Reexamination Certificate

active

06772152

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to data mining, and more particularly to mining long patterns in a large data set.
2. Description of the Related Art
Finding patterns in data has been an important but difficult task in many industries. Pattern discovery is of increasing importance in many applications including but not limited to biology study, consumer behavior analysis, system performance analysis, etc. A pattern discovery problem may be formulated as follows. Let D={d
1
, d
2
, . . . , d
m
} be a set of literals. A pattern P is a subset of items in D. A pattern consisting of k items is usually referred to as a k-pattern. Given two patterns P
1
and P
2
, P
1
is called a sub-pattern of P
2
if P
1
can be generated by removing some item(s) from P
2
. In such a case, P
2
is called a super-pattern of P
1
. For example, d
1
d
2
is a sub-pattern of d
1
d
2
d
3
. The sub-pattern/super-pattern relationship defines a lattice among all patterns. Given a data set T and a pattern P, let s(P) denote the significance of P in T. The goal is to discover all patterns {P|s(P)≧min_sig}, where min_sig is a user-specified threshold. A pattern is also referred to as a significant pattern P if s(P)≧min_sig. Otherwise, it is called an insignificant pattern. Serious challenges are posed to the design of a mining algorithm because the data set is typically very large (i.e., only a small fraction of the entire data set can be held in memory at once) and patterns may be substantially long (i.e., including a large number of items or events). Even with the help of the well-known Apriori property, the traditional level-wise algorithm becomes very slow.
The Apriori property states that the significance of a pattern is always less than or equal to the significance of any of its sub-patterns. This leads to a level-wise iterative evaluation of significance of patterns: at the kth level, all candidate k-patterns are evaluated via a scan of the data and the set of significant k-patterns are identified and used to generate candidate (k+1)-patterns according to the Apriori property. The process ends when no new candidate pattern can be generated. For example, if d
1
d
2
, d
1
d
3
, d
1
d
4
, d
1
d
5
, . . . , are significant 2-patterns, then d
1
d
2
d
3
, d
1
d
2
d
4
, d
1
d
3
d
4
, d
1
d
2
d
5
, d
1
d
3
d
5
, d
1
d
4
d
5
, . . . , are candidate 3-patterns in the level-wise search. It is easy to see that k scans of the data are required if a significant pattern may consist of up to is k items.
Some effort has been made to further improve performance of finding significant patterns, especially to address the inefficiency incurred by a relatively large value of k. MAXMINER (available commercially from IBM) introduced a “look-ahead” strategy in addition to the traditional level-wise approach. During the generation of candidate (k+1)-patterns from significant k-patterns, some additional candidate patterns of higher levels are also generated and evaluated together with the candidate (k+1)-patterns. In the above example, d
1
d
2
d
3
d
4
d
5
will also be generated as a candidate pattern. Note that if d
1
d
2
d
3
d
4
d
5
is significant, then any of its sub-patterns is significant without further test. Though this approach can reduce the number of scans of the data to some extent, such reduction may be insufficient and not guaranteed in many cases, especially when significant patterns include a large number of items. As a result, the MAXMINER is only suitable to mining patterns that include items in the range of dozens.
Another approach includes the use of sampling. In this approach, a set of samples are first gathered, and the significant patterns in the sample data are computed. F represents the set of significant patterns in the sample data and their immediate super-patterns. The significances of patterns in F are then computed based on the entire dataset and serve as the (advanced) starting position of a level-wise search that eventually identifies all significant patterns. This strategy is efficient if the number of significant patterns that fail to be recognized from the sample is small, which is typically the case under the assumption of a reasonably large sample size and a relatively short pattern length. However, the number of candidate patterns may be considerably large and may not fit into the main memory all at once. In turn, multiple scans of the entire data set may be required.
Therefore, a need exists for a system and method which mines patterns from a large data set. A further need exists for an efficient method which mines data from a large data set only a few scans.
SUMMARY OF THE INVENTION
A system and method are provided for discovering significant patterns from a list of records in a dataset. Each record includes a set of items, and each significant pattern includes a subset of items such that a significance of the pattern exceeds a significance level. A significance is computed for each item in the list of records to determine significant items. The records are randomly sampled to select a sample portion of the records. Ambiguous patterns are identified against the sample portion of the records and verified against the entire list of records in the dataset. This provides an efficient method for data mining.
In other embodiments, a significance computation may be performed for each item in a single scan of the list of records. The step of sampling the records may include choosing a sample size according to a size of available memory space for storing sampled records.
The step of identifying ambiguous patterns may include an iterative loop, wherein a kth iteration identifies all ambiguous k-patterns. The method may include the steps of generating candidate k-patterns, computing the significance of each candidate k-pattern in the sample portion and labeling candidate k-patterns as significant, ambiguous, or insignificant according to the significance of each candidate k-pattern in the sample portion.
A Chernoff Bound may be used to label candidate k-patterns. A domain of each candidate k-pattern may be computed from the significance of each involved item from the list of records of the dataset. Ambiguous patterns may be verified against the entire list of records. This may include pruning ambiguous patterns by employing an ordered pruning method. The ambiguous patterns may be pruned to fit together in a same memory space.
The ordered pruning method may include an iterative process, wherein each iteration may include computing a set of halfway patterns in a space of ambiguous patterns, determining significances of the halfway patterns in the entire list of records of the dataset and reducing the space of ambiguous patterns.


REFERENCES:
patent: 5748769 (1998-05-01), Nishimura et al.
patent: 5825925 (1998-10-01), Baird et al.
patent: 5940535 (1999-08-01), Huang
patent: 6001562 (1999-12-01), Milosavljevic
patent: 6289328 (2001-09-01), Shaffer
patent: 6353679 (2002-03-01), Cham et al.
patent: 6400853 (2002-06-01), Shiiyama
patent: 6507843 (2003-01-01), Dong et al.
Liuni et al “A new parallel algorithm for computation of statistically significant patterns in DNA sequences”, IEEE 1993, pp. 605-612.*
Dimauro et al “Classification of ambiguous patterns”, IEEE 1992, pp. 520-524.*
Zhang et al “Exploring constraints to efficiently mine emerging patterns from large high-dimensional datasets”, ACM 2000, pp. 310-314.*
Han et al “Mining frequent patterns without candidate generation”, ACM 2000, pp. 1-12.*
Guha et al “CURE: an efficient clustering algorithm for large databases”, ACM 1998, pp. 73-84.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

System and method for mining patterns from a dataset does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with System and method for mining patterns from a dataset, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for mining patterns from a dataset will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3291632

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.