Systems and methods for discovering mutual dependence patterns

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

[ 0.00 ] – not rated yet Voters 0 Comments 0

Details Systems and methods for discovering mutual dependence patterns Systems and methods for discovering mutual dependence patterns

: 2001-07-30
: 2004-12-07
: Corrielus, Jean M. (Department: 2172)
: Data processing: database and file management or data structures
: Database design
: Data structure types

: C707S793000, C707S793000, C707S793000, C707S793000, C707S793000
: Reexamination Certificate
: active
: 06829608
: ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to data processing techniques and, more particularly, to methods and apparatus for discovering mutual dependence patterns in data.
BACKGROUND OF THE INVENTION
Data mining seeks to discover interesting and previously unknown patterns or information from a large amount of historical data often stored in a database. Specifically, one key aspect of data mining is to search for significant patterns embedded in the data. Here, a pattern refers to a set of items denoted as pat={i
1
, i
2
, i
3
, . . . , i
k
}, where i
j
is the j-th item.
Existing approaches have focused on discovering one special form of pattern called a frequent association pattern, referred to as “fa-pattern.” Fa-patterns are patterns whose support (or occurrences) in data is above a predefined minimum support threshold called minsup. Several applications of the fa-pattern have been studied. The most popular one is the “market basket” analysis, in which an algorithm is applied to mine transaction data consisting of a set of transactions. A transaction is a set of items purchased by a customer. For example, a customer may buy milk, bread, and beer, together. The corresponding transaction is thus trans={a, b, c}, where a, b, and c represent milk, bread, and beer, respectively. The association discovery problem can be formally stated as: find all patterns (i.e., a set of items) whose number of co-occurrences in D is above a predefined threshold called minimum support (minsup), where D is a set of N transactions {trans
1
, . . . , trans
N
}. We note that an item here is a generic name. It is mapped to an original data object by a certain mapping scheme. For example, an original data object of transaction data may have multiple attributes such as the type of goods (e.g., milk, beer), its brand, quantity, and purchase time. One commonly-used mapping is to map the values of the type into items. For example, milk is represented by “a.”
Fa-patterns can be generalized to handle temporal events. Here, the temporal event data is an ordered sequence with length N: {(i
1
, t
1
), (i
2
, t
2
), . . . , (i
N
, t
N
)}, where time t
i
t
j
if i≦j. The temporal association discovery problem can be stated as: find all patterns whose number of co-occurrences within a time window w is above minsup. Here, the time window is introduced to essentially segment an ordered time sequence into transactions.
Finding all fa-patterns is not a trivial task because the pattern space is exponentially large, to be precise, n
k
, where n is the number of distinct items, and k is the maximum length of a pattern. Brute-force iteration is computationally intractable. Recently, Agrawal et al. (as described in R. Agrawal et al., “Mining Association Rules Between Sets of Items in Large Databases,” Proc. of VLDB, pp. 207-216, 1993, the disclosure of which is incorporated by reference herein) developed an algorithm called “Apriori” to discover all fa-patterns. This algorithm searches the pattern space in a level-wise manner by the following four step process:
1. Initialization. The data is scanned to find all fa-pattern with only one item. k is set to be 2.
2. Construct candidate patterns with length k. This is typically done by a joint operation of fa-patterns found in the previous level, followed by a pruning operation.
3. Count the candidate patterns. Data is scanned in order to count the occurrences of candidate patterns.
4. Find fa-patterns at the k-th level. Fa-patterns are those candidate patterns whose count (or occurrences) are above minsup.
This procedure proceeds level by level until no more patterns can be found. The key idea of this algorithm is to search the pattern-space in a level-wise manner. The fa-patterns found at the current level are used to eliminate the search space for the next level. In this way, the number of patterns to be searched are minimized, and the number of data scans is the maximum length of fa-patterns. Since the introduction of the “Apriori” algorithm, work has been done to improve the algorithm so as to reduce the number of data scans, reduce the memory requirement, and improve efficiency through different search strategies.
Needless to say, association discovery has been widely employed for applications such as market basket analysis. The technique's success is partly because frequent patterns capture patterns that are popular, and thus provide valuable information for directly marketing to relevant portions of a large population, rather than to the entire population. This maximizes the advertisement return.
However, frequent patterns may not be of equal interest in other tasks such as, for example, problem detection in computer system management, intrusion detection in computer systems, and credit card fault detection. There are several fundamental reasons for this situation:
1. Frequent patterns are not always of interest. In the aforementioned applications, the normal operations or behaviors are usually massive in quantity. It is not a surprise that a large number of frequent patterns can be found in these applications. However, most of them relate to normal behaviors, and thus are usually not very informative. This is simply because normal operations are usually known by domain expertise, and are actionable. Furthermore, even if a frequent pattern relates to a problem, it is usually known already through other means, since unknown problematic situations are, in general, infrequent in these applications.
2. Infrequent patterns may be of interest. For example, a system management application may be required to discover problematic situations that are expected to be rare. Applying existing algorithms for discovering association patterns will not result in finding infrequent patterns, unless the minimum support threshold is set to be extremely low. This, however, results in far many uninteresting patterns.
3. Co-occurrence does not necessarily reflect the dependence of items in a pattern. By definition, the occurrences of an fa-pattern is at least minsup. However, this does not guarantee any real dependence among items in an fa-pattern. In an extreme case, a set of independent items may be qualified as an fa-pattern because the frequent association does not take into consideration the distribution of each item. This is further explained by the following example. Assume that items a and b occur independently and randomly in 50% of all transactions. In this case, the expected frequency of the co-occurrence of {a,b} is 25%, which is still pretty significant, and may well be above minsup.
SUMMARY OF THE INVENTION
The present invention provides techniques for mining or discovering infrequent patterns that can not be found effectively and efficiently using existing pattern mining techniques and which may be valuable for a variety of applications. Specifically, the invention defines a mutual dependence pattern or “m-pattern.” An m-pattern captures a set of items that often occur together regardless of the number of occurrences. Thus, infrequent, but interesting, patterns may be found.
In one aspect of the invention, a technique for mining one or more patterns in an input data set of items comprises identifying one or more sets of items in the input data set as one or more patterns based on respective comparisons of conditional probability values associated with each of the one or more sets of items to a predetermined threshold value. The one or more identified patterns are output based on results of the comparisons. The input data set may comprise such data as event data and/or transaction data.
In one embodiment, the identifying operation may comprise identifying a set of items in the input data set, which includes at least two subsets of at least one item, as a pattern when the set of items has a conditional probability value computed therefor that is not less than a predetermined threshold value, wherein the conditional probability value is indicative of a probability that both of the at l

Affiliated with

Hellerstein Joseph L.

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Ma Sheng

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Also associated with

Corrielus Jean M.

Examiner

[ 0.00 ] – not rated yet Voters 0 Comments 0

International Business Machines - Corporation

Corporate Assignee

[ 0.00 ] – not rated yet Voters 0 Comments 0

Perez-Pineiro Rafael

Attorney

[ 0.00 ] – not rated yet Voters 0 Comments 0

Ryan & Mason & Lewis, LLP

Law Firm

[ 0.00 ] – not rated yet Voters 0 Comments 0

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Systems and methods for discovering mutual dependence patterns does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Systems and methods for discovering mutual dependence patterns, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Systems and methods for discovering mutual dependence patterns will most certainly appreciate the feedback.

Rate now

Comments { 0 }

Profile ID: LFUS-PAI-O-3279927

All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.

Canada

Charities
Companies
MP Candidates
Patents
Employee Salary Disclosure

World

Places of the World
Scientific Papers

United States

Banks
Companies
Counties
Patents
Employee Salary Disclosure