Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-12-26
2003-12-16
Metjahic, Safet (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000
Reexamination Certificate
active
06665669
ABSTRACT:
TECHNICAL FIELD
This invention relates to data processing and, in particular to a method and system for mining frequent patterns from databases. The invention has application in many fields including business decision making, marketing, customer relation management, medical and biological research, and the like. The field in which this invention falls is variously described as “frequent pattern mining”, “association mining”, and “frequent sets mining”.
BACKGROUND
Finding frequent patterns in databases gaining importance as a way to obtain valuable business information. For example, a merchant who maintains a database containing records of transactions might be interested in determining any patterns in the purchasing habits of the merchant's customers. For example, the merchant may wonder answer questions such as “what pairs of items are typically purchased by consumers at the same time?”. A scientist studying a genome may have a database containing records of gene sequences and may wish to know whether certain sequences tend to occur together in the same stretch of DNA. A researcher studying responses given in a census, survey or consumer questionnaire may wish to identify patterns in the responses. Data mining methods can be applied to these problems and to detecting other types of correlation. Data mining is the field of deriving information about patterns expressed in large collections of information.
Various data mining methods are known. For example, Agrawal et al., U.S. Pat. No. 5,794,209 describes a method for discovering consumer purchasing tendencies. The method is implemented in a computer program which identifies consumer transaction itemsets that are stored in a database and which appear in the database a user-defined minimum number of times.
Agrawal's method belongs to a class of data mining methods called apriori methods. Apriori methods for identifying frequent patterns begin by scanning a database of itemsets, each comprising a number of items and identifying frequent items. The methods then generate candidates for frequent patterns by taking the frequent items taken together in all possible pairs. The database is then scanned to determine the frequency of each candidate pair. Once frequent pairs have been identified then candidates for frequent itemsets each with three items can be generated by taking each frequent pair together with another frequent item. The methods then scan the database to determine which of the candidates for frequent triplets actually occur frequently. The methods can proceed iteratively to identify frequent patterns of any length.
Apriori methods take advantage of the idea that any subset of a frequent itemset must itself be frequent. No frequent itemset can include any items which are not themselves frequent in the database. Apriori-like methods use this idea to prune candidate sets. This dramatically reduces the number of candidate sets that must be checked for frequency. In essence, apriori methods use a known collection of itemsets which are frequent and have (k−1) items to generate candidates for frequent itemsets having k items. Database scanning and pattern matching is used to collect counts for the candidate itemsets.
Apriori-like methods all have the significant disadvantage that they are much slower to execute than is desirable. The time expended by such methods is largely occupied by scanning the database. The candidate sets can be extremely large. For example, in a case where a database contains 10
4
frequent items, an apriori-like method will generate roughly 10
7
candidate itemsets of length 2. The number of candidate itemsets becomes unmanageable in cases where long patterns are being searched for. For example, to discover a frequent pattern of size 100, one needs to generate 2
100
≈10
30
candidates. A large database may contain many gigabytes of data. Even scanning a large set of candidates against a large database takes a significant amount of time even with modern computer hardware running optimized software. Finding long frequent patterns in large databases with apriori-like methods is impractical.
Various techniques have been used to prune candidate sets to make it practical to search for long frequent patterns in databases. However, such techniques are still slow, especially in cases with large databases in which there is a reasonably large number of both long and short frequent patterns.
There is a need for methods and systems for quickly identifying frequent patterns in large databases.
SUMMARY OF THE INVENTION
This invention provides methods and apparatus for mining frequent patterns from databases. The invention has particular application when applied to large databases.
One aspect of the invention provides a method for identifying patterns from a database of records. Each record has a plurality of items. The method comprises constructing an FP-tree for the database; and, mining the FP-tree to obtain frequent patterns. In preferred embodiments of the invention, constructing the FP-tree comprises: scanning the database to obtain an ordered list of frequent items in the database; and, then, for each record in the database: creating a list of any frequent items occurring in that record in the same order as the frequent items occur in the ordered list; setting a root node of the FP-tree as a current node; and, for each item in the list of any frequent items, determining whether there is a node directly linked to the current node which corresponds to the item. If there is a node directly linked to the current node which corresponds to the item incrementing a counter for the node and setting the node as the current node. Otherwise the method creates a node corresponding to the item and linked to the current node and sets the created node as the current node. Preferably the frequent items in the ordered list are ordered in order of their frequency in the database.
The FP-tree preferably comprises a header data structure which includes a record for each of the frequent items in the database.
Mining the FP-tree to obtain frequent patterns preferably comprises: for each frequent item constructing a conditional pattern-base, and constructing a conditional FP-tree from the conditional pattern-base; recursively constructing a conditional pattern-base, and constructing a conditional FP-tree from the conditional pattern-base on each newly created conditional FP-tree until the resulting FP-tree is empty; and, after creating each FP-tree, collecting frequent itemsets from the FP-tree. Preferably the method includes determining whether a conditional FP-tree contains only one path and, of so, generating all combinations of sub-paths of the FP-tree and recording each sub-paths as a frequent pattern.
The invention also provides a method for constructing an FP-tree corresponding to a database and containing information useful for identifying frequent patterns in the database. The method comprises scanning the database to obtain an ordered list of frequent items in the database. Then, for each record in the database, the method: creates a list of any frequent items occurring in that record in the same order as the frequent items occur in the ordered list; sets a root node of the FP-tree as a current node; and, for each item in the list of any frequent items, determines whether there is a node directly linked to the current node which corresponds to the item. If so, the method increments a counter for the node and sets the node as the current node. If not, the method creates a node corresponding to the item and linked to the current node and sets the created node as the current node.
Another aspect of the invention provides a method for identifying patterns from a database of records. Each record has a plurality of items. The method comprises providing an FP-tree corresponding to the database and mining the FP-tree to obtain frequent patterns. Mining the FP-tree comprises, for each frequent item constructing a conditional pattern-base, and constructing a conditional FP-tree from the conditional pattern-base
Han Jiawei
Mao Runying
Pei Jian
Yin Yiwen
Alaubaidi Haythim J.
DB Miner Technology Inc.
Oyen Wiggs Green & Mutala
LandOfFree
Methods and system for mining frequent patterns does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Methods and system for mining frequent patterns, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Methods and system for mining frequent patterns will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3114443