Method and apparatus for dynamically counting large itemsets

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details

Classification: C707S793000

Status: active

Patent number: 06185559

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to a method and apparatus for analyzing database records to uncover relationships between various items contained within the records, and in particular to a method and apparatus for counting all sets of items dynamically within a database of individual records using fewer passes over the records than conventional data mining methods.
BACKGROUND OF THE INVENTION
In recent years, the progress of information automation has expanded the computer databases of modern businesses to the point where a blizzard of numbers, facts and statistics is collected and stored, yet little information of any significance is extracted from the massive amounts of data. The problem is that conventional computer databases are powerful in the manner in which they house data, but unimaginative in the manner of searching through the data to extract useful information. Simply stated, the use of computers in business and scientific applications has generated data at a rate that has far outstripped the ability to process and analyze it effectively.
To address this problem, a practice known as data “mining” is being developed to identify and extract important information through patterns or relationships contained in available databases. Humans naturally and quickly “mine” important concepts when interpreting data. A person can scan a magazine's table of contents and easily select the articles related to a subject of interest. A person's ability to extract and identify relationships goes beyond the simple recognizing and naming of objects; it includes the ability to make judgments based on overall context or subtle correlations among diverse elements of the available data. Computers, on the other hand, cannot efficiently and accurately undertake the intuitive and judgmental interpretation of data. Computers can, however, undertake the quantitative aspects of data mining because they can quickly and accurately perform certain tasks that demand too much time or concentration from humans. Computers, using data mining programs and techniques, are ideally suited to the time-consuming and tedious task of breaking down vast amounts of data to expose categories and relationships within the data. These relationships can then be intuitively analyzed by human experts.
Data mining techniques are being used to sift through immense collections of data such as marketing, customer sales, production, financial and experimental data to “see” meaningful patterns or regularities and identify what is worth noting and what is not. For example, the use of bar-code scanners at supermarket checkouts typically results in millions of electronic records which, when mined, can show purchasing relationships among the various items shoppers buy. Analysis of large amounts of supermarket basket data (the items purchased by an individual shopper) can show how often items are purchased together, such as, for example, fruit juice, children's cereals and cookies. The results can be useful for decisions concerning inventory levels, product promotions, pricing, store layout or other factors which might be adjusted to changing business conditions. Similarly, credit card companies, telephone companies and insurers can mine their enormous collections of data for subtle patterns within thousands of customer transactions to identify risky customers or even fraudulent transactions as they are occurring. Data mining can also be used to analyze the voluminous number of alarms that occur in telecommunications and networking alarm data.
The size of the data set is essential in data mining: the larger the database, the more reliable the relationships which are uncovered. Large databases, unfortunately, have more records to sift through and require many passes through the records, or multiple passes through each record, to uncover any regularities. The number of items for which relationships are sought is also important to the efficiency of data mining operations: the larger the number of items, the more passes through the records are required to extract reliable information.
Consider data mining of supermarket basket data as an example. A supermarket contains a set of items (its products), of which each shopper purchases a subset. In analyzing the volumes of subsets, it is desirable to find the “significant” purchases or large itemsets (sets of items, such as all transactions that included the purchase of fruit juice). A large itemset contains items (fruit juice, cookies) that appear in at least a preset number of subsets (customer purchases). In data mining parlance, this number is called the support threshold.
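In code terms, the support of an itemset is simply the number of transactions that contain it. A minimal Python sketch of this idea (the basket data and threshold below are invented for the example):

```python
# Hypothetical market-basket data: each transaction is the set of
# items one shopper purchased.
transactions = [
    {"fruit juice", "cookies", "cereal"},
    {"fruit juice", "cookies"},
    {"milk", "cereal"},
    {"fruit juice", "milk", "cookies"},
]

def support(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

min_support = 2  # the support threshold

# {fruit juice, cookies} appears in 3 of the 4 baskets, so with a
# threshold of 2 it is a large itemset.
print(support({"fruit juice", "cookies"}, transactions))  # 3
```

An itemset is "large" exactly when its support meets the threshold.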
The best known method for finding large itemsets is the Apriori method described in the publication, Fast Algorithms for Mining Association Rules, by R. Agrawal and R. Srikant, Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994. To discover large itemsets, the Apriori method makes multiple passes over the transaction records (see FIG. 1). In the first pass, the Apriori method counts the support of individual items and determines which of them are large, i.e., have minimum support. In each subsequent pass, the method starts with a seed set of itemsets found to be large in the previous pass. This seed set is used for generating new potentially large itemsets, called candidate itemsets, and the actual support for these candidate itemsets is counted during the pass over the data. At the end of the pass, the candidate itemsets which are actually large are determined, and they become the seed for the next pass. This process continues, pass after pass over the data, until no new large itemsets are found.
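The pass structure just described can be sketched as follows. This is a simplified Python illustration, not the patented method itself; the subset-based candidate prune is omitted for brevity (counting during the pass still discards small candidates), and the basket data is invented:

```python
def apriori(transactions, min_support):
    """Simplified sketch of Apriori's level-wise passes."""
    # Pass 1: count individual items and keep those with minimum support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s for s, c in counts.items() if c >= min_support}
    all_large = set(large)

    k = 2
    while large:
        # Seed: itemsets found large in the previous pass. Join them
        # to form the candidate k-itemsets for this pass.
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # One full pass over the data counts every candidate's support.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # Candidates that are actually large seed the next pass.
        large = {c for c, n in counts.items() if n >= min_support}
        all_large |= large
        k += 1
    return all_large

# Invented basket data for illustration:
transactions = [
    {"fruit juice", "cookies", "cereal"},
    {"fruit juice", "cookies"},
    {"milk", "cereal"},
    {"fruit juice", "milk", "cookies"},
]
result = apriori(transactions, min_support=2)
print(len(result))  # 5: four single items plus {fruit juice, cookies}
```

Note that each iteration of the `while` loop corresponds to one full pass over the database, which is exactly the cost the rest of this background discussion is concerned with.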
The Apriori method generates the candidate itemsets to be counted in a pass by using only the itemsets found to be large at the end of the previous pass, without considering the transactions encountered during its pass over the database. The basic notion is that any subset of a large itemset must itself be large. Therefore, the candidate itemsets can be generated by joining large itemsets having fewer items, and deleting those that contain any subset that is not large. This procedure fixes the final set of candidate itemsets for each pass over the database.
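The join-and-prune candidate generation can be illustrated directly. A minimal Python sketch, where the large 2-itemsets in `L2` are invented for the example:

```python
from itertools import combinations

def generate_candidates(large_prev, k):
    """Join large (k-1)-itemsets, then prune any candidate with a
    (k-1)-subset that is not itself large, exploiting the fact that
    any subset of a large itemset must be large."""
    joined = {a | b for a in large_prev for b in large_prev
              if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in large_prev
                   for s in combinations(c, k - 1))}

# Hypothetical large 2-itemsets:
L2 = {frozenset({"a", "b"}), frozenset({"a", "c"}),
      frozenset({"b", "c"}), frozenset({"a", "d"})}
C3 = generate_candidates(L2, 3)
# {a, b, c} survives: all three of its 2-subsets are large.
# {a, b, d} and {a, c, d} are pruned: {b, d} and {c, d} are not in L2.
```

Only the surviving candidates need to be counted during the next pass over the data.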
As can be seen, the shortcoming of the Apriori method is that as the size of the database increases, or the number of items searched increases, so does the number of passes over the database records, from top to bottom. In very large databases, the database activity overhead due to the repeated passes can increase the execution time, and therefore reduce the efficiency of this method, to an undesirable level.
A more recent data mining technique that attempts to avoid the limitations of the Apriori method is disclosed by H. Toivonen in the paper, Sampling Large Databases for Association Rules, Proceedings of the 22nd VLDB Conference, Bombay, India, 1996. Toivonen presents database mining methods which attempt to make only one full pass over the database. The Toivonen method randomly picks a sample of records from the database, uses it to determine the relationships or patterns on the assumption that they probably hold for the entire database, and then verifies the results against the rest of the database.
The Toivonen method partitions the database into sections small enough to be handled in main memory, thereby reducing I/O activity to the database (which normally resides on disk). An evaluation for large itemsets in a random part is then performed in main memory without further database activity. The method uses the random sample and makes a series of passes over the data to determine which items are frequently found. Each pass builds on the previous collection of frequently found items until the method efficiently finds a superset of the collection of frequently found sets. In order to increase accuracy, the superset is determined by applying this “level-wise” method on the sample in main memory, and by using a lowered frequency threshold. The method then uses the rest of the database …
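A sampling scheme in this spirit can be sketched as follows. This is a loose Python illustration of the sample-then-verify idea only, not Toivonen's exact algorithm (in particular, the negative-border check that guards against missed itemsets is omitted); the sampling size, lowering factor, and data are all invented:

```python
import random
from itertools import combinations

def frequent_in(transactions, min_count):
    """Tiny level-wise miner for the sketch: singletons and pairs only."""
    counts = {}
    for t in transactions:
        for k in (1, 2):
            for combo in combinations(sorted(t), k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}

def sample_and_verify(transactions, min_support, sample_size, seed=0):
    # 1. Draw a random sample small enough to fit in main memory.
    sample = random.Random(seed).sample(transactions, sample_size)
    # 2. Mine the sample with a lowered threshold: scaled to the sample
    #    size, then reduced further (the 0.8 factor is an arbitrary
    #    choice here) to make missing a truly large itemset less likely.
    lowered = max(1, int(min_support * sample_size / len(transactions) * 0.8))
    candidates = frequent_in(sample, lowered)
    # 3. One full pass over the whole database verifies which candidates
    #    actually meet the real support threshold.
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_support}

# Invented data: 12 transactions, threshold 6.
transactions = [
    {"fruit juice", "cookies", "cereal"},
    {"fruit juice", "cookies"},
    {"milk", "cereal"},
    {"fruit juice", "milk", "cookies"},
] * 3
found = sample_and_verify(transactions, min_support=6, sample_size=4)
```

The verification pass guarantees that everything returned is truly large; what sampling alone cannot guarantee is that nothing large was missed, which is why Toivonen's full method adds further machinery.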
