Integrated database and data-mining system

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C706S047000

Reexamination Certificate

active

06324533

ABSTRACT:

FIELD OF THE INVENTION
The present invention generally relates to data processing, and more particularly, to a method and system for efficiently mining data relationships from an integrated database and data-mining system.
BACKGROUND OF THE INVENTION
A recent trend in database applications is the installation of larger and larger data warehouses built around relational database technology in an increasing number of enterprises. There is a great demand for being able to mine nuggets of knowledge from these data warehouses. The initial research on data mining was concentrated on defining new mining operations and developing algorithms for them. Most early mining systems were developed largely on file systems in which specialized data structures and buffer management strategies were devised for each mining algorithm. Coupling data mining with database systems was at best loose, and access to data in a database management system (DBMS) was provided through an Open Database Connectivity (ODBC) interface or a Structured Query Language (SQL) cursor interface. Such an interface is described, for example, in the IBM Intelligent Miner User's Guide, Version 1 Release 1, published by International Business Machines Corp., July 1996. Other interfaces are described by R. Agrawal et al. in the paper “The Quest Data Mining System,” Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oreg., August 1996, and by J. Han et al. in “DMQL: A Data Mining Query Language For Relational Databases,” Proc. of the 1996 SIGMOD Workshop on Research Issues on Data mining and Knowledge Discovery, Montreal, Canada, May 1996.
Recently, researchers have started to focus on issues related to integrating mining with databases, such as proposals to extend the SQL language to support mining operators. For instance, the query language DMQL proposed by Han et al. extends SQL with a collection of operators for mining characteristic rules, discriminant rules, classification rules, association rules, etc. In the paper “Discovery Board Application Programming Interface and Query Language for Database Mining,” Proc. of the 2nd Int'l Conference on Knowledge Discovery and Data Mining, Oregon, August 1996, Imielinski et al. extend M-SQL, which is an extension of the SQL language with a special unified operator to generate and query a whole set of propositional rules. Another example is the mine rule operator proposed by Meo et al. for a generalized version of the association rule discovery problem, described in “A New SQL Like Operator For Mining Association Rules,” Proc. of the 22nd Int'l Conference on Very Large Databases, India, September 1996. Tsur et al. also proposed “query flocks” for data mining using a generate-and-test model, as described in “Query Flocks: A Generalization of Association Rule Mining,” available on the World Wide Web at http://db.stanford.edu/ullman/pub/flocks.ps, October 1997.
The issue of tightly coupling a mining algorithm with a relational database system from the systems point of view was addressed by R. Agrawal et al. in a paper entitled “Developing Tightly-coupled Data Mining Applications On A Relational Database System,” Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Oregon, August 1996. This proposal makes use of user-defined functions (UDFs) in SQL statements to selectively push parts of the application that perform computations on data records into the database system. The objective here was to avoid one-at-a-time record retrieval from the database to the application address space, saving both the copying and process context switching costs. In the paper entitled “KESO: Minimizing Database Interaction,” Proc. of the 3rd Int'l Conference on Knowledge Discovery and Data Mining, August 1997, A. Siebes et al. focus on developing a mining system with minimal database interaction. Another algorithm for finding association rules, SETM, was expressed in the form of SQL queries and described by M. Houtsma et al. in “Set-oriented Mining of Association Rules,” Proc. of the Int'l Conference on Data Engineering, Taiwan, 1995. However, SETM is not efficient and there are no results reported on running it against a relational DBMS.
Therefore, there is still a need of a method for efficiently mining data from an integrated database and data-mining system that has a shorter response time, requires less memory to operate, does not suffer the disadvantages discussed in the Background section.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide an efficient method for extracting data relationships from an integrated database and data-mining system.
Another object of the present invention is to provide a method for mining data relationships from the integrated mining system in the form of queries to SQL engines, and with k-way join, three-way join, subqueries, and group-by operations for counting the itemset support.
Still another object of the present invention is to provide a method for mining data relationships from the integrated mining system in the form of queries to SQL engines enhanced with object-relational extensions (SQL-OR), such as user-defined functions (UDFs) and table functions.
The present invention achieves the foregoing and other objects by providing a method for identifying rules from an integrated database and data-mining system, where the system includes a database in the form of a table of transactions and a query engine. The method includes the steps of: a) performing a group-by query on the transaction table to generate a set of frequent 1-itemsets; b) determining frequent 2-itemsets from the frequent 1-itemsets and the transaction table; c) generating a candidate set of (n+2)-itemsets from the frequent (n+1)-itemsets, where n=1; d) determining frequent (n+2)-itemsets from the candidate set of (n+2)-itemsets and the transaction table using a query; e) repeating steps (c) and (d) with n=n+1 until the candidate set is empty; and f) generating rules from the union of the determined frequent itemsets.
The group-by query preferably includes the steps of counting the number of transactions that contain each item and selecting the items that have a support above a user-specified threshold in determining the frequent 1-itemsets. The support of an item is the number of transactions that contain the item. In determining the frequent 2-itemsets, each 1-itemset is joined with itself and two copies of the transaction table using join predicates. The joining results for a pair of items are grouped together in counting the support of the items in the pair. All 2-itemsets that have a support below a specified threshold are removed from the set of 2-itemsets, resulting in the frequent 2-itemsets.
Preferably, the candidate (n+2)-itemsets are generated using an (n+2)-way join operation. Also, the frequent (n+2)-itemsets are determined using an (n+3)-way join query on the candidate itemsets and (n+2) copies of the transaction table. Alternatively, the frequent (n+2)-itemsets are determined using cascaded subqueries by: a) selecting distinct first items in the candidate itemsets using a subquery; b) joining the distinct first items with the transaction table; c) cascading (n+1) subqueries where each j-th subquery is a three-way join of the result of the last subquery, distinct j items from the candidate itemsets, and the transaction table; and d) using the results of the last subqueries to determine which of the (n+2)-itemsets are frequent.
In generating rules from the union of the frequent itemsets, all items from the frequent itemsets are first put into a table F. A set of candidate rules is created from the table Fusing a table function. These candidate rules are joined with the table F, and filtered to remove those that do not meet a confidence criteria.
In the case where the mining engine is an object-relational engine, the generation of 2-itemsets and (n+1)-itemsets preferabl

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Integrated database and data-mining system does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Integrated database and data-mining system, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Integrated database and data-mining system will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2617632

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.