Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-06-30
2004-02-10
Metjahic, Safet (Department: 2171)
C707S793000, C707S793000, C707S793000
active
06691120
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to data mining of normalized databases, and more particularly to a system, method and computer program product for transforming normalized data records with multidimensional attributes into row-entity-integrity tables for use by a data-mining tool.
2. Related Art
Traditional data warehousing uses only partially normalized data models. Typical data warehousing models include star, snowflake or constellation type schema. Such schema commonly utilize a large central “fact table”, and a series of look-up tables for multidimensional attributes. Detailed data for the multidimensional attributes can be aggregated at various levels, depending on the particular need. However, there is still only one level of relation in typical data warehousing models. By storing the warehoused data in this form, traditional data warehousing reduces required storage space, while minimizing difficulties associated with data mining in a normalized database containing multidimensional attributes.
The aforementioned schema work well for traditional data warehousing applications, such as those deployed in the finance, marketing and retail sectors. In these applications, the main facts (tracked attributes) rarely change. Moreover, the tracked attributes almost never change in dimension. For example, attributes such as name, address, phone, age, sex, account number, item purchased, etc. are constants within the data model. This constant nature of the data allows optimization of traditional data warehousing schema for query and analysis. The query and analysis tools are also optimized for the particular application.
The manufacturing industry, on the other hand, has not implemented general data warehousing and analysis methods because the attributes measured in a discrete or process-oriented manufacturing system can, and do, vary over time and between product families. For example, manufacturers of disk drives track thousands of attributes ranging from time and place of manufacture to product family to head resistance. Many of these tracked attributes are multidimensional, and many change their dimensionality over time, or even cease to exist as a tracked attribute. As new technologies are introduced, the tracked attributes change. This is known as having a “slowly moving dimension.”
When a tracked attribute changes, index lines into the database tables, which help in querying the database, must be reconstructed. As the quantity of data being stored in modern data warehousing systems continues to grow, this re-indexing becomes problematic. For example, manufacturers of disk drives commonly manufacture many tens of thousands of disk drives each day. For each of these drives, manufacturers typically record and track thousands of attributes, many of which are multidimensional. Additionally, the technology that goes into each disk drive commonly changes significantly every nine to twelve months, and the test programs used for the drives commonly change on a weekly basis, both of which require attribute changes and thus re-indexing. Over a number of years, the amount of data and the number of attributes involved make the re-indexing task exceptionally time consuming and expensive. Thus, modern data warehousing is moving toward using more normalized data models, such as third normal form, to further reduce the use of storage space, to minimize re-indexing, and to reduce modification anomalies caused by ongoing changes to the tracked attributes.
Data-mining tools are software-based data-analysis methods for finding interesting patterns in large volumes of data. Data mining commonly uses predefined algorithms to look for these patterns in the data. Typical predefined data-mining algorithms include clustering algorithms, correlation analysis, association, decision trees, and neural networks. Traditional data-mining tools require that data be input as a flat table (or flat file), i.e. a single two-dimensional table. Moreover, traditional data-mining tools assume that the flat table input is a row-entity-integrity (REI) table. An REI table is one in which each row is guaranteed unique, and each row represents a distinct and separate data item. Data-mining tools make this assumption about their input in order to enable the use of simple probability analyses in their data-mining algorithms.
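The REI property described above can be illustrated with a minimal Python sketch. The function name is_rei, the column names, and the sample rows are hypothetical and are not taken from the patent; the sketch only assumes that a flat table is held as a list of dictionaries and that the caller names the columns identifying the data item.

```python
from collections import Counter


def is_rei(rows, entity_key):
    """Return True if every row is unique and each entity has exactly one row.

    rows: a flat table as a list of dicts (hypothetical representation).
    entity_key: column names that identify the distinct data item.
    """
    # Condition 1: no two rows are identical.
    whole_rows = Counter(tuple(sorted(row.items())) for row in rows)
    if any(count > 1 for count in whole_rows.values()):
        return False
    # Condition 2: each entity is represented by exactly one row.
    entities = Counter(tuple(row[col] for col in entity_key) for row in rows)
    return all(count == 1 for count in entities.values())


# Hypothetical drive-level table: one row per drive.
drives = [
    {"serial": "D001", "family": "X", "site": "A"},
    {"serial": "D002", "family": "X", "site": "B"},
]

# Hypothetical head-level table: several rows per drive.
heads = [
    {"serial": "D001", "head": 0, "resistance": 41.2},
    {"serial": "D001", "head": 1, "resistance": 40.7},
]

drives_rei_per_drive = is_rei(drives, ["serial"])
heads_rei_per_drive = is_rei(heads, ["serial"])
heads_rei_per_head = is_rei(heads, ["serial", "head"])
```

Whether a table counts as REI depends on which data item the analysis treats as the entity: the head-level table passes the check when the entity is a head, and fails it when the entity is a drive.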
Some current data-mining tools provide Open DataBase Connectivity (ODBC) through which data from a database can be imported and analyzed. However, because the data tracked in many modern data warehouses exist at different levels of aggregation, the data imported by these ODBC tools cannot be provided directly to the data-mining algorithms for analysis without causing problems.
Using traditional data-mining tools with traditional data warehousing systems is difficult and time consuming if the data has multidimensional attributes. Because traditional data-mining tools cannot accept a normalized database with multi-dimensional attributes as input, a person commonly serves as the intermediary between the data warehouse and the data mining. Typically, the person generates queries to select portions of the data and output it as a flat table. This flat table is then given as input to the data-mining tool. However, because the data-mining tool makes assumptions about the input that are not true, its output must always be carefully reviewed for non-sequiturs, i.e. observations about the data which are not relevant or logical.
FIG. 1 is a block diagram illustrating an exemplary normalized database and a prior art database view for use by a data-mining tool. Referring now to FIG. 1, a normalized database 110 comprises multiple tables and a normalized data record 120. Each table in the normalized database 110 has attributes associated with it. For example, the parent table has attributes PK, A1, A2, A3, A4, and A5. Attribute PK is labeled as such because it serves as a primary key for the parent table. The primary key PK uniquely identifies each row in the parent table. Moreover, because each row in the parent table represents a distinct and separate data item, the parent table, by itself, is an REI table and is thus in a form suitable for input to standard data-mining tools.
Child table R, on the other hand, contains multidimensional attributes. The primary key for child table R is compound (or composite) because it includes both a foreign key attribute FK and a local key attribute LK. For example, if FK is a serial number of a particular disk drive, and LK is a head for the particular disk drive, then each attribute D1 through D4 is a data measure for the disk drive, aggregated at the head level. If one wished to perform data mining on child table R at the drive level (which is to say that each drive is considered the distinct and separate data item), then child table R is not an REI table and is not in a form suitable for input to standard data-mining tools.
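The compound key of child table R can be sketched as follows. The attribute names FK, LK and D1 through D4 come from the description above; the serial numbers and measurement values are invented for illustration only.

```python
from collections import Counter

# Hypothetical rows of child table R. (FK, LK) is the compound primary key:
# FK is the drive serial number, LK is the head on that drive.
child_r = [
    {"FK": "SN1", "LK": 0, "D1": 1.0, "D2": 2.0, "D3": 3.0, "D4": 4.0},
    {"FK": "SN1", "LK": 1, "D1": 1.1, "D2": 2.1, "D3": 3.1, "D4": 4.1},
    {"FK": "SN2", "LK": 0, "D1": 0.9, "D2": 1.9, "D3": 2.9, "D4": 3.9},
]

# Count how often each compound key and each drive serial number occurs.
compound_keys = Counter((row["FK"], row["LK"]) for row in child_r)
rows_per_drive = Counter(row["FK"] for row in child_r)

# The compound key identifies every row, so the table is row-unique
# at the head level.
compound_key_is_unique = all(n == 1 for n in compound_keys.values())

# At the drive level a drive may own several rows, so a row no longer
# stands for one distinct drive.
one_row_per_drive = all(n == 1 for n in rows_per_drive.values())
```

In this sample the compound key is unique while drive SN1 owns two rows, which is the situation the paragraph describes: the table is well formed at the head level but is not an REI table at the drive level.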
This is further illustrated by a database view 140, which can be a standard database query, a specialized database script, or the like. A database view is generally a database script, which creates either on the fly, or permanent tables that exist in a denormalized form. These tables can contain actual values, or simply indices into the normalized database 110. When the normalized data record 120 is collected from the normalized database 110, an output data record 150 is created, but this output data record 150 is not an REI data record for the level of aggregation found in the parent table. Data values x1 through x9 (or indices to these values) are repeated unnecessarily, but more importantly, each row does not represent a distinct and separate data item. The overrepresentation of some attributes can cause the data-mining tool to find significant patterns where there are none.
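The overrepresentation effect can be reproduced with a small sketch using Python's standard sqlite3 module. The table names, the single attribute A1, the single measure D1, and all values are assumptions made for illustration; FIG. 1 itself is not reproduced here, and the join below is only one plausible form of a denormalizing view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (PK TEXT PRIMARY KEY, A1 TEXT);
    CREATE TABLE child_r (FK TEXT, LK INTEGER, D1 REAL, PRIMARY KEY (FK, LK));
    INSERT INTO parent VALUES ('SN1', 'site_a'), ('SN2', 'site_b');
    INSERT INTO child_r VALUES
        ('SN1', 0, 1.0), ('SN1', 1, 1.1), ('SN1', 2, 0.9), ('SN2', 0, 1.2);
""")

# A denormalizing view: every child row carries a copy of its parent's values.
view_rows = conn.execute(
    "SELECT p.PK, p.A1, c.LK, c.D1 "
    "FROM parent p JOIN child_r c ON c.FK = p.PK "
    "ORDER BY p.PK, c.LK"
).fetchall()

parent_rows = conn.execute("SELECT PK, A1 FROM parent ORDER BY PK").fetchall()


def share(rows, column, value):
    """Fraction of rows whose given column equals value."""
    return sum(1 for row in rows if row[column] == value) / len(rows)


# Half of the drives come from site_a ...
true_share = share(parent_rows, 1, "site_a")
# ... but the view repeats SN1 once per head, inflating site_a's share.
view_share = share(view_rows, 1, "site_a")
```

With these sample rows, one of the two drives is from site_a, yet three of the four view rows carry that value. A tool that treats each view row as a separate drive would therefore see a skewed frequency for A1.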
Moreover, when separate but related attributes are aggregated at different levels, a database view will vary the amount of representation of a data item depending on the manner in which the
Aldridge Bruce E.
Durrant Douglas J.
Lyon & Lyon LLP
Metjahic Safet
NCR Corporation
Nguyen Cindy