Multi-dimensional database record compression utilizing...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

[ 0.00 ] – not rated yet Voters 0 Comments 0

Details Multi-dimensional database record compression utilizing... Multi-dimensional database record compression utilizing...

: 2000-06-29
: 2003-10-14
: Metjahic, Safet (Department: 2171)
: Data processing: database and file management or data structures
: Database design
: Data structure types

: C707S793000
: Reexamination Certificate
: active
: 06633882
: ABSTRACT:

FIELD OF THE INVENTION
The present invention concerns data mining of large database and more particularly to an improved method and apparatus for answering queries based on the data contained in large databases.
BACKGROUND ART
A database management system (DBMS) stores data reliably and retrieves the data based on a data access language, typically SQL (Structured Query Language). One major use of database technology is to help individuals and organizations obtain fast/flexible reports on performance of a business. Another major use of DBMS is to help make decisions and generate reports based on the data contained in the database.
An important aid to the users making decisions based on data in the database is a multidimensional view of data. For example, in a payroll application, the data could be represented with three dimensions such as salary, years of employment and vested dollar amounts in a company pension plan. With this view of the data, it is easy to ask queries such as query that finds the average number of years of employment of employees who have a salary greater than 100,000 dollars. Alternatively, it might be desurable to specify a range of years of employment and ask for the total contribution in dollars for people falling in that range. The multidimensional approach facilitates the view of the data and also does not distinguish between the dimensions in which ranges can be specified and the dimensions on which sums and aggregations are performed. Further, users can specify ranges in different dimensions if they need to view data in more detail. Thus, in the above example, users could ask for the average salary of employees who have more than 100,000 dollars in vested company benefits and who have worked for the company for less than 5 years. Other applications that benefit from the multidimensional view of data include inventory control, sales and marketing, user, or customer databases.
One of the emerging applications of databases is online analytical processing (OLAP) which enables users to ask decision support queries in a more natural and efficient manner than SQL-based database systems. A commonly used approach for queries that use OLAP is the so-called data cube approach which views the data records stored in the database as points in a multidimensional space. Users ask queries using the data cube approach by specifying in the query the dimensions or record attributes of interest and ranges for those dimensions.
One commonly asked OLAP query is a count query that determines the number of data records falling within a selected range. A second common query is the aggregate query which totals the values for the records falling within a selected range. By adding or deleting a dimension to a query a user can drill down or roll up over the multi-dimensional space of the database.
Transact-SQL (T-SQL) is a superset of standard SQL and forms part of SQL server version 6.5, a product of Microsoft Corporation, assignee of the present invention. Transact SQL provides two special operators, CUBE and ROLLUP, that allow multidimensional analysis to be projected by the database server where the analysis can be optimized as part of the execution plan of the query. A discussion of the CUBE and ROLLUP operators is found in the book “Inside SQL Server 6.5 ” by Soukup, Copyright 1997, Microsoft Press, pp 326-340 which is incorporated herein by reference.
As data warehousing becomes more popular, OLAP is gaining in importance as a primary interface to evaluating data contained in the data warehouse. Most successful data mining applications include reporting systems having fast query response mechanisms. Most corporations require decision support and would benefit from improved technology to help in making decisions based upon rapidly gathered and organized data.
Specific applications include marketing aided by querying past sales data, inventory management based on querying the inventory database, hierarchical visualization of employee data and many other decision support applications. The applications of OLAP are not limited to large organizations. The concept of the data cube could be viewed as a hierarchical organization of data in general and this could be an easily understood interface that provides summarized and organized access to large volumes of data to every type of user.
One goal of a database management system is efficient support for queries on data cubes. The prior art has (a) dealt with providing exact answers to queries and (b) dealt mainly with discrete valued dimensions and (c) treated the dimension on which the aggregation or average is performed as a dimension that is distinct from the dimensions of the data cube. Because prior art techniques concentrated on providing exact answers, the amount of data that was stored was large and expensive disk accesses were required, thus making query processing expensive. Not being able to efficiently pose queries on continuous valued dimensions limits the types of applications and the scope of data cubes. Treating all dimensions symmetrically and allowing aggregation on every dimension is not possible with such prior art systems.
Being able to quickly query the data presented in the multidimensional format is crucial to the effective usage of a data cube. Typically, answers to the queries on the data cube are not required with perfect accuracy and users are willing to trade off accuracy for rapid answers to these queries.
SUMMARY OF THE INVENTION
The present invention enables the effective summarization of large amounts of data and provides rapid answers to decision support queries on data cubes. A principle feature of the invention is the recognition that data can be viewed as points in a multidimensional space and hence can be effectively summarized using clustering techniques.
An exemplary embodiment of the present invention concerns method and apparatus for querying a database containing data records stored in the database. A cluster model is formed from data records that are stored in the database. An initial cluster model having an initial probability distribution describing data records in the database is thus provided. By comparing the initial probability distribution with a representative sample of records in the database a sufficiency of this initial probability distribution is determined. The cluster model is then modified to provide an adjusted cluster model that better characterizes the data in the database. The cluster modification is performed by adjusting the cluster model to reduce discrepancies between the initial probability distribution and data sampled from the database.
Once the adjusted cluster model is provided, a sum or a count of data records from the database falling within specified ranges of the multiple dimensions is determined by integrating a functional representation based on the probability distribution of the adjusted cluster model over the ranges. In one embodiment of the invention the cluster model is adjusted by increasing the cluster number of the cluster model and reclustering at least a portion of the data in the database.
Choice of a starting point for a new cluster is performed by finding a region within an attribute space of the data records of high discrepancy between the initial cluster model and a sample of data gathered from the database. Since this region is not well modeled by initial cluster model, it serves as a candidate point for a new cluster and will result in a shifting of existing clusters and formation of a new cluster near the region of high discrepancy.
Records in the database have one or more attributes containing continuous, nondiscrete data values. Each of these cluster attributes is characterized by a Gaussian functional depiction of the cluster that can be integrated (using numerical approximation techniques) over the data space of the records.
Experience with data clustering has indicated that clustering models do not well model so called outlying data records from removed from any cluster mean. In one embodiment of the invention the functional represe

Affiliated with

Fayyad Usama

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Shanmugasundaram Jayavel

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Also associated with

Filipczyk Marc

Examiner

[ 0.00 ] – not rated yet Voters 0 Comments 0

Metjahic Safet

Examiner

[ 0.00 ] – not rated yet Voters 0 Comments 0

Microsoft Corporation

Corporate Assignee

[ 0.00 ] – not rated yet Voters 0 Comments 0

Watts Hoffmann Co. LPA

Law Firm

[ 0.00 ] – not rated yet Voters 0 Comments 0

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Multi-dimensional database record compression utilizing... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Multi-dimensional database record compression utilizing..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Multi-dimensional database record compression utilizing... will most certainly appreciate the feedback.

Rate now

Comments { 0 }

Profile ID: LFUS-PAI-O-3134103

All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.

Canada

Charities
Companies
MP Candidates
Patents
Employee Salary Disclosure

World

Places of the World
Scientific Papers

United States

Banks
Companies
Counties
Patents
Employee Salary Disclosure