Aggregations performance estimation in database systems

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06374234

ABSTRACT:

COPYRIGHT NOTICE AND PERMISSION
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice shall apply to this document: Copyright © 1999, Microsoft, Inc.
TECHNICAL FIELD OF THE INVENTION
The present invention pertains generally to computer-implemented databases, and more particularly to summaries of data contained in such databases.
BACKGROUND OF THE INVENTION
Online analytical processing (OLAP) is a key part of most data warehouse and business analysis systems. OLAP services provide for fast analysis of multidimensional information. For this purpose, OLAP services provide for multidimensional access and navigation of data in an intuitive and natural way, providing a global view of data that can be drilled down into particular data of interest. Speed and response time are important attributes of OLAP services that allow users to browse and analyze data online in an efficient manner. Further, OLAP services typically provide analytical tools to rank, aggregate, and calculate lead and lag indicators for the data under analysis.
In this context, a dimension is a structural attribute of a cube that is a list of members of a similar type in the user's perception of the data. For example, a time dimension can consist of days, weeks, months, and years, while a geography dimension can consist of cities, states/provinces, and countries. Dimensions act as indices for identifying values within a multi-dimensional array.
Databases are commonly queried for summaries of data rather than individual data items. For example, a user might want to know sales data for a given period of time without regard to geographical distinctions. These types of queries are efficiently answered through the use of data tools known as aggregations. Aggregations are precomputed summaries of selected data that allow an OLAP system or a relational database to respond quickly to queries by avoiding collecting and aggregating detailed data during query execution. Without aggregations, the system would need to use the detailed data to answer these queries, resulting in potentially substantial processing delays. With aggregations, the system computes and materializes aggregations ahead of time so that when the query is submitted to the system, the appropriate summary already exists and can be sent to the user much more quickly.
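To make the idea of a precomputed summary concrete, the following is a minimal Python sketch, not taken from the patent. The detail rows, the column layout (month, state, sales), and all function names are hypothetical. It contrasts answering a "sales by month" query from the detailed data with answering it from an aggregation materialized ahead of time.

```python
from collections import defaultdict

# Hypothetical detail rows: (month, state, sales). Illustrative data only.
DETAIL = [
    ("Jan", "WA", 100),
    ("Jan", "OR", 50),
    ("Feb", "WA", 70),
    ("Feb", "OR", 30),
]


def sales_for_month_from_detail(rows, month):
    """Answer the query by scanning every detail row at query time."""
    return sum(sales for m, _state, sales in rows if m == month)


def materialize_by_month(rows):
    """Precompute total sales per month, dropping the state dimension."""
    totals = defaultdict(int)
    for month, _state, sales in rows:
        totals[month] += sales
    return dict(totals)


# Built once, ahead of any query.
SALES_BY_MONTH = materialize_by_month(DETAIL)


def sales_for_month(month):
    """Answer the same query with a single lookup in the aggregation."""
    return SALES_BY_MONTH[month]
```

In this toy data the aggregation holds two rows against four detail rows. With realistic data volumes the difference in rows read is what makes the precomputed summary faster to query.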
Calculating these aggregations, however, can be costly, both in terms of processing time and in terms of disk space consumed. Therefore, in many situations, efficiencies can be realized by materializing only selected aggregations rather than all possible aggregations. The aggregations that are materialized or computed should be selected based on the implications of using or not using each aggregation. These implications include, for example, the potential performance gain associated with using a set of selected aggregations.
Some conventional solutions measure potential performance gain by reading and aggregating the detailed data underlying the aggregations. This approach gives an accurate result, but can itself consume considerable computing resources, especially if the aggregations summarize a large amount of detailed data. Further, potential performance gain is often expressed in terms of time or computing resources saved. This information, however, is often of little use without additional information, such as baselines or information about the operating environment. Accordingly, a need continues to exist for a system that can estimate the potential performance gain of using a set of selected aggregations without reading and aggregating detailed data. This potential performance gain should be expressed in an intuitive manner to be of benefit to the user.
SUMMARY OF THE INVENTION
According to various example implementations of the invention, there is provided an efficient system for estimating the potential performance gain associated with using a set of selected aggregations without reading and aggregating the detailed data underlying the aggregations, as described herein below. In particular, the invention provides, among other things, for using aggregation sizes to measure the cost of materializing and maintaining the aggregations and, in turn, the potential benefit of using alternative aggregations.
In one particular implementation, the potential performance gain is estimated by determining a minimum cost TCm and a maximum cost TCf associated with executing the set of queries, as well as a cost TCa associated with executing the set of queries using the set of proposed aggregations. The potential performance gain is calculated as a function of the minimum cost TCm, the maximum cost TCf, and the cost TCa.
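One function of these three costs, stated later in this summary, is the ratio (TCf−TCa)/(TCf−TCm). The Python sketch below shows how such a ratio could express the gain on a scale from 0 to 1. The function name, the argument names, and the range checks are assumptions made for illustration, not part of the patent text.

```python
def performance_gain(tc_min, tc_full, tc_agg):
    """Return (TCf - TCa) / (TCf - TCm) as a fraction between 0 and 1.

    tc_min  -- minimum cost TCm of executing the set of queries
    tc_full -- maximum cost TCf of executing the set of queries
    tc_agg  -- cost TCa using the proposed aggregations
    """
    # Assumed sanity checks: the proposed aggregations can do no better
    # than the minimum cost and no worse than the maximum cost.
    if not tc_min <= tc_agg <= tc_full:
        raise ValueError("expected TCm <= TCa <= TCf")
    if tc_full == tc_min:
        raise ValueError("TCf equals TCm, so the ratio is undefined")
    return (tc_full - tc_agg) / (tc_full - tc_min)
```

Under these assumptions a result of 0 means the proposed aggregations save nothing over the maximum cost, and a result of 1 means they reach the minimum cost. A relative figure of this kind needs no baseline or knowledge of the operating environment to interpret.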
In another implementation, instead of determining TCa, the system determines the benefit of using each aggregation in the set of proposed aggregations to execute the set of queries and sums these benefits over all of the proposed aggregations. The potential performance gain is calculated as the ratio of the resulting sum to the difference between the maximum cost TCf and the minimum cost TCm.
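The summary does not say how the benefit of an individual aggregation is computed. The Python sketch below assumes one plausible definition: the benefit of an aggregation is the reduction in total query cost it yields over the aggregations considered before it, with cost measured in rows read. It also assumes an aggregation can answer a query when it keeps every grouping level the query needs. These definitions, and all names used, are illustrative assumptions.

```python
def incremental_benefits(detail_size, queries, aggregations):
    """Return one benefit per aggregation, in the order given.

    queries      -- list of frozensets of grouping levels (assumed model)
    aggregations -- list of (frozenset of grouping levels, size) pairs
    """
    # Cheapest known cost per query; starts as a scan of the detail data.
    current = [detail_size] * len(queries)
    benefits = []
    for levels, size in aggregations:
        benefit = 0
        for i, query in enumerate(queries):
            # Assumed rule: usable if the aggregation keeps all needed levels.
            if query <= levels and size < current[i]:
                benefit += current[i] - size
                current[i] = size
        benefits.append(benefit)
    return benefits


def gain_from_benefits(benefits, tc_full, tc_min):
    """Ratio of the summed benefits to (TCf - TCm)."""
    if tc_full == tc_min:
        raise ValueError("TCf equals TCm, so the ratio is undefined")
    return sum(benefits) / (tc_full - tc_min)
```

With benefits defined this way, their sum equals the total saving over the maximum cost, so the ratio needs no separate computation of TCa.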
Yet another implementation is directed to a method for estimating the potential performance gain. TCf is determined as the product of the size of the detailed data and the number of queries in the set of queries. TCm is also determined. TCa is determined by finding, for each query in the set, the best aggregation in the proposed set for answering that query. The sizes of these best aggregations are summed, and for each query that no proposed aggregation is sufficiently detailed to answer, the size of the detailed data is added to the sum. The potential performance gain is the ratio of (TCf−TCa) to (TCf−TCm).
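The steps of this method can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. Queries and aggregations are modeled as sets of grouping levels. An aggregation is treated as "sufficiently detailed" when it keeps every level the query needs. The "best" aggregation is taken to be the smallest usable one. TCm is passed in as a parameter because this summary does not say how it is determined. All names are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Levels = FrozenSet[str]
Aggregation = Tuple[Levels, int]  # (grouping levels kept, size in rows)


def can_answer(agg_levels: Levels, query_levels: Levels) -> bool:
    """Assumed test for 'sufficiently detailed': all needed levels are kept."""
    return query_levels <= agg_levels


def estimate_gain(detail_size: int,
                  queries: List[Levels],
                  aggregations: List[Aggregation],
                  tc_min: float) -> float:
    """Estimate the gain (TCf - TCa) / (TCf - TCm) from sizes alone."""
    # TCf: every query is answered by scanning the detailed data.
    tc_full = detail_size * len(queries)

    # TCa: each query uses its best (assumed: smallest) usable aggregation,
    # or the detailed data when no aggregation can answer it.
    tc_agg = 0
    for query in queries:
        usable = [size for levels, size in aggregations
                  if can_answer(levels, query)]
        tc_agg += min(usable, default=detail_size)

    if tc_full == tc_min:
        raise ValueError("TCf equals TCm, so the ratio is undefined")
    return (tc_full - tc_agg) / (tc_full - tc_min)


if __name__ == "__main__":
    # Illustrative numbers only; tc_min=162 is an arbitrary assumed value.
    queries = [frozenset({"month"}), frozenset({"state"}),
               frozenset({"month", "state"})]
    aggs = [(frozenset({"month", "state"}), 100), (frozenset({"month"}), 12)]
    print(estimate_gain(1000, queries, aggs, tc_min=162))
```

In the usage example TCf is 3 × 1000 = 3000 and TCa is 12 + 100 + 100 = 212, so the estimated gain is 2788/2838, or roughly 0.98. No detail rows are read: only the sizes of the detailed data and of the aggregations enter the calculation.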
Still other implementations include computer-readable media and apparatuses for performing these methods. The above summary of the present invention is not intended to describe every implementation of the present invention. The figures and the detailed description that follow more particularly exemplify these implementations.


REFERENCES:
patent: 5918232 (1999-06-01), Pouschine et al.
patent: 5926820 (1999-07-01), Agrawal et al.
patent: 6115714 (2000-09-01), Gallagher et al.
patent: 6205447 (2001-03-01), Malloy
patent: 6282546 (2001-08-01), Gleichauf et al.
Chatziantoniou et al. (IEEE publication, Apr. 2001) discloses the MD-join: An operator for complex OLAP; pp. 524-533.*
Chen et al. (IEEE publication, Mar. 2000) discloses a data-warehouse/OLAP framework for scalable telecommunication tandem traffic analysis, pp. 524-533.*
Flores et al. (IEEE publication, Jul. 2000) discloses characterization of segmentation methods for multidimensional metrics, pp. 524-533.
