Learning from empirical results in query optimization

Classification: Data processing: database and file management or data structures – Database design – Data structure types

Type: Reexamination Certificate


Classification code: C707S793000

Status: active

Patent number: 06763359

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular, to learning from empirical results in query optimization.
2. Description of Related Art
(Note: This application references a number of different publications as indicated throughout the specification by mnemonics enclosed in brackets, e.g., [Authorxx], wherein Author is the author's name (or abbreviation thereof) and xx is the year of publication. A list of these different publications with their associated mnemonics can be found in Section 6 entitled “Bibliography” in the “Detailed Description of the Preferred Embodiment.” Each of these publications is incorporated by reference herein.)
Computer systems incorporating Relational DataBase Management System (RDBMS) software using a Structured Query Language (SQL) interface are well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO).
In an RDBMS system, queries typically specify what data is to be accessed, rather than how that data is to be accessed. An SQL Query Compiler, and specifically an optimizer function of the SQL Query Compiler, automatically determines the appropriate way to access and process the data referenced in a single SQL query. This is done by considering many possible query execution plans (QEPs), evaluating the estimated cost of each plan, and choosing the cheapest plan in terms of estimated execution cost.
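As a rough illustration of this choose-the-cheapest-plan step, the following Python sketch (with hypothetical plan descriptions and invented cost values, not any particular optimizer's implementation) simply selects the candidate QEP with the lowest estimated cost:

```python
# Minimal sketch: pick the candidate QEP with the lowest estimated execution cost.
# Plan descriptions and cost values below are invented for illustration.
from dataclasses import dataclass

@dataclass
class QueryExecutionPlan:
    description: str       # e.g. "index scan on ORDERS, then nested-loop join"
    estimated_cost: float  # cost units derived from estimated cardinalities

def choose_cheapest_plan(candidates: list[QueryExecutionPlan]) -> QueryExecutionPlan:
    """Return the candidate QEP with the lowest estimated cost."""
    return min(candidates, key=lambda plan: plan.estimated_cost)

plans = [
    QueryExecutionPlan("table scan + hash join", 1250.0),
    QueryExecutionPlan("index scan + nested-loop join", 430.0),
]
print(choose_cheapest_plan(plans).description)   # index scan + nested-loop join
```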
The estimated execution cost is largely dependent upon the number of rows that will be processed by each operator in the QEP. Estimating the number of rows, or cardinality, after one or more predicates have been applied has been the subject of much research for over 20 years [SAC+79, Gel93, SS94, ARM89, Lyn88]. Typically, this estimate relies on statistics of database characteristics, beginning with the number of rows for each table, multiplied by a filter factor or selectivity for each predicate, derived from the number of distinct values and other statistics on columns. The selectivity of a predicate P effectively represents the probability that any row in the database will satisfy P.
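The following short Python sketch illustrates this textbook estimate; the table size and predicate selectivities are invented for illustration:

```python
# Sketch of the cardinality estimate described above: start from the table's
# row count and multiply by the selectivity of each applied predicate.
def estimate_cardinality(table_rows: int, predicate_selectivities: list[float]) -> float:
    cardinality = float(table_rows)
    for selectivity in predicate_selectivities:
        cardinality *= selectivity
    return cardinality

# 1,000,000-row table, an equality predicate on a column with 50 distinct
# values (selectivity 1/50), and a range predicate assumed to keep 10% of rows.
print(estimate_cardinality(1_000_000, [1 / 50, 0.10]))   # 2000.0
```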
While query optimizers do a remarkably good job of estimating both the cost and the cardinality of most queries, many assumptions underlie this mathematical model. Examples of these assumptions include:
Currency of information: The statistics are assumed to reflect the current state of the database, i.e. that the database characteristics are relatively stable.
Uniformity: The data values within a column are assumed to be uniformly distributed. Although histograms deal with skew in values for “local” selection predicates (i.e., those on a single table), there are no products available that exploit them for joins.
Independence of predicates: Selectivities for each predicate are calculated individually and multiplied together, even though the underlying columns may be related, e.g., by a functional dependency. While multi-dimensional histograms address this problem for local predicates, again they have never been applied to join predicates, aggregation, etc. Applications common today have hundreds of columns in each table and thousands of tables, making it impossible to know on which subset(s) of columns to maintain multi-dimensional histograms.
Principle of inclusion: The selectivity for a join predicate X.a=Y.b is typically defined to be 1/MAX(|a|, |b|), where |a| denotes the number of distinct values of column a and |b| denotes the number of distinct values of column b. This implicitly assumes the “principle of inclusion”, i.e., that each value of the smaller domain has a match in the larger domain (which is frequently true for joins between foreign keys and primary keys).
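A brief Python sketch of the last two assumptions, using invented column statistics, shows how individual selectivities are combined under independence and how the inclusion-based join selectivity is computed:

```python
# Sketch of the independence assumption (multiply individual selectivities)
# and the principle of inclusion for a join predicate X.a = Y.b, whose
# selectivity is taken as 1 / MAX(|a|, |b|). All statistics are invented.
import math

def combined_selectivity(selectivities: list[float]) -> float:
    """Independence assumption: the conjunction's selectivity is the product."""
    return math.prod(selectivities)

def join_selectivity(distinct_a: int, distinct_b: int) -> float:
    """Principle of inclusion: 1 / MAX(|a|, |b|)."""
    return 1.0 / max(distinct_a, distinct_b)

# Two local predicates assumed independent, keeping 25% and 50% of the rows:
print(combined_selectivity([0.25, 0.5]))   # 0.125 (may be far off if the columns are correlated)
# Join between a 1,000-distinct-value column and a 50-distinct-value column:
print(join_selectivity(1_000, 50))         # 0.001
```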
When these assumptions are invalid, significant errors in the cardinality, and hence cost, estimates result, causing sub-optimal plans to be chosen. The primary cause of major modeling errors is the cardinality estimate on which costs depend. Cost estimates might be off by 10 or 15 percent, at most, for a given cardinality, but cardinality estimates can be off by orders of magnitude when their underlying assumptions are invalid or uncertain. Although there has been considerable success in using histograms to detect and correct for data skew [IC91, PIHS96, PI97], and in using sampling to gather up-to-date statistics [HS93, UFA98], there has to date been no comprehensive approach to correcting all modeling errors, regardless of origin.
Much of the prior literature on cardinality estimates has utilized histograms to summarize the data distribution of columns in stored tables, for use in estimating the selectivity of predicates against those tables. Recent work has extended one-dimensional equi-depth histograms to more sophisticated and accurate versions [PIHS96] and to multiple dimensions [PI97]. This classical work on histograms concentrated on the accuracy of histograms in the presence of skewed data and correlations by scanning the base tables completely, at the price of high run-time cost. The work in [GMP97] deals with the necessity of keeping histograms up-to-date at very low cost. Instead of computing a histogram on the base table, it is incrementally derived and updated from a backing sample of the table, which is always kept up-to-date. Updates of the base table are propagated to the sample and can trigger a partial re-computation of the histogram, but there is no attempt to validate the estimates from these histograms against run-time actual statistics.
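To make the histogram discussion concrete, here is a minimal Python sketch of a one-dimensional equi-depth histogram and a crude range-selectivity estimate derived from it; the column data, bucket count, and estimation rule are simplified assumptions, not the method of any cited paper:

```python
# Sketch of a one-dimensional equi-depth histogram: each bucket holds roughly
# the same number of rows, and a range predicate's selectivity is estimated
# from the fraction of buckets it overlaps.
def build_equi_depth_histogram(values: list[int], num_buckets: int) -> list[tuple[int, int]]:
    """Return (low, high) boundaries of buckets containing ~equal row counts."""
    ordered = sorted(values)
    per_bucket = len(ordered) // num_buckets
    buckets = []
    for i in range(num_buckets):
        lo = ordered[i * per_bucket]
        hi = ordered[min((i + 1) * per_bucket, len(ordered)) - 1]
        buckets.append((lo, hi))
    return buckets

def estimate_range_selectivity(buckets: list[tuple[int, int]], lo: int, hi: int) -> float:
    """Crude estimate: fraction of buckets overlapping the range [lo, hi]."""
    overlapping = sum(1 for b_lo, b_hi in buckets if b_hi >= lo and b_lo <= hi)
    return overlapping / len(buckets)

column = [1] * 500 + list(range(2, 502))        # skewed column: half the rows are 1
hist = build_equi_depth_histogram(column, 10)
print(estimate_range_selectivity(hist, 1, 1))   # 0.5 -- the skew toward value 1 is captured
```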
[CR94] and [AC99] are the first works to monitor cardinalities in query executions and exploit this information in future compilations. In [CR94], the result cardinalities of simple predicates after the execution of a query are used to adapt the coefficients of a curve-fitting formula. The formula approximates the value distribution of a column instead of employing histograms for selectivity estimates. In [AC99], the authors present a query feedback loop, in which actual cardinalities gleaned from executing a query are used to correct histograms. Multiple predicates can be used to detect correlation and update multi-dimensional histograms [BCG01]. This approach effectively deals with single-table predicates applied while accessing a base table, but the paper does not deal with join predicates, aggregation, and other operators, nor does it specify how the user is supposed to know on which columns multi-dimensional histograms should be created.
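The following much-simplified Python sketch conveys the flavor of such a feedback loop; it is not the specific technique of [CR94], [AC99], or the present invention, and the predicate signature and correction-factor scheme are invented for illustration:

```python
# Simplified feedback-loop sketch: compare the optimizer's estimated
# cardinality with the actual cardinality observed at run time, and keep an
# adjustment factor that future estimates for the same predicate can use.
adjustments: dict[str, float] = {}   # predicate signature -> correction factor

def record_feedback(predicate: str, estimated: float, actual: float) -> None:
    """After executing a query, remember how far off the estimate was."""
    if estimated > 0:
        adjustments[predicate] = actual / estimated

def adjusted_estimate(predicate: str, estimated: float) -> float:
    """At the next compilation, scale the raw estimate by the learned factor."""
    return estimated * adjustments.get(predicate, 1.0)

record_feedback("orders.status = 'shipped'", estimated=2_000, actual=180_000)
print(adjusted_estimate("orders.status = 'shipped'", 2_000))   # 180000.0
```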
Another research direction focuses on dynamically adjusting a QEP after the execution has begun, by monitoring data statistics during the execution (dynamic optimization). In [KDeW98], the authors introduce a new statistic collector operator that is compiled into the plan. The operator collects the row stream cardinality and size and decides whether to continue or to stop the execution and re-optimize the remainder of the plan. Query scrambling in [UFA98] is geared towards the problem of distributed query execution in wide area networks with uncertain data delivery. Here, the time-out of a data-shipping site is detected and the remaining data-independent parts of the plan are re-scheduled until the problem is solved. Both solutions deal with dynamic re-optimization of (parts of) a single query, but they do not save and exploit this knowledge for the next query optimization run.
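As a rough illustration of the monitoring idea, the sketch below (with an invented threshold and function name, not the operator of [KDeW98]) compares the cardinality actually observed at a point in the plan with the optimizer's estimate and flags when re-optimization might be warranted:

```python
# Sketch of the kind of check a statistics-collector operator might perform:
# signal re-optimization when actual and estimated cardinality diverge by more
# than a chosen factor. The 10x threshold is an arbitrary illustrative choice.
def should_reoptimize(estimated_rows: float, actual_rows: float, threshold: float = 10.0) -> bool:
    """Return True if actual and estimated cardinality differ by more than `threshold`x."""
    if estimated_rows <= 0:
        return actual_rows > 0
    ratio = actual_rows / estimated_rows
    return ratio > threshold or ratio < 1.0 / threshold

print(should_reoptimize(estimated_rows=1_000, actual_rows=250_000))   # True
print(should_reoptimize(estimated_rows=1_000, actual_rows=1_200))     # False
```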
In light of the above, there is a need in the art for an effective and comprehensive technique for query optimizers to learn from any modeling mistake at any point in a QEP. There is also a need for such learning optimizers to automatically validate cost estimates against actual costs incurred in the execution of queries. The use of validation would allow models of QEPs to be adjusted for better optimization of future queries.
