Identifying essential statistics for query optimization for...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

Reexamination Certificate

active

06363371

ABSTRACT:

TECHNICAL FIELD
The present invention relates generally to the field of database systems. More particularly, the present invention relates to the field of query optimization for database systems.
BACKGROUND OF THE INVENTION
Computer database systems manage the storage and retrieval of data in a database. A database comprises a set of tables of data along with information about relations between the tables. Tables represent relations over the data. Each table comprises a set of records of data stored in one or more data fields. The records of a table are also referred to as rows, and the data fields of records in a table are also referred to as columns.
A database server processes data manipulation statements or queries, for example, to retrieve, insert, delete, and update data in a database. Queries are defined by a query language supported by the database system. To enhance performance in processing queries, database servers use indexes to help access data in a database more efficiently. Typical database servers comprise a query optimizer to generate efficient execution plans for queries with respect to a set of indexes. Query optimizers typically rely on statistics to derive cost estimates and choose among possible execution plans for a query. The availability of statistics can greatly improve cost estimation and the quality of execution plans chosen by the query optimizer.
Statistics may be created and maintained on a table, an index, a single column of a table, or combinations of columns of a table, although the structure of statistics may vary from system to system. Single column statistics typically comprise a histogram of values in the domain of that column and may include one or more of the following parameters: the number of distinct values in the column, the density of values in the column, and the second highest and the second lowest values in the column. Multi-column statistics typically represent information on the distribution of values over the Cartesian product of the domains in it. As one example, multi-column statistics on (R
2
.c, R
2
.d) may contain information on the joint distribution of values over R
2
.c and R
2
.d. In Microsoft® SQL Server, for example, such multi-column statistics would contain joint density information and a histogram on the leading dimension R
2
.c. The single and multi-column statistics available for a database make cost estimation significantly more accurate and help the query optimizer arrive at better query execution plans. In the absence of statistics, cost estimates can be dramatically different often resulting in a poor choice of the execution plan.
Creating as well as maintaining statistics, however, can incur significant costs in storage, time, and memory, particularly for large databases. The space of single and multi-column statistics can be very large because many combinations of columns are possible. Statistics can therefore consume significant amounts of secondary storage. Also, to remain effective, statistics need to be updated as the data in the database changes. The cost of updating statistics on columns of large tables can be substantial. Updating statistics on a column requires scanning the table for values in that column and sorting the values to produce, for example, a histogram and other statistics. Furthermore, the query optimizer loads all potentially relevant statistics for a query into memory during optimization. Multiple users concurrently running against a database server, for example, can incur significant costs in CPU time and memory for loading statistics.
SUMMARY OF THE INVENTION
An essential statistics identification utility tool attempts to reduce or minimize the overhead associated with statistics by identifying from an initial set of statistics a set of essential statistics that provide a query optimizer with the ability to choose among query execution plans with minimized loss in accuracy as compared to using the initial set of statistics.
A method identifies statistics for use in executing one or more queries against a database. The method may be implemented by computer-executable instructions of a computer readable medium. A database system may perform the method with suitable means.
For the method, an initial set of statistics is identified. A subset of the initial set of statistics equivalent to the initial set of statistics with respect to each query is then identified. The subset of statistics is equivalent to the initial set of statistics if an execution plan for each query using the subset of statistics is the same as an execution plan for that query using the initial set of statistics and/or if a cost estimate to execute each query against the database using the subset of statistics is within a predetermined amount of a cost estimate to execute that query against the database using the initial set of statistics. The predetermined amount may be a predetermined percentage of the cost estimate to execute that query against the database using the initial set of statistics. Also, the predetermined amount may be zero.
The subset of statistics may be identified such that any proper subset of the subset of statistics is not equivalent to the initial set of statistics with respect to each query and/or such that an update cost or size for the subset of statistics is minimized.
The subset of statistics may be identified by removing statistics from the initial set of statistics if the remaining subset of statistics is equivalent to the initial set of statistics with respect to each query.
The one or more queries may be identified from a workload of queries as the one or more queries that have estimated execution costs greater than any other query of the workload and that account for at least a predetermined percentage of a total estimated execution cost for the workload. Each query of the workload may be identified such that the subset of statistics is not equivalent to the initial set of statistics with respect to that query, and another subset of statistics may be identified by removing statistics from the initial set of statistics if the remaining subset of statistics is equivalent to the initial set of statistics with respect to each such identified query.
For each statistic of the initial set of statistics, a respective set of queries may be identified from a workload of queries such that that statistic is potentially relevant to each query in the respective query set and such that each query in the respective query set has estimated execution costs greater than any other potentially relevant query of the workload. For each statistic of the initial set of statistics, whether the initial set of statistics without that statistic is equivalent to the initial set of statistics with respect to each query in the respective query set may then be determined, and, if not, that statistic is included in a first subset of statistics. The one or more queries may then be identified from the workload as each query of the workload such that the first subset of statistics is not equivalent to the initial set of statistics with respect to that query.
The subset of statistics may be identified by identifying a subset of the initial set of statistics, determining whether such an identified subset of statistics is equivalent to the initial set of statistics with respect to each query, and repeating these steps for other subsets of the initial set of statistics. These steps may be repeated until an identified subset of statistics is equivalent to the initial set of statistics with respect to each query. Subsets of the initial set of statistics may be identified in increasing order of update cost or size.
The subset of statistics may be identified by identifying a seed subset of the initial set of statistics, adding a subset of the remaining subset of the initial set of statistics to the seed subset to produce a current subset of statistics, determining whether the current subset of statistics is equivalent to the initial set of statistics with respect to each query, and repeating these steps for other subsets of th

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Identifying essential statistics for query optimization for... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Identifying essential statistics for query optimization for..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Identifying essential statistics for query optimization for... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2845460

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.