Query optimization through the use of multi-column...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

Reexamination Certificate

active

06272487

ABSTRACT:

A portion of the disclosure of this patent document contains material which is subject co copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to a database management system for optimizing queries, and more specifically for determining an estimation for a number of qualified rows of a query (i.e., a selectivity value), e.g., for use in determining optimized access paths.
2. Description of the Related Art
Databases are computerized information storage and retrieval systems. A relational database management system (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. A DBMS is structured to accept commands to store, retrieve, and delete data. One widely used and well known set of commands is called the Structured Query Language (SQL).
Relational databases are organized into tables which consist of rows and columns of data. The rows are formally called tuples. A database will typically have many tables and each table will typically have multiple columns. The tables are typically stored on random access storage devices (DASD) such as magnetic or optical disk drives for semi-permanent storage.
Tables are at the center of a relational database engine; and a major goal of the underlying query compiler is to provide a suite of mechanisms so that user data can be easily stored and efficiently manipulated.
An important function of query optimization is estimating the number of qualified rows accurately. This estimation is used to determine an optimized access path by the DBMS to the data. An inherent problem occurs during this estimation process when the query has predicates (local or join) on multiple columns of a table. Most database optimizers make an assumption that there is no relationship between those predicates; and they evaluate the predicates independently when estimating the number of qualified rows. This independence assumption, however, may, in many cases, be incorrect and may cause a very inaccurate qualifying row estimate. The following example illustrates this problem.
SELECT*FROM INVENTORY_TABLE
WHERE CURR_LOCATION=‘LOC1’
AND ORIG_LOCATION=‘LOC1’
Number of rows in INVENTORY_TABLE=500,000
Number of distinct values in CURR_LOCATION=50
Number of distinct values in ORIG_LOCATION=50
Selectivity is the percentage of rows that would qualify. Selectivity of “CURR_LOCATION=‘LOC1’” is evaluated to be=“0.02” (i.e., one out of every fifty rows would qualify). Selectivity of “ORIG_LOCATION=‘LOC1’” is evaluated to be=“0.02”. The number of qualified rows from INVENTORY_TABLE is evaluated to be:
(0.02*0.02)*500000=200
In the previous example the optimizer assumed that there was no relationship between the two predicates in the query and evaluated them as if they were independent. In this case, however, assume that most items in the INVENTORY_TABLE have the same value for both the CURR_LOCATION and ORIG_LOCATION columns. In this case the actual number of rows that qualified would be much larger than the estimated number.
One type of statistic that takes into consideration multiple columns is called “FULLKEYCARD”. A “FULLKEYCARD” concatenates all of the columns in an index for indicating the number of distinct key values in an index. It has been previously used by DB2/MVS and other RDBMS vendors. While FULLKEYCARD does help with column correlation in limited circumstances, it has its limitations. For example, FULLKEYCARD can only be used for an index that has the same set of predicates that are in the query. Since it cannot be applied to non-indexed columns, it is not very well suited to solve the problem of column correlation discussed above.
SUMMARY OF THE INVENTION
It is therefore an object of this invention to improve performance for queries that have predicates on multiple columns of a table.
It is a further object of this invention to take into consideration the relationship between predicates when estimating the number of qualified rows from columns in a query.
It is a further object of this invention to use multi-column statistics to compute an estimate of the number of qualified rows when a query has predicates on multiple columns of a table.
This invention reduces the problem caused by column correlation during query optimization by removing the independence assumption when a new type of multi-column statistic is available. The system, method, and program of this invention collects multi-column statistics by relational DBMSs, and uses these statistics during query optimization to obtain an estimate of the number of qualifying rows when a query has predicates on multiple columns of a table.
The DBMS collects meaningful statistical data from the table of a database by analyzing the actual data in the database. The DBMS then stores the statistics in the DBMS system catalogs, i.e., tables, and applies this data to determine in advance a number of qualifying rows in order to further optimize the query based on (i.e., dependent upon) this number of estimated qualifying rows.
A database optimizer calculates a combined predicates filter factor when these predicate's columns are correlated. The optimizer chooses among point, linear or polygonal statistics.
More specifically, the system, method, and program of this invention collects multi-column statistics to reflect a relationship among multiple columns of a table. A multi-column cardinality statistic is collected by concatenating columns and counting the number of distinct concatenated values. A multi-column frequent value statistic is collected by concatenating columns and determining a frequency of the concatenated values. A multi-column linear quantile statistic is collected by dividing the data of multiple columns into sub-ranges where each sub-range has approximately an even distribution of data, and determining a frequency and cardinality of each sub-range. A multi-column polygonal quantile statistic is collected by dividing the data of multiple columns into sub-spaces where each sub-space contains approximately the same number of tuples, and determining a frequency and cardinality of each sub-space. Although only one sub-space can be used, accuracy will increase with increasingly more sub-spaces.
These statistics are stored in a table such as the system catalogs of the database management system.
The system catalog is accessed for the stored multi-column cardinality statistic or for the multi-column frequent value for a query having equal predicates. A selectivity value is determined as the inverse of the cardinality statistic; or the frequency of the frequent value statistic if there are multi-column frequent value statistics for the same columns as the equal predicates in the query, and the literals in the predicates match the literals in the stored multi-column frequent value. If the literals in the predicates do not match the literals in the stored multi-column frequent value, then a selectivity value is determined as:
(1
-
SUM(Frequency of all
multi-column frequent values))
(“multi-column” cardinality)
-
(number of “multi-column” frequent values
.
The system catalog is accessed for the stored multi-column linear quantile statistic for a query having a single range predicate and at least one equal predicate to determine the selectivity value for the predicates of the query. For predicates that are completely satisfied by only one of the sub-ranges, the selectivity value is the frequency. Otherwise, the selectivity is determined by the selectivity of the fully-qualified sub-ranges plus a selectivity of partially qualified sub-ranges. This selectivity determination translates a boundary of a sub-range by concat

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Query optimization through the use of multi-column... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Query optimization through the use of multi-column..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Query optimization through the use of multi-column... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2514274

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.