Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-09-28
2001-07-17
Black, Thomas (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, 36, C340S870030
Reexamination Certificate
active
06263345
ABSTRACT:
BACKGROUND OF THE INVENTION
This invention relates to the filed of database query optimizers, and more particularly, to an improved method and apparatus for using and manipulating histogram statistics to more accurately estimate the number of rows and unique entry counts where a “UECs” is the number of unique values represented within any particular interval of a histogram, in each histogram interval that will be produced by relational operators and predicates in a database query system. Where relational operators are operators that receive one or more tables as input and produce a new table as an output. Join, Union and Union All are examples of operators that receive two tables as inputs. Group-by and Sort are examples of relational operators that receive only one table as input. Relational operators contain or specify the predicates applied during their execution. In addition a “predicate” is an operation that specifies a comparison between two values, e.g., equal to, greater than, not equal to, greater than or equal to, less than or equal to, less than or is null. The method and apparatus can accurately model run time data flow through the nodes of a query tree, thereby enabling the associated optimizer to accurately select the best plan for a particular performance goal.
Computers have the capability of storing vast amounts of data. For example, computers can store and retain data related to thousands of employees of large multi-national corporations, including the departments in which they work, their employee numbers, salaries, job descriptions, geographical locations, etc. This data is often stored in the form of tables in a relational database. In order to extract selected portions of that data from such large computerized databases, users can present a query to the database system in the form of an SQL statement. For example, an SQL statement may be used to ask the database system to list the names of all employees having employee numbers
1001
to
2000
. A properly structured SQL statement will result in a list of records that satisfies the question or “query.” In this example, the query would produce the names of 1000 employees, assuming that the employees had sequential employee numbers.
Once a user inputs an SQL query into the computer, an SQL compiler operates on the SQL query to develop an efficient way to extract the desired information from the database. Typically, the SQL compiler converts the SQL statement into a number of relational operators stored in computer memory in the form of a query tree. Each node of the tree represents a relational operator, such as a “sort” or “merge” operator. The optimizer portion of the compiler explores a large number of different logically equivalent forms of the query tree, called “plans”, for executing the same query. The optimizer program selects, for example, the plan with the lowest estimated cost to respond to the query, and that plan is then executed. In database parlance, “cost” is usually measured in terms of the amount of computer resources utilized by the computer in executing the SQL statement, for example, the number of I/O's or CPU instructions.
The prior art has focused on various techniques, such as the use of histograms, for developing statistics to describe the distribution of data in the database tables upon which the database programs operate. For example, it has been recognized that gathering accurate statistics about the data in the tables is important to the estimate of predicate selectivity. However, both predicate and relational operators can affect the number of rows and UECs that are returned by an operator as the associated algorithm processes the query. The ability to accurately predict the number of rows and UECs returned by both relational operators and predicates is fundamental to computing the cost of an execution plan. The estimated cost, of course, drives the optimizer's ability to select the best plan. Accordingly, there is a need for a method and apparatus that, not only accurately assembles statistics about the tables of raw data to be processed by the database software, but also for a method and apparatus that can accurately predict the number of rows and UECs for each histogram interval that will be returned by any predicate or relational operator in a query tree.
SUMMARY OF THE INVENTION
The present invention includes a new histogram synthesis method and apparatus for predicting the number of rows of data from a relational database and UECs that will be produced by each predicate and relational operator in a query tree. According to an embodiment of the present invention, histogram statistics about data that is to be presented to an operator are generally synthesized “bottom up” along the query tree from the leaf nodes to the root node. In addition, however, those operators below the right, or inner, child of a Nested Join are also presented histogram information by their parent operator(s). Statistical histograms for the leaf nodes of a query tree are generated from the statistical information derived during a binder phase of query compilation.
Given input statistics for each operand of a relational operator, the present invention merges those input statistics for that operator in a fashion reflecting the actual run time effect of that operator on the data. These merged statistics are descriptive of the predicted output of the operator. Similarly, the present invention determines the predicted effect of predicates on the data by applying the predicates directly to the relevant histogram that represents the data. Therefore, unlike certain traditional techniques, the present invention allows the effects of partially or wholly redundant predicates to be handled without the need for sophisticated logic to detect such redundancies. Moreover, the invention recognizes and accounts for the fact that relational operators can impact histogram statistics in a manner that is independent of the predicates.
At each node of the query tree, the present invention provides an accurate prediction of the number of rows and UECs that will be presented to the relational operator and/or predicate and that will be produced by that relational operator and/or predicate. Query cost model software can then utilize these statistics to accurately predict the cost of a particular plan. Thus, the associated optimizer can select the plan that efficiently accommodates the desired performance goal.
REFERENCES:
patent: 5367675 (1994-11-01), Cheng et al.
patent: 5546570 (1996-08-01), McPherson et al.
patent: 5548755 (1996-08-01), Leung et al.
patent: 5619692 (1997-04-01), Malkemus et al.
patent: 5625815 (1997-04-01), Maier et al.
patent: 5630120 (1997-05-01), Vachey
patent: 5689696 (1997-11-01), Gibbons et al.
patent: 5696686 (1997-12-01), Sanka et al.
patent: 5717911 (1998-02-01), Madrid et al.
patent: 5724570 (1998-03-01), Zeller et al.
patent: 5761654 (1998-06-01), Tow
patent: 5803914 (1998-09-01), Ryals et al.
patent: 5819255 (1998-10-01), Celis et al.
patent: 5822747 (1998-10-01), Graefe et al.
patent: 5870752 (1999-02-01), Gibbons et al.
patent: 5942986 (1999-08-01), Shabot et al.
patent: 5963957 (1999-10-01), Hoffberg
patent: 6012054 (2000-01-01), Seputis
patent: 6021405 (2000-02-01), Celis et al.
Article by Clifford A. Lynch entitled “Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distributions of Column Values” published by University of California, dated 1988 pp. 240-251.
Article by Piatetsky-Shapiro et al. entitled “Accurate Estimation of the Number of Tuples Satisfying a Condition” published by ACM dated, 1984 pp. 256-276.
Article by Haas et al. entitled “Sampling-Based Estimation of the Number of Distinct Values of an Attribute” published by Proceedings of the 21stVLDB Conference, dated 1995 pp. 311-321.
Article by Yannis E. Ioannidis entitled “Universality of Serial Histograms” published by Proceedings of the 19th VLDB Conference, dated 1993 pp. 256-267.
Article by Mackert et al. Entitled “R* Optimizer Validation and Performance Evaluation for Distributed
Celis Pedro
Farrar Christopher M.
Leslie Harry A.
Shak Diana L.
Black Thomas
Compaq Computers Corporation
Fenwick & West LLP
Rones Charles L.
LandOfFree
Histogram synthesis modeler for a database query optimizer does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Histogram synthesis modeler for a database query optimizer, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Histogram synthesis modeler for a database query optimizer will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2450557