Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-08-25
2001-08-21
Black, Thomas (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C706S019000
Reexamination Certificate
active
06278989
ABSTRACT:
TECHNICAL FIELD
The present invention relates generally to the field of database systems. More particularly, the present invention relates to the field of histogram construction for database systems.
BACKGROUND OF THE INVENTION
Computer database systems manage the storage and retrieval of data in a database. A database comprises a set of tables of data along with information about relations between the tables. Tables represent relations over the data. Each table comprises a set of records of data stored in one or more data fields. The records of a table are also referred to as rows, and the data fields of records in a table are also referred to as columns.
A database server processes data manipulation statements or queries, for example, to retrieve, insert, delete, and update data in a database. Queries are defined by a query language supported by the database system. To enhance performance in processing queries, database servers use indexes to help access data in a database more efficiently. Typical database servers comprise a query optimizer to generate efficient execution plans for queries with respect to a set of indexes. Query optimizers generate execution plans based on histograms and other statistical information on the column(s) of the table(s) referenced in the queries. Query optimizers typically rely on histograms on selected columns to estimate selectivities of queries.
Database servers typically create histograms on the columns of tables over which indexes are constructed. Database servers may also create histograms on columns that do not have indexes to enhance the accuracy of estimates by query optimizers. Creating histograms, however, can incur significant costs in time and memory, particularly for large databases. Although data values of selected columns may be sampled, for example, to generate approximate histograms, the accuracy of such histograms depends on the size of the sample. Sampling too little data limits the ability of query optimizers to generate relatively accurate execution plans while sampling too much data consumes more time and memory.
SUMMARY OF THE INVENTION
Using adaptive random sampling with cross-validation helps determine when enough data of a database has been sampled to construct histograms on one or more columns of one or more tables of the database within a desired or predetermined degree of accuracy.
In accordance with the present invention, a method may be used for constructing a histogram in a database system comprising a database. The method may be implemented in the form of program modules or computer-executable instructions stored on a computer readable medium.
For the method, a histogram is created using an initial sample of data values from the database. The histogram is updated using an additional sample of data values from the database until the histogram is within a predetermined degree of accuracy. The updated histogram may be used to obtain a query optimization estimate.
The histogram may be defined such that the histogram comprises a predetermined number of bins for storing data values, and a predetermined number of data values may be designated for storage in each bin. The predetermined number of data values designated for all but one of the predetermined number of bins may be the same.
In updating the histogram, an error amount for each of a plurality of bins of the histogram may be determined, and the histogram may be determined to be within the predetermined degree of accuracy if the error amount for each bin is less than or equal to a predetermined threshold.
The additional sample of data values may be partitioned over the histogram. The data values of the additional sample may be partitioned into a plurality of bins in accordance with the histogram. An error in distribution of the additional sample of data values over the histogram may be determined. The difference in the number of data values of the additional sample in each bin from a predetermined number of data values designated for the bin may be determined, and the error in distribution based on the determined differences may be determined. The predetermined number of data values designated for each bin may be approximately equal to the total number of data values of the additional sample divided by the number of bins. The error in distribution may be determined as the maximum of the determined differences or as the maximum fractional difference between the number of data values of the additional sample in any bin as compared to the predetermined number of data values designated for that bin. The histogram may be updated using the additional sample of data values. The histogram may be repeatedly updated in this manner until the error in distribution of the additional sample of data values over the histogram is less than or equal to a predetermined threshold indicating the histogram is within the predetermined degree of accuracy.
A plurality of distinct ranges defined by a sequence of distinct data value separators of the histogram may be determined, and the data values of the additional sample may be partitioned into the plurality of distinct ranges. The error in distribution may be determined based on the partitioning of the data values of the additional sample into the plurality of distinct ranges. The error in distribution may be determined as the maximum fractional difference between the fraction of data values of the additional sample in any distinct range as compared to the fraction of data values of all prior samples in that distinct range.
The initial sample may be a random sample of data values from one relation of data of the database approximately linearly proportional in number to the square root of the total number of data tuples from the one relation of data of the database. The initial sample may also be a random sample of data values approximately linearly proportional in number to the number of bins of the histogram. The initial sample may also be a random sample of data values approximately inversely proportional in number to the square of the predetermined threshold. Each i
th
additional sample may be a random sample of data values approximately 2
i−1
time(s) the number of data values in the initial sample. The initial sample of data values and each additional sample of data values may be in units of disk blocks each having a predetermined number of data values.
REFERENCES:
patent: 5689696 (1997-11-01), Gibbons et al.
patent: 5870752 (1999-02-01), Gibbons et al.
patent: 5960435 (1999-09-01), Rathmann et al.
patent: 5974536 (1999-10-01), Richardson
patent: 6052689 (2000-04-01), Muthukrishnan et al.
Chaudhuri, Surajit, et al., “An Efficient, Cost-Driven Index Selection Tool for Microsoft SQL Server,” Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, pp. 146-155 (Aug. 25-29, 1997).
Chaudhuri, Surajit, et al., “AutoAdmin ‘What-if’ Index Analysis Utility,” Proceedings of ACM SIGMOD, Seattle, Washington, pp. 367-378 (Jun. 1-4, 1998).
Chaudhuri, Surajit, et al., “Random Sampling for Histogram Construction: How Much is Enough?” Proceedings of ACM SIGMOD, Seattle, Washington, pp. 436-447 (Jun. 1-4, 1998).
Finkelstein, S., et al., “Physical Database Design for Relational Databases,”ACM Transactions on Database Systems(TODS), vol. 13, No. 1, pp. 91-128 (Mar. 1988).
Gibbons, Phillip B., et al., “Fast Incremental Maintenance of Approximate Histograms,” Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, pp. 466-475 (Aug. 25-29, 1997).
Haas, Peter J., et al., “Sampling-Based Estimation of the Number of Distinct Values of an Attribute,” Proceedings of the 21st International Conference on Very Large Data Bases (VLDB), Zurich, Switzerland, pp. 311-322 (1995).
Haas, Peter J., et al., “Sequential Sampling Procedures for Query Size Estimation,” Proceedings of ACM SIGMOD International Conference on Management of Data, San Diego, California, pp. 341-350 (Jun. 2-5, 1992).
Hou, Wen-Chi, et al., “Statistical Estimators for Relational Algebra Expressions,”
Chaudhuri Surajit
Motwani Rajeev
Narasayya Vivek
Black Thomas
Coby Frantz
Microsoft Corporation
Watts, Hoffmann, Fisher & Heinke Co. L.P.A.
LandOfFree
Histogram construction using adaptive random sampling with... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Histogram construction using adaptive random sampling with..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Histogram construction using adaptive random sampling with... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2511252