Method and system for histogram determination in a database

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06691099

ABSTRACT:

BACKGROUND AND SUMMARY
The present invention relates to the field of computer systems. More particularly, the invention relates to a method and system for database optimization.
A “query” is a statement or collection of statements that is used to access a database. Specialized query languages, such as the structured query language (“SQL”), are often used to interrogate and access a database. Many types of queries include at least the following elements. First, the query identifies the database object(s) being accessed (e.g., one or more named database tables). Second, if the query accesses two or more database objects, it specifies the link between the objects (e.g., a join condition or column). Third, the typical query defines selection criteria, often referred to as a matching condition, filter, or predicate. Lastly, a query may define which fields in the database object are to be displayed or printed in the result.
Optimization is the process of choosing an efficient way to execute a query statement. Many different ways are often available to execute a query, e.g., by varying the order or procedure in which database objects and indexes are accessed to execute the query. The exact execution plan or access path that is employed to execute the query can greatly affect how quickly or efficiently the query statement executes.
Cost-based optimization is an approach in which the execution plan is selected by considering available access paths to determine the lowest cost approach to executing the query. In one approach, cost-based optimization consists of the following steps: (1) generating a set of potential execution plans for the database statement to be executed; (2) estimating the cost for each execution plan; and (3) comparing the costs of the execution plans to identify the execution plan having the lowest cost. Conceptually, the term “cost” relates to the amount of a given resource or set of resources needed to process an execution plan. Examples of such resources include I/O, CPU time, and memory. Various measures may be used to identify the execution plan having the lowest cost. For example, the cost-based approach may be used to identify the execution plan providing either the best throughput or the best response time.
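The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Plan` class, the cost weights, and the candidate plans are hypothetical and do not come from the patent or from any particular database system.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """Hypothetical execution plan with pre-estimated resource usage."""
    name: str
    io_cost: float   # assumed unit: number of block reads
    cpu_cost: float  # assumed unit: abstract CPU work units


def estimate_cost(plan, io_weight=1.0, cpu_weight=0.1):
    # Step 2: fold the resource estimates into a single cost figure.
    # The weights are illustrative assumptions, not tuned values.
    return io_weight * plan.io_cost + cpu_weight * plan.cpu_cost


def choose_plan(plans):
    # Step 3: compare the costs and keep the cheapest plan.
    return min(plans, key=estimate_cost)


# Step 1: a (hypothetical) set of candidate plans for one statement.
candidates = [
    Plan("full_table_scan", io_cost=1000.0, cpu_cost=500.0),
    Plan("index_range_scan", io_cost=120.0, cpu_cost=900.0),
]
best = choose_plan(candidates)
print(best.name)
```

Changing the weights changes which resource dominates the comparison, which is one way to express a preference for throughput or for response time.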
Many database optimizers use statistics to calculate the “selectivity” of predicates and to estimate the cost of performing database operations. Statistics quantify characteristics of database and schema objects, such as the data distribution and storage characteristics of tables, columns, indexes, and partitions. Selectivity refers to the proportion or fraction of a database object corresponding to a query predicate. An optimizer uses the selectivity of a predicate to estimate the cost of a particular access method and to determine optimal join order.
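As a small illustration of the definition above, selectivity can be computed exactly by scanning a column and taking the fraction of values that satisfy the predicate. The function name and the sample column are assumptions made for this sketch; a real optimizer would estimate this fraction from statistics rather than scan the data.

```python
def selectivity(values, predicate):
    """Fraction of values that satisfy the predicate (0.0 for an empty column)."""
    if not values:
        return 0.0
    matching = sum(1 for v in values if predicate(v))
    return matching / len(values)


# Hypothetical column: 90 rows hold the value 1, ten rows hold 2..11.
column = [1] * 90 + list(range(2, 12))

sel_eq_1 = selectivity(column, lambda v: v == 1)  # highly unselective predicate
sel_gt_5 = selectivity(column, lambda v: v > 5)   # highly selective predicate
print(sel_eq_1, sel_gt_5)
```

A predicate with low selectivity (few matching rows) tends to favor index access, while one that matches most rows tends to favor a full scan.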
Statistics should be gathered on a regular basis to provide the optimizer with needed information about schema objects. Significant costs may be incurred to collect and maintain statistics for database objects. To reduce this collection cost and improve performance, many database systems use data sampling to reduce the amount of data that must be collected to provide statistics used by the optimizer. With data sampling, only a portion of the rows within a database table is accessed to generate a set of statistics for the entire table or column. The results of the data sampling are thereafter scaled upward to extrapolate the statistics values for the entire population. However, different data distributions may require different sample sizes in order to obtain accurate statistics. If a too-small sample size is selected, then the statistics may be inaccurate, which could lead to sub-optimal execution plans and poor query performance. If a too-large sample size is selected, then resources are wasted to collect more data than is needed to provide accurate statistics. Consequently, it is desirable to use only the minimal sample size needed for accurate statistics collection.
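The sample-then-scale idea can be sketched as follows. This is a minimal illustration that assumes simple random sampling without replacement; the function name, the `sample_fraction` parameter, and the fixed seed are hypothetical and are not taken from the patent.

```python
import random


def estimate_count(rows, predicate, sample_fraction, seed=0):
    """Estimate how many rows match the predicate from a random sample.

    Assumes rows is a non-empty list and 0 < sample_fraction <= 1.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample_size = max(1, int(len(rows) * sample_fraction))
    sample = rng.sample(rows, sample_size)
    matches = sum(1 for r in sample if predicate(r))
    # Scale the sample count upward to the size of the full population.
    return matches * len(rows) / sample_size


# Hypothetical skewed column: 900 rows with value 1, 100 rows with other values.
rows = [1] * 900 + list(range(2, 102))
estimate = estimate_count(rows, lambda v: v == 1, sample_fraction=0.1)
print(estimate)  # an estimate near 900; the exact figure depends on the sample
```

With a smaller `sample_fraction` the estimate for a skewed column becomes noisier, which is the accuracy-versus-cost trade-off described above.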
In addition to statistics, optimizers often use data value histograms to select an optimal execution plan. A data value histogram is a structure that provides estimates of the distribution of data values in a database object. A histogram partitions the data object values into a set of individual “buckets”, so that all values corresponding to a given range fall within the same histogram bucket. The histogram provides information that is helpful in determining the selectivity of a predicate that appears in a query.
In a height-balanced histogram, each bucket of the histogram corresponds to an equal number of rows in a table. The boundaries of the buckets shrink or grow so that all buckets maintain the same number of entries. The useful information provided by the histogram is the range of values that corresponds to each bucket, e.g., the endpoints for each bucket of the histogram. Consider a column C with values between 1 and 100 in which the column data is uniformly distributed. FIG. 1a shows a height-balanced histogram plotted for this column having ten buckets. The number of rows in each bucket of the histogram is one-tenth the total number of rows in the table. Since the data values are evenly distributed, the endpoints of the buckets are also evenly spaced.
Now consider a second column having 100 rows for which column data values are not evenly spaced, in which ninety rows contain the value “1” and the other ten rows contain a column value between 2 and 100. FIG. 1b shows this column plotted in a height-balanced histogram of ten buckets. Since ninety percent of the rows have the value “1”, nine of the ten buckets in the histogram of FIG. 1b also correspond to the value “1”. Thus, it can be seen that nine of the ten buckets in the histogram of FIG. 1b have endpoints that end in the number “1”. The last bucket corresponds to the ten rows in the column having data values between “2” and “100”. In operation, such a histogram provides an optimizer with instant knowledge of the selectivity of particular values of a column. This selectivity information can be used, for example, to determine whether either a full table scan or an index access provides the most efficient path to satisfying a query against the database table corresponding to the histogram.
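The two example columns can be reproduced with a short Python sketch that computes bucket endpoints for a height-balanced histogram. The function names are hypothetical, the column contents are assumed to match the description (100 rows each, with the ten non-"1" values chosen arbitrarily as 10, 20, ..., 100), and the endpoint-share estimate is a simplified heuristic for illustration, not the method claimed by the patent.

```python
def height_balanced_endpoints(values, num_buckets):
    """Return the largest value in each of num_buckets equal-sized buckets."""
    if len(values) < num_buckets:
        raise ValueError("need at least one row per bucket")
    ordered = sorted(values)
    n = len(ordered)
    # Bucket i ends at the row in position (i * n) // num_buckets, 1-based.
    return [ordered[(i * n) // num_buckets - 1] for i in range(1, num_buckets + 1)]


def endpoint_share(endpoints, value):
    """Rough selectivity hint: share of buckets whose endpoint equals value."""
    return endpoints.count(value) / len(endpoints)


# Uniform column: one row for each value from 1 to 100.
uniform = list(range(1, 101))
# Skewed column: ninety rows of 1, plus ten assumed values between 2 and 100.
skewed = [1] * 90 + list(range(10, 101, 10))

uniform_ends = height_balanced_endpoints(uniform, 10)
skewed_ends = height_balanced_endpoints(skewed, 10)
print(uniform_ends)  # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(skewed_ends)   # [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]
print(endpoint_share(skewed_ends, 1))  # 0.9
```

A value that repeats as the endpoint of several buckets signals a popular value, which is the skew information an optimizer can use when weighing a full table scan against index access.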
Other types of histograms also exist. For example, another histogram used by optimizers is the width-balanced histogram, in which column data is divided into a number of fixed, equal-width ranges and the histogram is organized to count the number of values falling within each range.
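For comparison, a width-balanced histogram over the same kind of skewed data can be sketched as below. The function name, the range bounds, and the sample column are assumptions made for this illustration only.

```python
def width_balanced_counts(values, low, high, num_buckets):
    """Count values per equal-width range; assumes low <= v <= high for all v."""
    width = (high - low) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that v == high falls in the last bucket, not past it.
        idx = min(int((v - low) / width), num_buckets - 1)
        counts[idx] += 1
    return counts


# Hypothetical skewed column: ninety rows of 1, plus 10, 20, ..., 100.
skewed = [1] * 90 + list(range(10, 101, 10))
counts = width_balanced_counts(skewed, low=1, high=100, num_buckets=10)
print(counts)  # [91, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Here the skew shows up as one very tall bucket, whereas the height-balanced form keeps bucket heights equal and expresses the skew through repeated endpoints.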
A histogram may not always provide an appreciable benefit. For example, a histogram may not be useful for a data set having uniform data distribution, since it can be assumed that all data within that set are equally distributed and therefore the histogram will not provide any additional useful information. If a histogram is desired, a significant amount of resources may be needed to collect, maintain, and use histograms. Therefore, it makes sense to only create, store, and/or use a histogram when such a histogram provides benefits greater than the expense of the histogram. However, conventional database systems typically rely upon the skill and knowledge of individual database administrators to manually decide whether histograms should or should not be collected for columns in the database. While guidelines may be provided to assist this decision-making, this manual process by administrators often leads to inconsistent and erroneous decisions resulting in the collection and storage of unneeded histograms, or the failure to collect histograms that could provide more efficient query processing.
The present invention provides a method and system for determining when to collect histograms. In an embodiment, the invention provides a mechanism for automatically deciding when to collect histograms upon request from the user. This decision is based on the columns the user is interested in, the role these columns play in the queries as submitted to the system, and the underlying distribution for these columns, e.g.,
