Method and system for generating a statistical summary of a...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000

Reexamination Certificate

active

06477534

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates to the field of database systems. More particularly, the present invention relates to a method for generating approximate answers in a large data warehousing environment in response to complex aggregate queries.
2. Description of the Related Art
Traditional query processing has focused solely on providing exact answers to queries in a manner seeking to minimize response time and maximize throughput. Providing an exact answer to a complex query in a large data recording and warehousing environment, however, can take minutes or even hours due to the amount of computation and disk I/O that may be required.
There are a number of scenarios in which a rapidly-obtained approximate answer is desired instead of an exact answer. For example, initial queries during a drill-down query sequence in ad-hoc data mining are used for determining interesting queries. See, for example, J. M. Hellerstein et al., Online aggregation, Proc. ACM SIGMOD International Conf. on Management of Data, pp. 171-182, May 1997, which is incorporated by reference herein. An approximate answer can provide feedback regarding how well a query is posed. An approximate answer can also provide a tentative answer to a query when the base data of a database is unavailable. An approximate answer can be used in a query requesting a numerical answer for which the full precision of the exact answer is not needed, e.g., a total, an average or a percentage for which only the first few digits of precision are of interest. Fast approximate answers can be used in a more traditional role within a query optimizer for estimating plan costs because a fast response time is required without an exact answer.
An approximate query answering system has two key requirements: (1) an accurate estimate of the exact answer, and (2) tight bounds on the confidence of the estimate. To illustrate the obstacles in achieving the requirements of an approximate query answering system, consider a set of uniform random samples of each base relation (referred to herein as base samples) in a database. Intuitively, such a set of base samples is a natural set of synopses (samples) for an approximate query engine. Both the accuracy of the estimate and the spread of confidence bounds strongly depend on the sample size used for generating the set of uniform random samples. Unless certain statistical properties can be guaranteed for the sample, the corresponding bounds are usually particularly pessimistic.
A necessary, but insufficient condition for the join of the base samples to be a uniform random sample of a join on base relations is for the probability for any two joined tuples to be in the join of the base samples to equal the probability for the (same) two tuples to be in a sample of the join on the base relations. Generally, though, the join of two uniform random base samples is not a uniform random sample of the join on the base relations. Consequently, a straightforward sampling approach can produce a poor quality estimation when aggregates on the tuples in multi-way joins are approximated based on a statistical guarantee of the multi-way join and the join output size. In most cases, the non-uniformity introduced by the join significantly degrades the accuracy of the answer and the confidence bounds. Moreover, the join of two uniform random samples typically has very few tuples, even when the join of the base relations has many tuples, thereby leading to both inaccurate answers and poor confidence bounds because the quality of the estimate critically depends on the number of tuples on which the estimate is based.
To show that a multi-way join can produce a poor quality approximation, consider a simple 2-way (equality) join of base samples of two relations R and S on an attribute X shown in FIG.
1
. The letters a and b in
FIG. 1
denote tuples with values a and b for an attribute X. Each segment (denoted as an edge herein) between an a-tuple or a b-tuple on the left-half of
FIG. 1
(R.X) and the a-tuple or b-tuple in the right-half (S.X) depict a tuple that is in the join of R and S. Assume that the probability for each tuple to be in a base sample is 1/r, for a given r>1. As shown in
FIG. 1
, edges a
1
and a
2
are in the join if and only if both a-tuples are selected from R and the one a-tuple is selected from S. Such a sample selection occurs with probability 1/r
3
because three tuples must be selected.
On the other hand, edges a
1
and b
1
are in the join if and only if the four tuples incident to these particular edges are selected. Such a sample selection occurs with a probability of 1/r
4
because four tuples must be selected. This contrasts with the fact that for a uniform random sample of the join of R and S, the probability that both edges a
1
and a
2
are selected equals the probability that both edges a
1
and b
1
are selected. Thus, generally, for any pair of relations joining on an attribute X, any X value that occurs in each relation and occurs more than once in at least one of the relations introduces a bias such that the join of the base samples is not a uniform random sample of the output of the join.
Now consider a second problem that is related to small output sizes. Consider two relations A and B, and base samples comprising 1% of each relation. The size of a foreign key join between relations A and B is equal to the size of relation A. The expected size of the join of the base samples is, however, 0.01% of the size of relation A because for each tuple in A, there is only one tuple in relation B that joins with it, and the probability that the particular tuple in B is in the sample for B is only 1%. Generally, for a k-way foreign key join and k base samples each comprising 1/r of the tuples in their respective base relations, the expected size of the join of the base samples is 1/r
k
of the size of the actual join. The best known confidence interval bounds for approximate join aggregates based on base samples are quite pessimistic, as discussed by P. J. Haas, Large-sample and deterministic confidence intervals for online aggregation, Proc. 9th International Conf. on Scientific and Statistical Database Management, August 1997, and which is incorporated by reference herein. It is generally impossible to produce good quality approximate answers using samples on the base relations alone. Nevertheless, it is critical to overcome this problem because nearly all queries in a warehousing context involve complex queries have a large number of (foreign-key) joins.
Recently, there has been a flurry of work in approximate query answering. See, for example, S. V. Vrbsky et al., Approximate—a query processor that produces monotonically improving approximate answers, IEEE Trans. on Knowledge and Data Engineering, 5(6): 1056-1068, 1993; D. Barbará et al., The New Jersey data reduction report, Bulletin of the Technical Committee on Data Engineering, 20(4):3-45, 1997; P. B. Gibbons et al., Aqua project white paper, Technical report, Bell Laboratories, Murray Hill, N.J., December 1997; P. B. Gibbons et al., Fast incremental maintenance of approximate histograms, Proc. 23rd International Conf. on Very Large Data Bases, pp. 466-475, August 1997; J. M. Hellerstein et al., supra; and P. B. Gibbons et al., New sampling-based summary statistics for improving approximate query answers, Proc. ACM SIGMOD International Conf. on Management of Data, pp. 331-342, Jun. 1998, each of which is incorporated by reference herein. Only the work of Hellerstein et al., supra, has considered the problem of approximate join aggregates, which is an important problem because most non-trivial queries, especially for data warehousing schemas, involve joining two or more tables. For example, 13 of the 17 queries of the TPC-D benchmark involve queries on joins.
Statistical techniques for providing approximate query answering have been applied in databases for more than two decades, but primarily within the context of a query optimizer for selectivity estimation. See, for example, P. G. Selin

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method and system for generating a statistical summary of a... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method and system for generating a statistical summary of a..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and system for generating a statistical summary of a... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2988145

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.