Single pass space efficient system and method for generating...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000, C707S793000

Reexamination Certificate

active

06343288

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to computer database systems, and more particularly to systems and methods for finding quantiles in a data stream.
2. Description of the Related Art
Quantiles, which are elements at specific positions in a sorted data stream or database, are of interest to both database users, designers, and implementers. One reason quantiles are of interest is that they characterize distributions of real world data sets and are less sensitive to outlying data points than are, e.g., the mean value of a data stream, or the variance of a data stream.
As but one example of when a quantile might be useful, a user might want a listing, from a personnel database, of salespeople who are taller than a certain height and who have gross sales above a certain amount. The user would request this information by means of a database query. It is the function of a database management system (dbms) to respond to the query quickly and efficiently. In responding to the query, the dbms typically must reformat the query into a more efficient equivalent query. Then, the dbms evaluates which one of several potential query execution plans would be the most computationally efficient in executing the equivalent query. Because the difference in computational time between an efficient query execution plan and an inefficient plan can be great, it is worthwhile for the dbms to undertake the above-mentioned evaluation.
This is where a knowledge of quantiles in the database can be useful. It happens that in evaluating the efficiency of query execution plans, a dbms relies on statistics that relate to the requested data, and one important statistic is quantiles. To illustrate, suppose in the above example that the amount of gross sales of interest is $500,000, and suppose further that the database contains 100,000 personnel records. If $500,000 is at the 80% quantile of gross sales, the dbms can be assured that at most its response to the query will have 20,000 records, which statistical information is important for generating and evaluating good query plans.
In addition to the above application of quantiles, the ability to determine quantiles has many other applications in the database field. Two such additional applications are database partitioning during parallel processing, and database mining. Thus, the skilled artisan will appreciate that determining quantiles is an important task for many if not most dbms.
Like many other computer tasks, the determination of quantiles must satisfy several practical considerations. Specifically, quantiles should be generated while minimizing the amount of memory space consumed, optimizing computational efficiency, and still producing an exact or at least highly accurate approximate quantile.
First, for computational efficiency it is desirable that the determination of quantiles not require excessive passes over a data stream to sort the data stream. Indeed, requiring only a single pass over a data stream is highly desirable from a computational efficiency viewpoint. Processing data in only a single pass, however, is somewhat challenging in part because no assumptions or guarantees can be made regarding the order of arrival of elements in a data stream or their value distributions. Nevertheless, it is desirable that quantiles be generated in only a single pass without depending on assumptions about the data stream for efficiency or correctness.
Additionally, as stated above the amount of memory required to find quantiles should be minimized. Thus, although one computationally efficient way to find quantiles of a data stream would be to buffer the entire stream in memory and then process the stream, this would require excessive memory and accordingly is not very desirable. Instead, as recognized by the present invention it is desirable to conserve memory, while still promoting computational efficiency.
As also recognized by the present invention, to conserve memory space and at the same time promote computational efficiency, approximate quantiles can be substituted for exact quantiles, depending, of course, on the particular application. For this reason, the present invention recognizes that the accuracy of an algorithm that finds approximate quantiles should be tunable to the level of accuracy required for the application, with its performance degrading gracefully if at all when the accuracy requirements are increased.
In the above-referenced patent application, a method for generating approximate quantiles is disclosed that, unlike the method of Munro et al., in an article entitled “Selection and Sorting with Limited Storage” published in
Theoretical Computer Science
, 12:315-323 (1980), advantageously does not require more than one pass over the data stream and further, unlike the method disclosed in Agrawal et al., in an article entitled “A One-Pass Space-Efficient Algorithm for Finding Quantiles” published in
Proc
. 7
th
Int'l Conf. Management of Data
(1995), advantageously guarantees a bound on the approximation error. The method of the above-referenced patent application does, however, require that the size “N” of the input stream be known a priori.
As recognized by the present invention, in practice the size “N” of the input stream in fact might not be known at the outset. As an example, the input stream might be an intermediate table, the size of which might only be crudely estimated, if at all, prior to quantile computation. When the estimate for “N” is bad, the quantile-generating algorithms of previous methods might fail to provide the required approximation guarantee, or indeed might fail to complete execution altogether.
Fortunately, the present invention understands that a scalable, parallelizable, single-pass algorithm can be provided for generating approximate quantiles within predefined error bounds, even when the size “N” of the input stream is not known beforehand, while minimizing memory size requirements. As set forth more fully below, random, non-uniform sampling of the input stream can be used to achieve this result while minimizing memory space overhead.
SUMMARY OF THE INVENTION
A method is disclosed for determining at least one approximate quantile of a number of elements in a data set in a single pass over the elements while minimizing memory usage and meeting a desired approximation guarantee with a given probability without knowing the number of elements. At least some of the elements may be sampled non-uniformly, and sampled elements are used to fill input buffers. The number and size of the buffers depend at least on the approximation guarantee (and, preferably, the given probability) but not on the number of elements in the data set.
One or more approximate quantiles are output such that the approximate quantiles meet the approximation guarantee with the given probability.
More rigorously, given user-specified approximate quantile &phgr;, user-specified approximation error &egr;, and user-specified probability &dgr;, the present invention computes, in a single pass over a data set of unknown size, an &egr;-approximate &phgr;-quantile with a probability of 1−&dgr;. The &phgr;-quantile of a data set of size N, for &phgr;&egr;[0,1], is defined to be the data element at position ┌&phgr;N┐ in the sorted sequence of the data set. An &egr;-approximate &phgr;-quantile is defined to be any element of the data set whose position lies between the element at position ┌(&phgr;−&egr;)N┐ and the element at position ┌(&phgr;+&egr;)N┐ in the sorted sequence of the data set. As understood herein, several elements of the data set can qualify as an &egr;-approximate &phgr;-quantile. The value &dgr;&egr;┌0,1┐ denotes the probability that the present invention fails to report an &egr;-approximate &phgr;-quantile. Typically, &dgr; lies in the range 0.01 to 0.0001.
From another aspect, the invention is a general purpose computer programmed according to the inventive steps herein to determine a desired ap

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Single pass space efficient system and method for generating... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Single pass space efficient system and method for generating..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Single pass space efficient system and method for generating... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2837451

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.