Method for generating quantiles from data streams

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000, C707S793000, C707S793000, C707S793000, C707S793000

Reexamination Certificate

active

06820090

ABSTRACT:

FIELD OF THE INVENTION
This invention relates to methods for deriving statistical information from a stream of data, and for updating that information.
ART BACKGROUND
There are many practical problems of data collection in which it is useful to summarize large volumes of data in a fast and reliable manner, while preserving as much information as possible. One class of problems of that kind relates to the collection of data that reflect the performance of a communication network. One example from that class of problems is the problem of collecting and summarizing the length of time to complete each of a sequence of transactions that take place on a network. In the case, e.g., of e-mail transaction times, the problem is made more difficult by the fact that the data arrive in small increments at random times, and because it is often desirable to reserve for processing and storing the data an amount of memory that is small relative to the volume of data to be processed.
Those of skill in the art have long been acquainted with the histogram as a means for summarizing statistical data. To create a histogram, the practitioner marks off endpoints along an axis that corresponds to the incoming data; i.e., to the measured values of the statistic that is to be characterized. Below, we will refer to the incoming measured values as scores, and to the corresponding axis as the data axis. Accordingly, the endpoints referred to above are marked off along the score axis. Each pair of successive endpoints defines an interval. The height of the histogram within each interval, measured along a probability axis perpendicular to the data axis, is proportional to the number of scores that fall within that interval. Below, we will find it convenient to refer to each such interval as a bucket.
When a histogram is based on an exhaustive set of data, it can dependably represent the statistical distribution of those data. However, if the histogram is based on only a partial set of data, it might not dependably represent the full population from which that partial set was taken. In particular, a histogram based on an initial portion of a stream of data might differ substantially from a histogram based on a longer initial portion or on a subsequent portion of the data stream.
When a stream of data arrives over time, it is often most convenient to characterize the arriving data by taking an initial sequence of data values, creating a histogram, and then updating the histogram using further data values taken from subsequently arriving data. Such a procedure is especially useful when the amount of computer memory available for processing and storing the data is limited.
The quality of a histogram depends on its ability to model the population of data that it is based on, and on its ability to preserve statistical information about that population. In both of these aspects, the quality of a histogram is affected by the setting of the endpoints that define the respective buckets, and also by the procedure used to update the histogram using later-arriving data.
In the statistical study of network performance, among other fields, there has been a recognized need for methods of processing data streams to characterize the data more reliably without sacrificing useful statistical information.
SUMMARY OF THE INVENTION
We have invented an improved method for acquiring statistical information from data, which may, e.g., be already accumulated data or data that are arriving as a data stream. According to the inventive method, an initial cumulative distribution function (CDF) that characterizes an initial set of data is acquired. The acquisition of this CDF comprises acquiring a set of quantile endpoints that define the CDF. The quantile endpoints are endpoints that could be marked off along the data axis in such a way that a defined fraction of the sampled scores would lie within each corresponding bucket.
At least one additional CDF, which characterizes a further set of data, is also acquired. Information that describes the initial CDF is combined with information that describes one or more additional CDFs, and the result is used to obtain a composite CDF that describes a combined set of data that includes the initial data set and the one or more further data sets. Then, a new set of quantile endpoints is determined, that defines the composite CDF. The sequence of steps described above is repeated at least once more. The previously obtained composite CDF is used as the initial CDF for each repetition of this sequence.


REFERENCES:
patent: 5870752 (1999-02-01), Gibbons et al.
patent: 5991332 (1999-11-01), Lomp et al.
patent: 6108658 (2000-08-01), Lindsay et al.
patent: 6229843 (2001-05-01), Lomp et al.
patent: 6272168 (2001-08-01), Lomp et al.
Donald L, Iglehart “Simulating stable Stochastic system, VI: Quantile Estimation”, ACM, vol. 23, No. 2, pp. 347-360, Apr. 1976.*
Athanassios N. Avramidis, “Variance Reduction for quantile estimation via correlation induction”, ACM, pp. 572-576.*
Chen et al., “Simulation based estimation of quantiles”, ACM, pp. 428-434.*
Avramidis et al., “Correlation-induction techniques for estimating quantiles in Simulation experiments”, ACM, pp-268-277.*
Au-Yeung et al., “Efficient apprximation of response time densities and quantiles in stochastic models”, ACM, pp. 151-155.*
Trong Wu, “An Accurate computation of the hypergeometric distribution function”, ACM, vol. 19, No. 1, Mar. 1993, pp. 33-43.*
Ding et al., “Statiscal estimation of the cumulative distribution function for power dissipation in VLSI circuits”, ACM, pp. 1-6.*
F. Chen, D. Lambert, J.C. Pinheiro, “Incremental Quantile Estimation For Massive Tracking”, Proceedings of the ACM-KDD 2000 Conference, pp. 516-522, 2000.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method for generating quantiles from data streams does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method for generating quantiles from data streams, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method for generating quantiles from data streams will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3358125

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.