Sampling for database systems

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000, C707S793000

Reexamination Certificate

active

06532458

ABSTRACT:

BACKGROUND OF THE INVENTION
Computer database systems manage the storage and retrieval of data in a database. A database comprises a set of tables of data along with information about relations between the tables. Tables represent relations over the data. Each table comprises a set of records or tuples of data stored in one or more data fields. The records of a table are also referred to as rows, and the data fields of records in a table are also referred to as columns. A database server processes data manipulation statements or queries, for example, to retrieve, insert, delete, and update data in a database. Queries are defined by a query language supported by the database system.
For large databases such as data warehouses, for example, typical tools such as On Line Analytical Processing (OLAP) and data mining serve as middleware or application servers that communicate data retrieval requests to a backend database system through a query. Although the cost of executing ad-hoc queries against the backend can be expensive, many data mining applications and statistical analysis techniques can use a sample of the data requested through the query. Similarly, OLAP servers that answer queries involving aggregation (e.g., “find total sales for all products in the NorthWest region between Jan. 1, 1998 and Jan. 15, 1998”) benefit from the ability to present to the user an approximate answer computed from a sample of the result of the query posed to the database.
Sampling is preferably supported not only on existing stored or base relations but also on relations produced as a result of an arbitrary query. Sampling may be supported in relational databases as a primitive operation SAMPLE(R,f), for example, to produce a sample S of r tuples that is an f-fraction of a relation R. Fully evaluating a query Q to compute relation R only to discard most of relation R when applying SAMPLE(R,f), however, is inefficient. Preferably, query Q may be partially evaluated so as to produce only sample S of relation R.
For a given query tree T for computing a relation R that is the result of a query Q where SAMPLE(R,f) is the root or last operation of query tree T, pushing the sample operation down tree T toward its leaves would help minimize the cost of evaluating query Q as only a small fraction of stored and/or intermediate relations would be considered in evaluating query Q. The ability to commute the sample operation in this manner, however, depends on the relational operations used in query tree T. The standard relational operation of selection can be freely interchanged with sampling. With join operations, however, sampling may not be so easily commuted.
FIG. 1
illustrates a query tree
100
for obtaining a sample of a join of operand relations R
1
and R
2
. Query tree
100
is executed in accordance with a flow diagram
200
of FIG.
2
. For step
202
of
FIG. 2
, a relation J is computed by joining R
1
and R
2
, or J=R
1
R
2
. For step
204
, r tuples are randomly sampled from relation J to produce a sample relation S. Commuting the sample operation in query tree
100
to operand relations R
1
and R
2
, as illustrated by a query tree
300
in
FIG. 3
, would minimize the cost of obtaining a join sample because only samples of operand relations R
1
and R
2
would need to be joined. A join of samples of operand relations R
1
and R
2
, however, will not likely give a random sample of the join of operand relations R
1
and R
2
.
As one example:
R
1
(
A,B
)={(
a
1
,b
0
), (
a
2
,b
1
), (
a
2
,b
2
), (
a
2
,b
3
), . . . , (
a
2
,b
n
)}
and

R
2
(
A,C
)={(
a
2
,c
0
), (
a
1
,c
1
), (
a
1
,c
2
), (
a
1
,c
3
), . . . , (
a
1
,c
n
)}.
That is, relation R
1
is defined over attributes A and B. Among the n+1 tuples of relation R
1
, one tuple has an A-value a
1
and n tuples have an A-value a
2
, but all n+1 tuples of relation R
1
have distinct B-values. Similarly, relation R
2
is defined over attributes A and C. Among the n+1 tuples of relation R
2
, n tuples have an A-value a, and one tuple has an A-value a
2
, but all n+1 tuples of relation R
2
have distinct C-values.
Computing the equi-join of relations R
1
and R
2
over attribute A produces the following relation:
J=R
1
R
2
={(
a
1
,b
0
,c
1
), (
a
1
,b
0
,c
2
), (
a
1
,b
0
,c
3
), . . . , (
a
1
,b
0
,c
N
), (
a
2
,b
1
,c
0
), (
a
2
,b
2
,c
0
), (
a
2
,b
3
,c
0
), . . . , (
a
2
,b
n
,c
0
)}.
That is, relation J has n tuples with A-value a
1
and n tuples with A-value a
2
.
About one half of the tuples in a random sample S of relation J, or S⊂J. would likely have an A-value of a
1
while the remaining tuples would have an A-value of a
2
. A random sample S
1
of relation R
1
, or S
1
⊂R
1
, however, would not likely comprise tuple (a
1
,b
0
), and a random sample S
2
of relation R
2
, or S
2
⊂R
2
, would not likely comprise tuple (a
2
,c
0
). The join of samples S
1
and S
2
would then likely comprise no tuples and therefore would not likely give random sample S of relation J.
One prior sampling strategy for obtaining a sample S of a join of two relations R
1
and R
2
with respect to a join attribute A is illustrated as a flow diagram
400
in FIG.
4
.
For notational purposes, relations R
1
and R
2
have sizes n
1
and n
2
, respectively. The domain of join attribute A is denoted by D. For each value v of domain D, or v∈D, m
1
(v) and m
2
(v) denote the number of distinct tuples in relations R
1
and R
2
, respectively, that contain value v in attribute A. Then, &Sgr;
v∈D
m
1
(v)=n
1
and &Sgr;
v∈D
m
2
(v)=n
2
. A relation J results from the computation of the join of relations R
1
and R
2
, or J=R
1
R
2
, and n is the size of relation J, or n=|J|=|R
1
R
2
|. Then, n=&Sgr;
v∈D
m
1
(v)m
2
(v). For each tuple t of relation R
1
, the set of tuples in relation R
2
that join with tuple t is denoted as J
t
(R
2
)={t′∈R
2
|t′.A=t.A}; t
R
2
denotes the set of tuples in R
1
R
2
obtained by joining tuple t with the tuples in J
t
(R
2
); and |t
R
2
|=|J
t
(R
2
)|=m
2
(t.A). Similarly for each tuple t of relation R
2
, J
t
(R
1
)={t′∈R
1
|t′.A=t.A}; R
1
t denotes the set of tuples in R
1
R
2
obtained by joining tuples in J
t
(R
1
) with tuple t; and |R
1
t|=|J
t
(R
1
)|=m
1
(t.A).
For step
402
of
FIG. 4
, a variable r is initialized to the size of a sample relation S to be obtained from the join of relations R
1
and R
2
. For step
404
, a variable M is initialized to the upper bound on the number of join attribute values v in relation R
2
for all values v of domain D on attribute A. That is, M is the maximum number of any one join attribute value in relation R
2
. A tuple t
1
is randomly sampled from relation R
1
for step
406
. A tuple t
2
is then randomly sampled for step
408
from among all tuples of relation R
2
having a join attribute value t
2
.A that matches the join attribute value t
1
.A of tuple t
1
. For step
410
, a tuple T is computed as T=t
1
t
2
and output for sample relation S with a probability based on the number of tuples in relation R
2
having a join attribute value that matches that of tuple t, divided by M, or m
2
(t
2
.A)/M. If not output, the sample tuple t
1
is rejected for step
410
. If r tuples have not yet been output for sample relation S as determined for step
412
, steps
406
through steps
412
are then repeated until r tuples have been output to form sample relation S as determined for step
412
. Flow diagram
400
then ends for step
414
.
The sampling technique of
FIG. 4
in practice, however, requires indexes for random access to relations R
1
and R
2
, noting relation R
1
must be materialized for proper sampling because the rejection of tuples for step
410
requires that the number of samples from relation R

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Sampling for database systems does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Sampling for database systems, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Sampling for database systems will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3077880

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.