Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1999-03-15
2003-04-01
Alam, Hosain T. (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000
Reexamination Certificate
active
06542886
ABSTRACT:
TECHNICAL FIELD
The present invention relates generally to the field of database systems. More particularly, the present invention relates to the field of sampling records in a database system.
BACKGROUND OF THE INVENTION
Computer database systems manage the storage and retrieval of data in a database. A database comprises a set of tables of data along with information about relations between the tables. Tables represent relations over the data. Each table comprises a set of records or tuples of data stored in one or more data fields. The records of a table are also referred to as rows, and the data fields of records in a table are also referred to as columns. A database server processes data manipulation statements or queries, for example, to retrieve, insert, delete, and update data in a database. Queries are defined by a query language supported by the database system.
For large databases such as data warehouses, for example, typical tools such as On Line Analytical Processing (OLAP) and data mining serve as middleware or application servers that communicate data retrieval requests to a backend database system through a query. Although the cost of executing ad-hoc queries against the backend can be expensive, many data mining applications and statistical analysis techniques can use a sample of the data requested through the query. Similarly, OLAP servers that answer queries involving aggregation (e.g., “find total sales for all products in the NorthWest region between Jan. 1, 1998 and Jan. 15, 1998”) benefit from the ability to present to the user an approximate answer computed from a sample of the result of the query posed to the database.
Sampling is preferably supported not only on existing stored or base relations but also on relations produced as a result of an arbitrary query. Samping may be supported in relational databases as a primitive operation SAMPLE(R,f), for example, to produce a sample S of r tuples that is an f-fraction of a relation R. Fully evaluating a query Q to compute relation R only to discard most of relation R when applying SAMPLE(R,f), however, is inefficient. Preferably, query Q may be partially evaluated so as to produce only sample S of relation R.
For a given query tree T for computing a relation R that is the result of a query Q where SAMPLE(R,F) is the root or last operation of query tree T, pushing the sample operation down tree T toward its leaves would help minimize the cost of evaluating query Q as only a small fraction of stored and/or intermediate relations would be considered in evaluating query Q. The ability to commute the sample operation in this manner, however, depends on the relational operations used in query tree T. The standard relational operation of selection can be freely interchanged with sampling. With join operations, however, sampling may not be so easily commuted.
FIG. 1
 illustrates a query tree 
100
 for obtaining a sample of a join of operand relations R
1 
and R
2
. Query tree 
100
 is executed in accordance with a flow diagram 
200
 of FIG. 
2
. For step 
202
 of 
FIG. 2
, a relation J is computed by joining R
1 
and R
2
, or J=R
1
∞R
2
. For step 
204
, r tuples are randomly sampled from relation J to produce a sample relation S. Commuting the sample operation in query tree 
100
 to operand relations R
1 
and R
2
, as illustrated by a query tree 
300
 in 
FIG. 3
, would minimize the cost of obtaining a join sample because only samples of operand relations R
1 
and R
2 
would need to be joined. A join of samples of operand relations R
1 
and R
2
, however, will not likely give a random sample of the join of operand relations R
1 
and R
2
.
As one example:
R
1
(A,B)={(a
1
,b
0
), (a
2
,b
1
), (a
2
,b
2
), (a
2
,b
3
), . . . , (a
2
,b
n
)}
and
 R
2
(A,C)={(a
2
,c
0
), (a
1
,c
1
), (a
1
,c
2
), (a
1
,c
3
), . . . , (a
1
,c
n
)}.
That is, relation R
1 
is defined over attributes A and B. Among the n+1 tuples of relation R
1
, one tuple has an A-value a
1 
and n tuples have an A-value a
2
, but all n+1 tuples of relation R
1 
have distinct B-values. Similarly, relation R
2 
is defined over attributes A and C. Among the n+1 tuples of relation R
2
, n tuples have an A-value a
1 
and one tuple has an A-value a
2
, but all n+1 tuples of relation R
2 
have distinct C-values.
Computing the equi-join of relations R
1 
and R
2 
over attribute A produces the following relation:
J=R
1
∞R
2
={(a
1
,b
0
,c
1
), (a
1
,b
0
,c
2
), (a
1
,b
0
,c
3
), . . . , (a
1
,b
0
,c
n
), (a
2
,b
1
,c
0
), (a
2
,b
2
,c
0
), (a
2
,b
3
,c
0
), . . . , (a
2
,b
n
,c
0
)}.
That is, relation J has n tuples with A-value a
1 
and n tuples with A-value a
2
.
About one half of the tuples in a random sample S of relation J, or S⊂J, would likely have an A-value of a
1 
while the remaining tuples would have an A-value of a
2
. A random sample S
1 
of relation R
1
, or S
1
⊂R
1
, however, would not likely comprise tuple (a
1
,b
0
), and a random sample S
2 
of relation R
2
, or S
2
⊂R
2
, would not likely comprise tuple (a
2
,c
0
). The join of samples S
1 
and S
2 
would then likely comprise no tuples and therefore would not likely give random sample S of relation J.
One prior sampling strategy for obtaining a sample S of a join of two relations R
1 
and R
2 
with respect to a join attribute A is illustrated as a flow diagram 
400
 in FIG. 
4
.
For notational purposes, relations R
1 
and R
2 
have sizes n
1 
and n
2
, respectively. The domain of join attribute A is denoted by D. For each value v of domain D, or v&egr;D, m
1
(v) and m
2
(v) denote the number of distinct tuples in relations R
1 
and R
2
, respectively, that contain value v in attribute A. Then, &Sgr;
v&egr;D 
m
1
(v)=n
1 
and &Sgr;
v&egr;D 
m
2
(v)=n
2
. A relation J results from the computation of the join of relations R
1 
and R
2
, or J=R
1
∞R
2
, and n is the size of relation J, or n=|J|=|R
1
∞R
2
|. Then, n=&Sgr;
v&egr;D 
m
1
(v)m
2
(v). For each tuple t of relation R
1
, the set of tuples in relation R
2 
that join with tuple t is denoted as J
t
(R
2
)={t′&egr;R
2
|t′.A=t.A}; t∞R
2 
denotes the set of tuples in R
1
∞R
2 
obtained by joining tuple t with the tuples in J
t
(R
2
); and |t∞R
2
|=|J
t
(R
2
)|=m
2
(t.A). Similarly for each tuple t of relation R
2
, J
t
(R
1
)={t′∞R
1
|t′.A=t.A}; R
1
∞t denotes the set of tuples in R
1
∞R
2 
obtained by joining tuples in J
t
(R
1
) with tuple t; and |R
1
∞t|=|J
t
(R
1
)|=m
1
(t.A).
For step 
402
 of 
FIG. 4
, a variable r is initialized to the size of a sample relation S to be obtained from the join of relations R
1 
and R
2
. For step 
404
, a variable M is initialized to the upper bound on the number of join attribute values v in relation R
2 
for all values v of domain D on attribute A. That is, M is the maximum number of any one join attribute value in relation R
2
. A tuple t
1 
is randomly sampled from relation R
1 
for step 
406
. A tuple t
2 
is then randomly sampled for step 
408
 from among all tuples of relation R
2 
having a join attribute value t
2
.A that matches the join attribute value t
1
.A of tuple-t
1
. For step 
410
, a tuple T is computed as T=t
1
∞t
2 
and output for sample relation S with a probability based on the number of tuples in relation R
2 
having a join attribute value that matches that of tuple t
1 
divided by M, or m
2
(t
2
.A)/M. If not output, the sample tuple t
1 
is rejected for step 
410
. If r tuples have not yet been output for sample relation S as determined for step 
412
, steps 
406
 through steps 
412
 are then repeated until r tuples have been output to form sample relation S as determined for step 
412
. Flow diagram 
400
 then ends for step 
414
.
The sampling technique of 
FIG. 4
 in
Chaudhuri Surajit
Motwani Rajeev
Narasayya Vivek
Alam Hosain T.
Microsoft Corporation
Nguyen Tam
Watts, Hoffmann, Fisher & Heinke Co. LPA
LandOfFree
Sampling over joins for database systems does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Sampling over joins for database systems, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Sampling over joins for database systems will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3071611