System and method for performing I/O-efficient join processing

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000, C707S793000

Reexamination Certificate

active

06282533

ABSTRACT:

CROSS REFERENCES TO RELATED APPLICATIONS
This application is related to another U.S. patent application Ser. No. 09/163,939, now abandoned, and entitled “SYSTEM AND METHOD FOR PERFORMING I/O-EFFICIENT BATCHED SEARCHING” having common inventors and a common assignee and U.S. patent application Ser. No. 09/163,943, pending, and entitled “SYSTEM AND METHOD FOR PERFORMING SCALABLE SWEEP BASED SPATIAL JOIN” having common inventors and a common assignee, both incorporated by reference herein.
FIELD OF THE INVENTION
The present invention is directed towards I/O-efficient join processing and in particular, towards join operations in temporal, spatial and constraint databases.
BACKGROUND OF THE INVENTION
Input and output (I/O) limitations can be a critical aspect in achieving acceptable system performance in many large scale applications such as those arising in VLSI and CAD design, spatial databases and geographic information systems. Even though CPU speeds have, over the past decade or so, increased at an annual rate of 40-60%, disk transfer rates have only increased 7-10% annually. Increases in internal memory sizes have not been enough to keep pace with these large applications which generate enormous amounts of data.
Data will, therefore, not always fit in internal memory. Secondary or external memory is therefore needed to accommodate the large blocks of produced data. The successful realization of any data model in a large-scale database requires supporting its language features with efficient secondary-storage manipulation. Along with retrieval, the join is one of the most I/O-intensive operations in database systems. As the join operation is also one of the fundamental relational database query operations, a considerable amount of research has been undertaken in an attempt to develop I/O efficient join techniques in the relational database model.
The join operation facilitates the retrieval of information from two different sets based on the Cartesian product of the two sets. Specifically, the join operation is used to combine tuples (rows of information) from two or more sets based on common information. Efficient implementation of the join operation is difficult since no predefined links between the sets are required to exist.
In this context, the standard measure of efficiency for a join operation is the number of I/O operations performed by the technique in question. Moreover, it is important to develop techniques that have provably good worst-case bounds and that are also efficient in practice. The most efficient technique has a lower bound of &OHgr;(n log
m
n+t) I/O operations or running time. This is implied by the well-known lower bound for sorting in external memory, where B is the units of data per disk block, M is the total amount of main memory, N is the units of data in the problem, T is the units of data in the solution, n=NIB, t=TIB, m=M/B and
log
m

n
=
max
(
1
,
log



n
log



m
}
.
See A. Aggarwal & J. S. Vitter, “The I/O log m Complexity of Sorting and Related Problems,” Communications of the ACM 31(9) (1988), pps. 1116-1127.
The general data structure problem in many data models is the storage and manipulation of d-dimensional rectangles. The term rectangle is used in a general sense to denote intervals in one dimension, rectangles in two dimensions and hyper-rectangles in d dimensions. It is also assumed that the sides of the rectangles are parallel to the coordinate axes. As such, indexing or retrieval in many data models reduces to d-dimensional range searching over d-dimensional rectangles. The join in many data models can be defined as the intersection between two sets of rectangles in d dimensions. Although other definitions of the join are possible, based on inequalities, dominance, or proximity, the intersection-based join problems are good representatives of join problems.
In one dimension, the join is simply the set of intersections between two sets of intervals. This problem is the prototypical join problem in temporal and constraint data models. The prior art techniques used to solve the interval join problem suffer various drawbacks.
On-line interval intersection occurs where a data structure is built on an input set of intervals and then queried (or updated) in an on-line fashion. That is, the results of a query have to be returned before the next query is processed. This problem has been extensively studied both in main memory and secondary storage. In particular, the open problem of whether it is possible to build a dynamic, worst-case optimal data structure for this problem was recently resolved. See L. Arge & J. S. Vitter, “Optimal Dynamic Interval Management in External Memory,” IEEE Symp. on Foundations of Comp. Sci. (1996). However, directly applying the on-line intersection technique to the one-dimensional join problem results in a running time of O(N log
B
n+t) I/O operations, which is a factor of &THgr;(B (log m)/(log B)) away from the optimal.
Off-line interval intersection, where a stream of queries is submitted to a data structure which processes them in a “lazy” fashion, has also been studied. Directly applying the off-line data structure to the one-dimensional join problem results in a technique that is optimal with respect to the running time, but uses a non-optimal O(n log
m
n) disk blocks of storage.
An asymptotically optimal (but somewhat impractical) method for the one-dimensional join problem can be obtained by using the reduction of interval intersection to two-dimensional range searching and by using the batch two-dimensional range query techniques. On the other hand, the approaches in the temporal database literature for interval join are based on heuristics and do not offer good asymptotic worst-case bounds.
In two dimensions, the join between two relations is the intersection between rectangles in the plane. This join problem has a very elegant solution in main memory that uses priority search trees and plane sweeping to achieve an optimal running time. In secondary storage, this problem can be solved in an asymptotically optimal way by reduction to the problems of line segment intersection and batched range searching. This solution, however, in practice will be much more inefficient than that of the methodology of the invention as described below.
The general problem of finding intersections between two sets of d-dimensional hyper-rectangles has also been studied. Specifically, the prior art techniques focus on efficient internal-memory methods for reporting intersections between rectangles in d-dimensional space. The fastest currently known internal-memory method runs in O(Nlog
2
(d−1)
N+T) time and linear space, and several earlier results exhibited some additional logarithmic factors in time as well as in space.
SUMMARY OF THE INVENTION
The invention provides I/O-efficient methods for the d-dimensional rectangle-join problem in one, two, three and arbitrary higher dimensions. The methods enable I/O-efficient processing for join problems in temporal, spatial, and constraint databases. Importantly, for one and two dimensions, the present invention methods are I/O-optimal.
Advantageously, a relatively simple and elegant optimal method is provided for the one-dimensional join problem. In an exemplary embodiment of the one-dimensional join method, after an initial sort of the two input relations, the method uses a single scan of the sorted relations, in which it maintains two simple list structures called I/O-lists, to produce the output. Importantly, the method outputs all intersections only once. This method is very amenable to optimizations that further reduce the number of I/O operations needed, thus making it an ideal candidate for practical use.
Additionally, an optimal solution for the two-dimensional join problem is provided, based on the above I/O-list structure together with distribution-sort and distribution-sweep techniques. Specifically, by adjusting the fan-out of the recursion with respect to the number of I/O-lists that hav

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

System and method for performing I/O-efficient join processing does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with System and method for performing I/O-efficient join processing, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for performing I/O-efficient join processing will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2469341

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.