Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-11-17
2001-12-25
Black, Thomas (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06334125
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates generally to methods for loading data into data structures, which include aggregated data at multiple levels, and more particularly to a method for loading data into a cube forest data structure, which includes aggregated data at multiple levels, that enables quick searching for particular aggregates and the compact storage of this data.
High corporate and government executives must gather and present data before making decisions about the future of their enterprises. The data at their disposal is too vast to understand in its raw form, so they must consider it in summarized form, e.g., the trend of sales of a particular brand over the last few time periods. “Decision support” software to help them is often optimized for read-only complex queries.
In general, there are four main approaches to decision support:
1. Virtual memory data structures made up of two levels in which the bottom level is a one or two dimensional data structure, e.g., a time series array or a spreadsheet-style two dimensional matrix (time against accounts). This can be generalized so that the bottom level index can have many dimensions. These are the dense dimensions (the ones for which all possible combinations exist). The sparse dimensions are at a higher level in the form of a sparse matrix or a tree. Queries on this structure that specify values for all the sparse dimensions work quite well. Others work less optimally.
2. Bit map-based approaches in which a selection on any attribute results in a bit vector. Multiple conjunctive selections (e.g., on product, date, and location) result in multiple bit vectors which are bitwise AND'd. The resulting vector is used to select out values to be aggregated. Bit vector ideas can be extended across tables by using join indexes.
3. One company has implemented specially encoded multiway join indexes meant to support star schemas (see, e.g., their website at http://www.redbrick.com/rbs/whitepapers/star\_wp.html). These schemas have a single large table (e.g., the sales table in our running example) joined to many other tables through foreign key join (to location, time, product type and so on in our example). This technique makes heavy use of such STARindexes to identify the rows of the atomic table (sales table) applicable to a query. Aggregates can then be calculated by retrieving the pages with the found rows and scanning them to produce the aggregates.
4. Another technique is known as materialized views for star schemas. In this technique a framework is used for choosing good aggregate views to materialize. Each aggregate view corresponds to a node in our template trees. The cost model in this technique disregards the possibility of indexes in measuring query cost. Further, this technique applies directly to current relational systems (or at least current relational systems that support materialized views). The algorithm used by this technique is a simple competitive greedy algorithm for optimizing the views to materialize (subject to the cost model), with guaranteed good properties. It goes like this: suppose that S is a set of views that might be queried. Now consider various views to materialize (including views outside S). Materialize the view that gives maximum benefit according to the author's cost model. While the author's cost model does not discuss indexes, recent work by others gives an algorithm that automatically selects the appropriate summary tables and indexes to build. The problem is NP-Complete, so heuristics are provided, which approximate the optimal solution extremely closely. This approach of a summary table and index approach as measured by space use or update time is relatively inefficient.
Yet another technique is massive parallelism on specialized processors and/or networks. Teradata® and Tandem® use this approach, making use of techniques for parallelizing the various database operators such as select,joins, and aggregates over horizontally partitioned data. Masspar, by contrast, uses a SIMD model and specialized query processing methods.
None of these techniques provides the efficiency and query aggregation speed desired by users as data size continues to grow and search engines become more and more sophisticated.
The present invention is therefore directed to the problem of developing a method for structuring data that enables the data to be stored in a storage medium in a compact form, yet permits rapid queries and aggregation of data.
SUMMARY OF THE INVENTION
The present invention solves this problem by providing a data structure known as a cube forest for use in a batch-load-then-read-intensively system. As a result of the present invention, the time to execute a bit vector query is significantly improved. Hierarchically split cube forests provide a method for efficiently duplicating information, and can be optimized to reduce update and storage costs. In summary, cube forests are most appropriate for read-intensive, update-rarely-and-in-large-batches multidimensional applications in an off-the-shelf (low cost) hardware environment.
According to the present invention, a method for structuring data with i key attributes (A
1
, . . . , A
i
) and additional value attributes for storage in a memory includes the steps of: a) defining a first forest F
1
as a single node labeled A
1
; b) constructing a subsequent forest F
n
according to the substeps of: (i) creating a node n; (ii) copying a previous forest F
j−1
; (ii) making each tree in the previous forest F
j−1
a subtree of the node n; (iv) creating another copy of the previous forest F
j−1
; and (v) defining the subsequent forest F
i
as a union of the previous forest F
j—1
and a tree rooted at the node n; and c) repeating step b) i−1 times until F
i
is constructed, wherein the data structure is F
i
. According to the present invention, the i key attributes can either be orthogonal attributes or nonorthogonal attributes.
Furthermore, the above method of the present invention is particularly advantageous when the paths in F
i
represent keys of identifying data records.
Further, according to the present invention, an index structure for storing and indexing aggregates over at least i key attributes (A
1
, . . . , A
i
) includes a plurality of i well-ordered trees, wherein a first tree includes one template node, and a next tree in the order includes a root template node having branches to duplicates of each of the previous trees, a total number of the template nodes is equal to 2
n
−1, 2
n−1
of which are leaf nodes, and a collection of trees represents a template for building a set of search structures on a data table, and an index subkey is a catenation of attributes from a template tree root to a node.
Another aspect of the present invention includes a data storage device that includes a data structure that conforms to the rules of a full cube forest over key attributes (A
1
, . . . , A
i
) and a means for storage of an aggregation of values at each node of the full cube forest, in which one aggregate value for each subkey is represented by the node and which appears in the data.
Yet another aspect of the present invention provides a method for structuring data comprising the steps of: a) organizing the data as a cube forest by creating a cube forest template for the data; b) creating an index on each tree within the cube forest template; c) for each template, choosing a path from a root of the template to be a spine of the tree, wherein the spine defines a composite index, and the index has a plurality of keys which are attributes of nodes in the spine concatenated together, whereby the spine partitions the h-split template, creating several subtrees; d) establishing a spine for each subtree; and e) continuing steps a) through d) until all template nodes are in some spine.
Another aspect of the present invention provides a method for designing a cube forest data structure for a given cube forest template F, which has a plurality of trees, said method comprising
Johnson Theodore
Shasha Dennis
AT&T Corp.
Black Thomas
Rones Charles L.
LandOfFree
Method and apparatus for loading data into a cube forest... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and apparatus for loading data into a cube forest..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus for loading data into a cube forest... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2577373