Methods of organizing data and processing queries in a...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000

Reexamination Certificate

active

06711563

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates to relational database management systems (RDBMS), and more particularly to computerized systems for storing and accessing large amounts of data.
In a non-limiting manner, the invention is applicable to “data warehouses”. On-line transaction processing (OLTP) systems, such as for bank teller transactions and airline reservations, are optimized for finding a record associated with a specific key, e.g. finding the information about employee 123124. By contrast, data warehouses are optimized for finding sets of records very quickly. The reason is that typical queries are of the form: “find all sales by region and quarter” or “find stores that sell the greatest volume of sportswear per month” or “select the top 5 stores for each product category for the last year”. Such queries must typically access large sets of rows in data tables. The query processing challenge is to process these queries without doing a linear scan of all or most of the database.
Five main approaches have been proposed to attack this problem: (i) multidimensional arrays; (ii) special indexes; (iii) table caching; (iv) optimized foreign key joins; and (v) approximation.
(i) Multidimensional Arrays (i.e. Matrices).
This strategy consists of implementing the data warehouse as a multidimensional array or matrix. Examples may be found in U.S. Pat. No. 5,359,724 and U.S. Pat. No. 5,864,857. Each dimension corresponds to an attribute of the data. For example, a sales table can be viewed as a matrix with coordinates: store location, product type, customer id, and so on. A particular sale can be identified by specifying all of these attributes. The strategy works well for small databases or very dense ones. By dense, we mean that the Cartesian product of possible values should all be meaningful, e.g., every customer is likely to buy every product from every store. Since this is rarely true, this scheme must be modified to deal with sparse values. This can be done by defining a notion of sparse attributes and dense ones. So, for example, it might be that every store carries every product (a dense relationship that can be stored in a matrix), but only some of these combinations are valid for any given customer. So, a conventional index would be used whenever customer sales are involved, but a dense one for queries involving store-wide or product-wide sales.
(ii) Special Indexes.
Bitmap indexes are an index structure tailored to data warehouses (see, e.g. U.S. Pat. No. 5,903,888). These indexes have already been used in some commercial products to speed up query processing. In its simplest form, a bitmap index on an attribute consists of one vector of bits (i.e. bitmap) per attribute value, where the size of each bitmap is equal to the number of records in the indexed relation. For example, if the attribute is day-of-week, then there would be seven bitmap vectors for that attribute, one for each day. The bitmap vector corresponding to Monday would have a 1 at position i if record i contains “Monday” in the day-of-week attribute. This single value-based approach is called a Value-List index. Other techniques (e.g. U.S. Pat. No. 5,761,652) associate bit vectors with ranges of values, so there could, for a salary attribute, be a vector for the range 0 to 20,000 Euros, 20,000.01 to 35,000 Euros, and so on. Still others associate each bit vector with a bit value (a
1
or a
0
) in a given position. So, if the attribute holds n bit numbers, then there would be 2n bit vectors (position
1
, bit value
1
; position
1
, bit value
0
; position
2
bit value
1
; . . . ).
The benefit of bit vectors is that it is easy to use multiple bit vectors to answer a single query. Consider a query on several predicates, each of which is indexed. Most conventional database management systems would use just one of the indexes (the one that is most “selective” so returns the fewest rows), though some systems might attempt to intersect the record identifiers of multiple indexes.
Bitmaps work better, because they are more compact and intersecting several bitmaps is much faster than intersecting several collections of record identifiers. In the best case, the improvement is proportional to the word size of the machine. For example, suppose the word size is 32 bits. Then two bit vectors can be intersected 32 bits at a time. Each set of 32 bits corresponds to 32 record identifiers being intersected. That best case occurs when each predicate is unselective (i.e. many records match each predicate value), but all the predicates together are quite selective. Consider for example the query: “Find people who have brown hair, glasses, ages between 30 and 40, blue eyes, work in the computer industry, live in California, . . . ”.
So, matrices are best when sets of predicates are dense (all, or nearly all, values in the Cartesian product are possible), bitmaps are best when predicates are neither dense nor individually selective. An intermediate approach (when there is insufficient density for matrices but many values in the Cartesian product are present) is to use multidimensional indexes. Multidimensional indexes such as quadtrees, R-trees and their successors are implemented as variable sized grids on a multidimensional space. The grids are of variable sizes because the population of points differs in different places in a hyperspace. For intuition, consider a map of equi-population rectangles of France. The rectangles would be far more dense in Paris than in the alps. Indexes like this work well for spatial data (where they are used to find the points contained in latitude-longitude quadrants). This alternative is little explored in the commercial arena except for geographical queries, however, because these schemes do not scale well with increasing dimensionality and commercial systems typically have far more than three dimensions.
(iii) Table Caching.
If one doesn't have the luxury to design new indexes on top of a database system (because one is not the implementer of that system) one can pre-compute a large number of anticipated aggregate queries and put them in tables. For example, if a large retailer frequently asks queries that sum the total sales across multiple stores or multiple products, one may store such information in special tables. The main cost of such a strategy is maintaining these tables in the face of updates. (Disk space is no longer a major factor.) In the example, every sale of item I at store S would have to update the total product sales table for I and the total store sales table for S. So, this strategy is worthwhile if there are few updates between queries. The strategy is not worthwhile if there are many.
(iv) Optimized Foreign Key Joins.
Most queries in multidimensional tables entail joins between a central “fact table” (e.g. sales detail) and a set of dimension tables (e.g. store description, product description, customer description). These are known as “foreign key joins” since the customer identifier in the sales table, for example, is a key of the customer description table. (A key is a value belonging to an attribute such that only one record has that value in the attribute.) One way to accelerate these joins is to create a linkage between fact table records and dimension records. This can be done in three basic ways
(a) create an index that holds fact table record identifiers and dimension table record identifiers;
(b) create bidirectional pointers between fact table records and dimension table rows—this is what “object-oriented” databases do;
(c) replace the customer record identifiers in the fact table by offsets into the dimension tables.
Choice (a) is the most independent of changes in the physical organization of the tables and therefore is best for heavily updated systems, because changes to the dimension table can be reflected in the index to that table alone. Choice (b) is the least flexible to physical reorganization, because reorganizing a dimension table would entail updating the fact table. Choice (c) is a compromise of the two in that certai

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Methods of organizing data and processing queries in a... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Methods of organizing data and processing queries in a..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Methods of organizing data and processing queries in a... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3187833

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.