Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-10-13
2002-03-19
Jung, David (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06360213
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates generally to technology used to access and manipulate data in large data sets. More particularly, the present invention is directed to a system and method for automatic and optimal access of data using auxiliary data structures known as indexes. Such indexes permit data queries of large data sets to run faster and require less memory to process.
Indexes are widely used in database tables. Typically, an index is created for one or more columns in a database table. One type of index is the “target index.” An entry in a target index logically consists of a <key:value> pair, where the “key” corresponds to a datum in the data set that is to be indexed and the “value” indicates which record(s) in the data set have this particular key. For example, an index built on an “Age” column might have an entry for the age value
35
in the form: <35:1,4,7>, meaning that records 1, 4, and 7 in the indexed data set have an age value 35.
The above example represents the value as a list of record numbers. However, multiple possible representations for the value list exist. One possible representation is a bit vector or “bitmap,” in which a bit exists for each row in the data set and the n
th
bit being set to one indicates that the n
th
record in the data set has this key value. Depending on the number of distinct keys and the number of records in the data set, the optimal representation may vary. For example, for a small data set with many distinct keys, the representation of a list of row pointers is quite efficient. However, in a large data set with few distinct keys, the bit vector representation is typically more efficient.
In general, the representation chosen for an index strongly depends upon the selectivity that a given key imparts to a query constraint. That is, it depends on the likelihood that use of the key in a query will very specifically select a given record. The term weakly selective describes a constraint that retrieves many records from a table. Weak selectivity typically occurs when a column in a very large table has a small domain (set of possible values). For example, the domain of a Gender column in an “Employees” table consists of only two possible values for every row—Male or Female. Constraints on that column will be weakly selective; they will usually retrieve a very large list of rows.
Larger domains might also give rise to weak selectivity. For example, an Age column in the same table would have a much larger domain than a Gender column, but constraints on age might still be weakly selective, especially if the data is not uniformly spread across the domain or if the constraints specify values that dominate the domain.
A strongly selective constraint will retrieve only relatively few records from a large number of records. For example, a query constraint on a social security number column of the Employees table will generally be strongly selective.
The term “cardinality” is sometimes used to refer to domain size. A column with high cardinality has many possible keys. For example, the cardinality of a social security number is about one billion. In contrast, the cardinality of gender is two (male and female). Generally, though not necessarily, constraints on columns having high cardinality are strongly selective, and constraints on columns having low cardinality are weakly selective.
Conventional B-tree indexes are well suited for columns of high cardinality, while target indexes are well suited for columns having low or intermediate cardinality. As noted, a target index consists of a <key:value> pair.
Some database management systems allow flexibility in representing a target index; that is, various representations are available for the indexes that the database administrator creates. For example, an index on a “Gender” column of a data set may be represented as a bitmap, while the index on an “Age” column may be represented as a compressed list of record IDs. Still further, a column of “zip code” might be stored as an uncompressed list of record IDs. The choice should be driven, at least in part, by the storage space required for the index. A small storage requirement requires less memory to process and generally allows queries run faster (although a compressed list must allow additional processing time for decompression and recompression).
In the Red Brick™ Warehouse relational database management system available from Red Brick Systems, Inc. of Los Gatos, Calif., a user can choose the representation of a target index by specifying the relative domain size for the column being indexed. For example, if the user specifies that the domain is small (e.g., gender), the system will generate a target index in which the list of records is represented as a bitmap. Alternatively, if the user specifies that the domain size is medium (e.g., the attribute is states), the system will build an index in which the value is provided as a compressed row list. Still further, if the user specifies that the domain size is large (e.g., zip code), the system will build an index in which the value is represented as an uncompressed row list. For very large domains, such as social security number, the system should be instructed to build a B-tree index.
While this approach allows the system to intelligently build an index that will usually be optimal for a given attribute, in some cases “data skew” within a given column can degrade the performance of the index. Data skew results when the number of records per key in the index varies widely. For example, in a table of customers, the number of records associated with the state of California can be expected to be much larger than the number of records associated with the state of Delaware (due to the population difference between these two states). Thus, a query constrained to California customers will return a much larger number of records than a query constrained to Delaware records.
The effect of data skew on optimal index representations is illustrated by the following example. Assume that a company has a customer list containing one million records and that the company does business in California and Delaware, among other states. Nine percent of the company's customers are based in California and 0.1 percent are based in Delaware. Therefore, the customer table for this company will contain 90,000 rows in which the customer's state is California and only 1,000 rows in which the customer's state is Delaware. If the company decides to build an index on the state column in its customer table, it might choose either a bitmap representation or a list representation.
Any bitmap representation will include 1,000,000 bits per key (state). Thus, both the California key and the Delaware key will have an associated bitmap of 1,000,000 bits each. Each of those bits is associated with one of the rows in the customer table. A value of 1 at a particular position in the bitmap indicates that the corresponding row of the table contains a record for the key associated with the bitmap. In the case of California, 90,000 of the bits will be 1s, and in the case of Delaware, only 1,000 of the bits will be 1s.
Alternatively, the index on the state column could be represented as a list. Assuming that each row ID in a list representation requires a storage space of six bytes (48 bits), then 4.3 million bits are required to store a list of the 90,000 row-IDs for California customers. Clearly, it is more efficient to have the California key index entry represented as a bitmap.
For Delaware, only 48,000 bits (48 bits/customer×1000 customers) are required to store its index entry in a list representation. As this is significantly less than the 1,000,000 bits required for the bitmap representation, it would be more efficient to represent the value portion of the Delaware customers index entry as a list.
In view of the problem associated with data skew, an improved representation for target indexes would be desirable.
SUMMARY OF THE INVENTION
The present invention addresses t
Balachandran Arunachalam
Byard Jeffrey A.
Fernandez Phillip M.
Johnson Galt
Morandi David L.
Fish & Richardson P.C.
International Business Machines - Corporation
Jung David
LandOfFree
System and method for continuously adaptive indexes does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with System and method for continuously adaptive indexes, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for continuously adaptive indexes will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2866817