Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-06-22
2002-07-23
Rones, Charles L. (Department: 2175)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000
Reexamination Certificate
active
06424972
ABSTRACT:
FIELD OF THE INVENTION
The present invention pertains generally to computer-implemented databases, and more particularly to compressing records in such databases.
BACKGROUND OF THE INVENTION
Online analytical processing (OLAP) is an integral part of most data warehouse and business analysis systems. Compared with online transaction processing (OLTP), OLAP services provide for fast analysis of multidimensional information. For this purpose, OLAP services provide for multidimensional access and navigation of the data in an intuitive and natural way, providing a global view of data that can be “drilled down” into particular data of interest. Speed and response time are important attributes of OLAP services that allow users to browse and analyze data online in an efficient manner. Further, OLAP services typically provide analytical tools to rank, aggregate, and calculate lead and lag indicators for the data under analysis.
In OLAP, information is viewed conceptually as cubes, consisting of dimensions, levels, and measures. In this context, a dimension is a structural attribute of a cube that is a list of members of a similar type in the user's perception of the data. Typically, there are hierarchy levels associated with each dimension. For example, a time dimension may have hierarchical levels consisting of days, weeks, months, and years, while a geography dimension may have levels of cities, states/provinces, and countries. Dimension members act as indices for identifying a particular cell or range of cells within a multidimensional array. Each cell contains a value, also referred to as a measure, or measurement.
One issue regarding the design of multidimensional databases is how to store the cell information in the multidimensional space. One potential design choice is to represent the multidimensional space as an array of cells, with the size of the array determined by the multiplication of the number of points in each dimension. A significant problem with this approach is that the size of the database grows exponentially as the number and size of the dimensions increase. This leads to a rapid depletion of the physical resources such as persistent storage and RAM required to implement the database. This phenomenon is known as data explosion for multidimensional databases.
Additionally, space is wasted in the above-mentioned approach, as data in multidimensional databases tends to be sparse. That is, not every cell is expected to have a value or measure associated with it. For example, consider a Store dimension having a hierarchy of Country, State, and City specifying the location of a store, and a Product dimension having a product identification and a product count measure. No store in the database will be expected to stock every possible product, and in fact any one store may only stock a small percentage of the available products. In this situation, most of the cells in the multidimensional space would contain no data, thus wasting much of the space allocated to the database.
Also, almost every cell in a particular column of a multidimensional space may contain exactly the same value. For example, the cell in each record indicating the “Country” field for each store might always contain the same value representing the United States of America. In this case, all of the cells in this column of a multidimensional space could be represented by a single value instead of wasting an entire column of cells in the multidimensional space.
Another issue relates to the capability to perform aggregations on the multidimensional data. Databases are commonly queried for aggregations (e.g. summaries, minimums, maximums, counts, etc.) of detail data rather than individual data items. For example, a user might want to know sales data for a given period of time without regard to geographical distinctions. These types of queries are efficiently answered through aggregations. Aggregations are precomputed summaries of selected detail data that allow an OLAP system or a relational database to respond quickly to queries by avoiding collecting and aggregating detailed data during query execution. Without aggregations, the system needs to scan all of the rows containing the detailed data to answer these queries, resulting in potentially substantial processing delays. With aggregations, the system computes and materializes aggregations ahead of time so that when the query is submitted to the system, the appropriate summary already exists and can be sent to the user much more quickly. Calculating these aggregations, however, can be costly, both in terms of processing time and in terms of disk space consumed.
Many OLAP data stores contain read only data stored as records of a fixed size on write once, read many media such as CD-ROM and DVD, which are well suited to storing the records in a compressed format. Additionally, since compressed data is typically loaded into memory from a secondary storage device (such as a CD-ROM) faster than the equivalent uncompressed data (fewer bits to move) and decompressing methods are performed in memory, accessing and decompressing a compressed record from a secondary storage device can be significantly faster than accessing the equivalent uncompressed record. Also, a compression method of read only data can employ known characteristics of the floating point values to clean any noise created by its storage in a computer. Thus, a compression/decompression method for read only records of a fixed size in an OLAP data store can provide many benefits besides a reduction in the physical size of the stored data.
It is with respect to these considerations and others that the present invention has been made.
SUMMARY OF THE INVENTION
In accordance with one preferred aspect, the present invention relates to a method of compressing data in a plurality of records in a data store. The plurality of records are divided into at least one segment that includes a predetermined number of records that are arranged in a table. Each row represents a separate record and each column represents a particular field in each record. For each column that has floating point data, the floating point data is converted into integer data for each field in the column. The column width is set equal to zero bits for each column that has the same data repeated in each field. Also, for each column having integer data in the field associated with each record, the column width is set equal to the minimum number of bits necessary to represent the largest integer value in the column. Further, a header is included with the segment that indicates the predetermined number of records, the width of each column in the segment, the precision of each conversion from floating point data to integer data, the repeated data, and the original width for each column. When a record is accessed in a compressed segment, the header information is employed to decompress the width of the columns and restore the original data in each field of the accessed record.
In accordance with still other aspects of the invention, each record includes read only data and each record has a fixed size. Also, the type of data associated with each column in the table is determined.
In accordance with yet other aspects of the invention, an exponent of a conversion mechanism is iteratively incremented to determine the minimum precision necessary to convert a particular floating point value into an integer value. The iteration begins with the exponent representing a minimum value for a type of number represented in a computer and ends with the maximum value of this type of number. The number type includes floating point and real.
In accordance with preferred aspects, the present invention relates to a system for compressing and decompressing a plurality of records. A processor is in communication with a device for a computer readable medium. An operating environment executes on the processor from the computer-readable medium in a data store. An OLAP server executes under the control of the operating environment and performs substantially the same
Berger Alexander
Netz Amir
Petculescu Cristian
Branch John W.
Merchant & Gould P.C.
Microsoft Corporation
Rones Charles L.
LandOfFree
Floating point conversion for records of multidimensional... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Floating point conversion for records of multidimensional..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Floating point conversion for records of multidimensional... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2839159