Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-07-05
2003-04-15
Vu, Kim (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06549895
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to database management systems, and more particularly to a method and apparatus for analyzing data retrieval using index scanning.
BACKGROUND OF THE INVENTION
Database management systems are well-known in the art and comprise a computer system for recording and maintaining data. Typically, the database management system (DBMS) comprises the combination of an appropriate computer, direct access storage devices (DASD) or disk drives, and database management software. A relational database management system is a DBMS which uses relational techniques for storing and retrieving information. The relational database management system or RDBMS, such as the DB2 product from IBM, comprises computerized information storage and retrieval systems in which data is stored on disk drives or DASD for semi-permanent storage. The data is stored in the form of tables which comprise rows and columns. Each row or tuple has one or more columns.
The RDBMS is designed to accept commands to store, retrieve, and delete data. One widely used and well-known set of commands is based on the Structured Query Language or SQL. The term query refers to a set of commands in SQL for retrieving data from the RDBMS. The definitions of SQL provide that a RDBMS should respond to a particular query with a particular set of data given a specified database content. SQL however does not specify the actual method to find the requested information in the tables on the disk drives. The SQL requests are nonprocedural, also referred to as declarative, and users simply specify what is wanted, rather than specifying how to accomplish it. There are many ways in which a query can be processed and each consumes a different amount of processor and input/output access time. The system's optimizer must determine the optimal way (or access path) to get the data to the user. One way to access data is to sequentially scan every row in a table for those rows which match the search criteria. This type of scan is known as a table scan, because the entire table is scanned in sequence from beginning to end.
Rows of data are stored on pages on physical storage devices, usually disk drives or files. Data is transferred between the physical storage and the computer system's processing unit page by page even though only a single row may be needed from a given page. The time it takes to transfer data between physical storage and the processing unit is usually many times greater than the time it takes to process the data in the processing unit. To manipulate data in a relational database system, the rows must first be transferred from physical storage to the processing unit, then processed in the processing unit, and finally transferred back to physical storage. Because transferring takes so much longer than processing, the total time required to manipulate the data can be dramatically reduced if the number of transfers can be reduced.
Most relational database systems maintain indexes for their tables. An index is a list stored separately from the rows and used to access the rows in a selected order. An index comprises many index entries; each index entry includes a key value and an identifier of or pointer to a row which contains that key value. Indexes are physically stored on index pages. To scan a table's rows using an index, the index entries are read sequentially and pointers in the entries are used to access the rows in the index's order. This type of scan is called an index sequential scan, or index scan for short. In general, a database system's optimizer has to choose the best scan among table scans and various index scans. Hence, the optimizer needs to be aware of factors that affect the number of page transfers required.
The prior art teaches that the number of page transfers for an index scan depends not only on the number of data pages but also on the organization of the index. For example, the prior art teaches that an index scan through a perfectly clustered index (also referred to as a clustered index scan) is often preferable to a table scan. An index is perfectly clustered if, in an index scan, each data page is accessed only once. For this to occur, the data rows, when accessed in index order, must be in the same sequence as the sequence in which they are stored in the data pages of physical storage. An index scan through a clustered index is fast because the number of data page accesses is minimized since duplicate accesses to the same data page are unnecessary.
The prior art also teaches that an index scan through a non-clustered index (also referred to as a non-clustered index scan) is often undesirable. An index is non-clustered if, in an index scan, the data pages are accessed back and forth at random. Index scans through non-clustered indexes are extremely slow, because there is much thrashing back and forth between data pages as the index requires separate data pages to be randomly accessed and transferred into and out of the processing unit's main memory but accesses only one row from the many on each such page.
U.S. Pat. No. 5,043,872, which issued Aug. 27, 1991 to International Business Machines Corporation, discloses a method for measuring the degree of clustering of an index as a single value known as a clustering coefficient. This clustering coefficient can be used by a database system's optimizer to estimate the number of page transfers for an index scan.
U.S. Pat. No. 5,778,353, which issued Jul. 7, 1998 also to International Business Machines Corporation, discloses that the number of page transfers for an index scan depends not only on the organization of the index but also on the configuration of the database system. In particular, the number of transfers depends on the amount of the processing unit's main memory that is dedicated to caching data pages from a table. This memory cache is usually referred to as a buffer pool. The invention disclosed in U.S. Pat. No. 5,778,353 models the relationship between the size of the buffer pool and the number of transfers required for a scan of any selected index. This relationship is sometimes called Full index scan Page Fetch or FPF information.
In a traditional database management system, the database administrator or user can issue a request to collect statistics so that the system's optimizer has up-to-date information on tables and indexes. For example, DB2 Universal Database provides a command called “runstats” that allows users to collect indexes' clustering coefficients, among other statistics. DB2's runstats command also gives users the option of collecting FPF (i.e. Full index scan Page Fetch) information. Prior versions of DB2 (i.e. versions 6.1 and older) compute the FPF information by simulating a set of buffer pools while scanning an index. For each index entry, each simulated buffer pool is examined to determine if the index entry requires a data page transfer. The FPF information thus gathered is accurate, but the process of computation is extremely time-consuming. In contrast, the process of computing the clustering coefficient is very efficient, but can often yield poor estimates of the number of page transfers.
Accordingly, there remains a need for a mechanism for analyzing data retrieval requirements in a database management system.
SUMMARY OF THE INVENTION
The present invention provides mechanism for estimating the number of page transfers required to scan any selected index for any given buffer pool size. A typical use of this invention is in the statistics collection facility of a database management system.
Advantageously, the mechanism according to the present invention provides substantial increases in performance over the Full index scan Page Fetch (FPF) information collection method used in prior versions of DB2 Universal Database, i.e. versions 6.1 and older. Also, because the mechanism accounts for the buffer pool size, it is capable of much more accurate estimation than prior methods that fail to use information about the buffer pool.
International Business Machines - Corporation
Liang Gwen
Sawyer Law Group LLP
Vu Kim
LandOfFree
Method and apparatus for analyzing data retrieval using... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and apparatus for analyzing data retrieval using..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus for analyzing data retrieval using... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3013142