Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1999-03-23
2003-03-18
Vu, Kim (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06535869
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a computer system, and deals more particularly with a method, system, and computer-readable code for embedding a file index within a data file to which the index pertains.
2. Description of the Related Art
File indexing is well known in the computer art, and provides a technique for randomly accessing data from a file in order to locate data in the file more quickly than would be possible using a sequential search. A file index typically consists of entries containing (1) some type of key value that enables identifying a particular record in the file, and (2) an address that can be used to locate that record on a storage medium. Records in the file are then accessed by searching the index using the key value, and retrieving the associated record address. The key value used may be a value that is uniquely associated with, and forms a part of, each record from the file (such as an employee number in a file of employee records). Or, the key value may be computed using a mathematical algorithm to generate a result that is uniquely associated with a file record. This key generation technique is referred to as “hashing”, and many approaches to hashing are known in the art.
Many different approaches to structuring file indexes are also known in the art. Different structures provide different time estimates for average record retrieval, insertion, and deletion, as well as best case and worst case time estimates. One such index structure is a hash table, which is used with the hashed key values discussed above. Or, various tree structures may be used for a file index. When using a tree structure, each node of the tree typically contains some number of key values and other information related to the keys. As an example of the type of other information, some nodes will contain pointers that must be traversed further when searching for a particular key; other nodes will contain the record address for the record associated with that key (or a pointer to the stored record). Types of tree-structured indexes known in the art include B-trees, tries, B*-trees, and B′-trees. Differences among these tree structures include: how many key values can be stored in a node, whether full or partial keys are stored in the nodes, and whether record addresses are stored in intermediate nodes or only in leaf nodes of the tree. (The specific details of implementing these various tree structures, and advantages thereof, may be found in a data structures reference such as Chapters 7 and 8 of “Algorithms and Data Structures in C++”, Leendert Ammeraal, published by John Wiley and Sons (1996).)
Random-access files are typically stored on a persistent storage medium such as a disk drive. When an executing program needs to access a record located in the file, a portion of the stored disk file containing the record is read into memory. (Typically, a “page” of data is read into memory, where a page is 4096 bytes of data.) If the record is updated by the executing program, then the changed information must be rewritten to persistent storage when the updates are complete. The changed information may be rewritten in the same location where the prior version of the record was stored, or it may be written to a new location. (For example, if the updated record is now larger than the version it replaces, it may be necessary to move the record to a new storage area that can accommodate the larger record.) Moving the location of the record requires changing the record address information in the file index. When a record is to be added to a file, a file manager facility is consulted to locate available free space within the file. The new information is written into an area allocated by the file manager, and an entry for this new record is created in the file index. When a record is to be deleted from a file, the file manager deallocates the file space the record was using, and the entry for the deleted record must be removed from the index. These concepts are known in the art, and file management facilities are readily available.
As can be seen from the discussion above, making changes to records stored in a file (whether through updates, additions, or deletions) requires not only changing the stored record but also typically requires changes to the file index. A file and its index must be kept precisely synchronized as these changes occur, to preserve the integrity of the data. For example, suppose a record is to be deleted from a file. If the deletion completes successfully, but some type of error condition (such as a system failure due to hardware and/or software problems) occurs before the corresponding key is removed from the index, the index no longer accurately represents the stored file. Similarly, if a data record is added to a file but the index is not properly updated with the new record's key, the files are not synchronized and it is not possible to locate the new record using the index. Because updating a file on persistent storage is normally an expensive operation in terms of computing time, file revisions are typically made in a batch mode. In this approach, a number of revised records are kept in faster memory storage until some point in time, and then the entire group or batch of revisions is written to the persistent store at the same time. While this batch approach minimizes the overhead required for revising records, the higher number of records involved results in an increase in the potential for inconsistencies between the data file and its index, should an error occur. Known solutions for dealing with these types of data integrity problems include using transaction processing techniques, where the updating of both the data file and the index file is treated as a single atomic transaction. The updates are only “committed” (i.e. made permanent) to the persistent store if the entire update process completes successfully; otherwise, the updates to both the data file and the index are “rolled back” (i.e. discarded) as a group.
Providing program logic to detect and/or recover from a mismatch between a data file and its index is complex and error-prone. However, inefficiencies in program operation when storing a file and its index separately may occur even in the absence of error conditions. For example, separate allocation requests must be made when creating a data file and its index file. Similarly, when the file is no longer needed, separate deallocation requests must be made for the file and its index. Further, the separate physical locations of the data file and its index on the storage medium may cause processing inefficiencies. This is because accessing an index by an executing program is closely tied to accessing the corresponding data file: normally when a record update or retrieval occurs, the index is accessed and then the data file is accessed. This sequence is then repeated for the next record update or retrieval. If the data file is not stored physically near the index, the storage medium may exhibit thrashing behavior as the read/write mechanism moves back and forth between the two storage areas. Further, if the record address stored in the index is returned to a program, compaction of the data file (so that all unallocated blocks will be located at the end, for example) becomes complicated and error-prone because it may not be possible to determine when such record addresses cease being used by the program. Compaction of the data file requires rebuilding its associated index, so that a record's key entry will then be associated with its new location. If a program continues using a record address that it received prior to the compaction process, it will likely be referring to data other than what it expects (with a high probability of program failure as a result).
Accordingly, what is needed is a technique which avoids the problems discussed above. The present invention addresses these problems by providing a novel solution whereby a file index is embedded within the data file to whi
Doubet Marcia L.
International Business Machines - Corporation
Ray-Yarletts Jeanine S.
Truong Cam Y
Vu Kim
LandOfFree
Increasing efficiency of indexing random-access files... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Increasing efficiency of indexing random-access files..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Increasing efficiency of indexing random-access files... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3039942