Data processing: database and file management or data structures – Database design – Data structure types
Utility Patent
1998-10-26
2001-01-02
Amsbury, Wayne (Department: 2771)
Data processing: database and file management or data structures
Database design
Data structure types
C711S100000
Utility Patent
active
06169990
ABSTRACT:
BACKGROUND OF INVENTION
The present invention relates to computer databases and in particular to a method of constructing a compressed database, for example so that the database can be stored in solid state memory.
Modern information storage systems are required to handle large amounts of data in a cost effective manner and are designed to allow access to the stored data easily and quickly. Considerable research has been performed on databases with most attention being focused on ferro-magnetic disc resident databases which are deemed suitable for handling large amounts of data due to the relatively low cost of the technology. Unfortunately, even with the use of the latest technology, individual read/write operations involving disc resident databases are inherently slow due to the mechanical nature of the disc system. In order to improve disc resident databases therefore, much attention has been focused upon increasing the speed of the databases by alleviating the disc input/output bottleneck.
In its simplest form a computer database is typically arranged in the form of a table which is stored in a continuous area of disc memory (typically the tables are stored row after row in a sequence of consecutive memory locations). The table is divided into a series of rows and columns to provide, at the intersection of the rows and columns, locations in which individual data values can be stored. The values in a given row in combination provide a “tuple” which is unique to that row. For example, in a typical employee database storing information concerning say 100 employees, the database would typically comprise 100 tuples, each tuple being defined over a plurality of fields, for example, the fields of name, age, employee number etc.
Databases are often organised as a set of linked tables in a format known as a relational database. A relational database consists of a set of stored data representations—the relations—and a set of operations defined to operate on the relations. Each relation is a set of tuples and the structure of a relation is illustrated in FIG.
1
. Each tuple is unique, and is a concatenation of an ordered set of data values. The data values in a particular field of all the tuples of a particular relation form a column of the relation but also belong to a domain (i.e. a set of data values greater than or equal to the number of data values in that column of the relation). Different columns from the same or different relations may belong to the same domain, e.g. where all these columns contain say dates. In a conventional implementation, the relations are in the relational store, with the tuples implemented as records directly mapping the input form of the information. A ‘relational processor’ carries out the defined relational operations, to and from input/output devices and the relational store.
The actual data values are physically stored in an ordered aggregate record. Thus a conventional database system must allocate sufficient storage space to allow for the largest storage representation resulting in a considerable waste of storage for all but the exceptional tuple.
An alternative organisation is to use a separate dictionary to store variable length data values, representing these in the tuple field by means of a token supplied from the dictionary. The full data item can be retrieved from the dictionary when required, for processing or output. This more space-efficient method is not adopted by most conventional systems because it requires multiple accesses to the dictionary for every tuple input, output, and update, and also for processing queries. Particularly where queries involve fields with large numbers of different tokens, such look-ups are prohibitively inefficient with a conventional disc-based dictionary, while RAM-based dictionaries have required too much space to be economic.
Processing of relational databases requires the location, matching and selection of tuples and their combination by a defined set of operations—the relational operations. Typically these require the selection of tuples with some specified field values, or ranges of values, while other fields may be unspecified.
Indexes are frequently employed, particularly to give rapid access to fields frequently specified in queries. Indexes thus speed access when data is to be retrieved, but have a cost in terms of the space they occupy, and in the time and processing required in their construction as data is added or updated.
Another consideration has been the volatility of information held in high-speed RAM storage. As security of the data is paramount in database systems, conventional systems have not regarded data as ‘secure’ until stored on non-volatile backing-store, i.e. the backing-store has been regarded as the normal location of data, which is only brought into high-speed store transiently for processing or update. Though secure, this represents a time overhead on processing, which compounds the difficulties of ensuring the isolation of one transaction from another. Locking has had to be of fine granularity, and is a heavy overhead.
The conventional system is thus constrained to use slow disc-based storage (because of the large cost advantage of disc technology over more rapid RAM or other solid-state technologies), is forced to use that storage in a non-compact way to avoid the overload of dictionary look-up, is forced to construct maintain and store indexes to speed access performance, and is forced into a costly locking scheme to maintain consistency. The net result is that conventional systems require large volumes of disc-based storage and their operation is very slow (relative to RAM-based computation operations).
While the slow speed of operation of the disc-based conventional system can in principle be overcome by replacing the discs by much faster electronic technology, this has so far been economically unattractive (except for special purpose small, high-speed applications) because of the ×10 to ×100 higher price of RAM.
As already outlined above, the main problem with disc based databases is the slow speeds at which the disc can be accessed. For example, using the example of the employee database given above, if it is required to extract information on an employee named “John Smith” from the database it is necessary to retrieve tuples having the data entry “John Smith” in their name field. Such a search may involve a large number of individual read and comparison operations for each individual search.
A number of techniques have been employed by database designers to increase access speeds to disc resident databases. One technique in common usage is that known as ‘hashing’. Hashing requires the original data to be stored in a less compact form, or alternatively with additional indices known as hash access vectors. As indicated above, disc storage space is considered to be relatively cheap and the increase in the size of the database produced by hashing is generally thought to be an acceptable trade-off for the resulting increase in access speed. Hashing is an example of the need to store the original data in a non-compact form.
The foregoing descriptions illustrate the inherent conflict between the aim of compressing a database whilst simultaneously increasing access speeds.
SUMMARY OF INVENTION
It is a first object of the present invention to provide a method of constructing a database which optimises data compression ratio and access speed.
It is a second object of the present invention to provide a database construction method which optimises the database creation rate and minimises the amount of memory space required during creation.
It is a third object of the present invention to provide a database structure which minimises the requirements for memory reorganisation operations when both creating and modifying the database.
According to a first aspect of the present invention there is provided a method of constructing a computer database for storing in compressed form information comprising a plurality of tuples, wherein each tuple comprises a data value in
Cockshott William Paul
McGregor Douglas Robert
Wilson John Nugent
Alston & Bird LLP
Amsbury Wayne
University of Strathclyde
Wang Mary
LandOfFree
Databases does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Databases, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Databases will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2515689