Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2001-10-10
2004-01-13
Vu, Kim (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06678687
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a method for creating an index fast for a full-text search through documents and a method for implementing fast searches by use of the index, and more particularly to the structure of the index.
2. Description of the Prior Art
A method using a data structure called a signature file is available for a full-text search through large volumes of documents. According to the method disclosed in Japanese Published Unexamined Patent Application No. Hei 7-244671, an index is formed to represent the occurrence of characters in documents by bits. This method allows relatively fast searches independently of the number of stored documents.
However, since one bit is assigned to several different words, there is a possibility that a document containing words not specified is searched, posing a problem that a correct search cannot be performed. Also, the complex algorithm of index creation and search has made it difficult to implement fast full-text searches on existing database management systems.
As a solution to such a problem, in the literature “Compression and Fast Indexing for Multi-Gigabyte Text Database”, a method is proposed that implements the functions of a fast full-text search by using indexing methods such as a hash table and B+ tree, as offered by general database management systems. According to this method, an identification number is assigned to a word used as a key and a document used as a value of an index, respectively, and they are compressed before being stored.
This reduces the number of disk pages to be read that are required for searches, enabling fast searches. Also, since different identification numbers are assigned to different words, searches can be performed correctly. This literature shows that about seven hundred thousands of documents can be searched fast.
However, the method described in the above-mentioned literature has a problem that high performance cannot be obtained because the performance of index creation and update processing is not taken into account. Particularly, confirmation processing for avoiding the duplicate registration of an identical document for a certain word during updating cannot be implemented efficiently because it must be performed by repetitive processing on a collection of document identification numbers.
Also, when there are several millions of target documents as in a full-text search through Internet's WWW pages, addition to the index and search processing cannot be performed efficiently even by the method described in the above-mentioned literature because a B+ tree becomes as large as tens of gigabytes.
There are two reasons for this. First, in order to add or search for a set of a word and a document with respect to the B+ tree, a hard disk must be accessed as many times as the height of the tree plus one. Second, if the B+ tree becomes huge, the effect of caching disk contents to a memory would not be obtained and almost all accesses to the hard disk would involve actual reading of data from the hard disk.
For example, if the overall size of a B+ tree is 10 GB, the size of each node of the B+ tree is 8 KB, and the number of branches of each node is 500, then the height (=log 500 (10 GB/8 KB)) of the B+ tree minus 1 is 2.16, indicating that the disk must be accessed two to three times on the average to add or search for a set of a word and a document. Since it takes several tens of milliseconds to several hundreds of milliseconds to access a hard disk once, it would take from 0.1 to 1 second to add or search for a set of a word and a document. Accordingly, if 100 different words are contained in one document, 10 to 100 seconds would be required to register one document.
For these reasons, as the number of data items stored in a B+ tree increases, the B+ tree becomes taller and larger and the number of pages to be accessed increases, so that time required for storage and search becomes longer. For example, as shown in
FIG. 17
, which shows a trend of change of storage time, storage time increases significantly as the number of documents stored increases.
SUMMARY OF THE INVENTION
The present invention has been made in consideration of the above-mentioned conventional circumstances, and it is an object of the present invention to offer a method for creating fast an index for a full-text search through a huge number of documents, for example.
Another object of the present invention is to offer a method for implementing fast searches by using an index thus created.
Specifically, the present invention uses B+ tree index keys with a word identification number followed by a document identification number, whereby the occurrence of a certain word in a given document is made possible by one search through the B+ tree index.
In this case, however, if the index is to be managed by a single B+ tree, the B+ tree would become huge when large volumes of documents are stored, so that, in an attempt to add one document, as many pages as different words contained in the document might have to be updated in the worst case.
To avoid this, the B+ tree index is split to a plurality of subindexes by hash values obtained by applying a certain hash function to word identification numbers and hash values obtained by applying another hash function to document identification numbers and the subindexes are placed in a two-dimensional array. By collectively performing registration for the same hash values of document identification numbers during creation or updating of an index, the number of pages to write to has been reduced and processing efficiency has been improved.
Also, an AND or OR search of a plurality of words is performed in such a way that groups are formed by the hash values of word identification numbers and the B+ trees grouped by the hash values of document identification numbers are collectively searched on a group basis, whereby a hit ratio of page cache during page reading has been improved, thereby achieving higher processing efficiency.
That is, this invention provides a method for creating an index in which keys (e.g., words) and values (e.g., one document containing the words) are associated to search for a value from a specified key, wherein the index to register sets of keys and values are constituted, for example, of a plurality of subindexes of B+ tree structure, each of which is stored in a two-dimensional array position referenced by a value determined by applying a predetermined function to a value to be registered and a value determined by applying a predetermined function to a key.
More specifically, the present invention assigns a document identification number and a word identification number to a document and a word, respectively, to uniquely identify them, provides, as a function to apply to documents, a hash function that maps a document identification number to a value indicating the position of one direction of a two-dimensional array, and as a function to apply to words, a hash function that maps a word identification number to a value indicating the position of another direction of the two-dimensional array, and registers the occurrence of a word in a document in a corresponding subindex by using values obtained by applying the hash functions to the document identification number and the word identification number, respectively.
Also, the present invention assigns a word identification number to a word to uniquely identify it, provides, as a function to apply to the occurrence of word, a hash function that maps the occurrence count or frequency of the word in a document to a value indicating the position of one direction of a two-dimensional array, and as a function to apply to words, a hash function that maps a word identification number to a value indicating the position of another direction of the two-dimensional array, and registers the occurrence of a word in a document in a corresponding subindex by using values obtained by appl
Hayata Hiroshi
Watanabe Yoshiki
Fuji 'Xerox Co., Ltd.
Oliff & Berridg,e PLC
To Baoquoc N
Vu Kim
LandOfFree
Method for creating an index and method for searching an index does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method for creating an index and method for searching an index, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method for creating an index and method for searching an index will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3238351