Inverted index storage structure using subindexes and large...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000, C707S793000, C707S793000, C707S793000, C707S793000

Reexamination Certificate

active

06349308

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a new inverted index storage structure for dynamic document databases that can maintain document identifier (docID) order in the posting list upon frequent insertion, deletion, and modification of documents and that enables fast search for the posting of a specific document from a long posting list.
The invented storage structure uses large objects and subindexes. A large object stores a database object whose size exceeds one disk page and dynamically adjusts the storage space upon insertion and deletion. A large object typically employs a tree structure for dynamic allocation of space. We store each posting list in a large object and create a subindex on each posting list. A large object manages variable storage space of its posting list upon document addition and deletion. A subindex having docID as the key is used for docID-order preservation of a posting list upon document addition and for removal of a posting from a posting list upon document deletion.
2. Description of the Prior Art
A document can have several attributes such as the document number, modification date, authors, abstract, and body. Some of these attributes have a structured data type such as integer and character string, while others have an unstructured data type such as text, image, audio, and video. Document retrieval based on the attributes of the text data type is traditionally well performed by an information retrieval system, while document retrieval based on the attributes of structured data types by a database management system. Thus, for document retrieval based on the attributes of both text and structured data types, tight coupling of the database management system and information retrieval is necessary. In addition, the database management system can provide consistent management of documents and their indexes upon document addition, deletion, and modification.
Inverted index is an index used for information retrieval, which retrieves documents containing keywords given by a user. Inverted index keeps a list of postings for each keyword extracted from documents. A posting of a keyword consists of a document identifier (docID) and a list of locations in the document where the keyword occurs[Frakes, W. B. and Baeza-Yates, R.,
Information Retrieval—Data Structures
&
Algorithms
, Prentice Hall, Englewood Cliffs, N.J., 1992.; Salton G.,
Automatic Text Processing—The Transformation, Analysis,
and
Retrieval of Information by Computer
, Addison-Wesley Press, 1988.]. The number of postings of a keyword, i.e., the number of documents containing the keyword, is called the document frequency, and the number of occurrences of a keyword in a document is called the term frequency. The document frequency indicates the length of the posting list of a keyword, and the term frequency the length of a posting. Information retrieval using inverted index is done by retrieving the posting list of the keyword given by a user. Thus, inverted index needs a storage structure to find the posting list of a keyword fast.
The storage space for the posting list increases when new postings of an added document are inserted into the posting list, while it decreases when old postings of a deleted document are removed from the posting list. Document modification also affects the length of the posting list because the old document is deleted and the modified document is added. Thus, inverted index needs a storage structure for managing variable storage space of the posting list efficiently.
If posting lists are kept in the order of docIDs, the boolean expression containing more than one keyword can be efficiently processed by sequentially scanning and merging the posting lists of the keywords only once. The query “find the documents about multimedia and database,” for example, contains two keywords of “multimedia” and “database,” and is processed by sequentially scanning and merging the posting lists of “multimedia” and “database.”
In document databases where document addition or modification is frequent, inverted index is dynamically updated on adding or modifying documents. On updating the inverted index, new postings of the added document should be inserted into the posting list in the order of docIDs. Thus, inverted index needs a storage structure to insert new postings efficiently into the posting list in the order of docIDs.
Document deletion is frequent because modifying a document requires deleting the old document before adding the modified document. If a document is deleted from a document database, all the postings of the deleted document should also be removed from the inverted index. Thus, the inverted index needs the following two storage structures: one for finding all the posting lists where the postings of the deleted document are included; the other for searching the posting list to find the posting of the deleted document. Because many keywords having long posting lists are contained in most documents, searching a long posting list or a specific posting is an important part of the performance in document deletion.
The simplest method for storing an inverted index is to store a table of records consisting of a keyword and a posting in a database. This method, however, is known to have low query performance and to require excessive storage space due to redundancy of keywords [Stonebraker, M., “Document Processing in a Relational Database System,”
ACM Trans. on Office Information Systems,
Vol. 1, No. 1, pp. 143-158, 1983; DeFazio, S., Daoud, A., Smith, L. A., and Srinivasan, J., “Integrating IR and RDBMS Using Cooperative Indexing,”
In Proc. Int'l Conf. on Information Retrieval,
ACM SIGIR, pp. 84-92, Seattle, 1995.].
Therefore, a lot of studies have been done on the method using tree structures instead of database tables for storing inverted indexes.
FIG. 1
shows such a conventional inverted index storage structure. The reference numeral
10
shows a B+-tree having the keywords as the key. A pointer to a posting list is stored at the pointer field of the index entry in the leaf node of the tree. The reference numeral
11
shows the storage space for the posting list. Samuel DeFazio et al. [DeFazio, S., Daoud, A., Smith, L. A., and Srinivasan, J., “Integrating IR and RDBMS Using Cooperative Indexing,” In
Proc. Int'l Conf. on Information Retrieval,
ACM SIGIR, pp. 84-92, Seattle, 1995.] argued through experiments that this storage structure out performs the database table method in query performance and storage space. Doug Cutting [Cutting, D. and Pedersen, J., “Optimizations for Dynamic Inverted Index Maintenance,” In
Proc. Int'l Conf. on Information Retrieval,
ACM SIGIR, pp. 405-411, Brussels, 1990.], Christos Faloutsos [Faloutsos, C. and Jagadish, H. V., “On B-tree Indices for Skewed Distribution,” In
Proc. Int'l Conf. on Very Large Data Bases,
pp. 363-374, Vancouver, 1992.], and Anthony Tomasic [Tomasic, A., Garcia-Molina, H., and Shoens, K., “Incremental Updates of Inverted Lists for Text Document Retrieval,” In
Proc. Int'l Conf. on Management of Data,
ACM SIGMOD, pp. 289-300, Minneapolis, 1994.] studied the problem of storage space allocation for the reference numeral
11
when the posting list continuously increases due to document addition.
The conventional storage structures for the inverted index, however, have the following drawbacks: 1) They do not consider maintaining the docID-order in the posting list when inserting a new posting into the posting list upon document addition. Thus, boolean expressions involving more than one keyword are not evaluated efficiently. 2) They consider only adding but not removing the postings from the posting lists. Thus, document deletion is not properly handled. 3) They do not provide fast access to the posting for a specific document in a posting list. They simply assume sequential search, which requires reading half of the posting list on the average.
SUMMARY OF THE INVENTION
T

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Inverted index storage structure using subindexes and large... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Inverted index storage structure using subindexes and large..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Inverted index storage structure using subindexes and large... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2939453

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.