Dynamic indexing information retrieval or filtering system

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C711S100000, C711S170000, C711S171000, C711S173000

active

06687687

ABSTRACT:

TECHNICAL FIELD
The present invention relates generally to information retrieval or filtering systems and more particularly to methods for dynamically indexing words contained in a set of documents in an information retrieval or filtering system.
BACKGROUND OF THE INVENTION
Information retrieval or filtering systems generally employ an index file that indexes information stored in a database. The index file is used to locate information in the database. The index file contains reference information for respective words, where for each word, the reference information points to occurrences of the word in documents stored in the database. The reference information for a word is also referred to as the “postings” of the word.
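The relationship between words and their postings can be illustrated with a minimal sketch. It assumes that a posting is a (document id, word position) pair and that documents are split on whitespace; neither detail is specified in the text above, and the name `build_index` is hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to its postings: a list of (doc_id, position) pairs.

    Assumption: a posting records the document id and the word offset
    within that document. The source only says that postings "point to
    occurrences of the word".
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)


documents = {1: "dynamic index file", 2: "static index"}
index = build_index(documents)
```

Looking up `index["index"]` then yields every occurrence of the word across both documents, which is the lookup the index file serves.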
Most indexing techniques are “static” because the indexing employed in such techniques is performed in two phases. In the first phase of the indexing, input files are usually read to build some temporary internal files. In the second phase of the indexing, the temporary internal files are optimized to prepare for retrieval. Hence, the indices are static once the optimization is complete. That is, it is impossible to add new documents without rebuilding the whole index. Queries for the retrieval of documents cannot be completed until the second phase of the indexing is performed.
Dynamic indexing techniques have been introduced to overcome the limitations of static indexing techniques. Indexes are accumulated in an index file that can be consulted for retrieval queries at any time, without first optimizing an internal file. In the conventional dynamic indexing technique, the index file is organized into a set of fixed-length blocks in which postings for words are stored. Each block packs the postings for several words together, leaving more or less free space. An address record table is kept to store the block number for each posting, and a free block list is kept to store information about blocks containing a sufficient amount of free space (see “Managing Gigabytes: Compressing and Indexing Documents and Images,” by I. Witten, A. Moffat and T. Bell).
In such conventional indexing techniques, it takes a long time to update the index file when new documents are added to the database and the collection of indexes grows larger. In addition, the amount of free space needed each time the postings for words are updated is unpredictable.
SUMMARY OF THE INVENTION
The present invention provides information retrieval or filtering systems for dynamically indexing a set of documents in a database. In particular, the present invention provides methods for indexing a set of documents in a single phase. Information retrieval or filtering systems of the present invention are able to respond to retrieval queries without generating and optimizing internal files.
A single phase indexing technique of the present invention enables the database to be queried at all times. The present invention provides information retrieval or filtering systems that respond to retrieval queries while in the process of indexing. In order to allow retrieval at all times, the present invention stores postings for a word sequentially in memory so that the postings can be retrieved from the memory with a minimum number of input/output (I/O) operations.
The present invention allows information retrieval or filtering systems to incrementally store and update the postings for a word while keeping the postings for each word sequential in memory. When a new document containing many words is inserted into a database, the present invention provides information retrieval or filtering systems in which the postings for all of these words are expanded by “multipoint insertion” rather than by a simple append operation.
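Why inserting one document touches many postings lists at once can be shown with a short in-memory sketch. It is an illustration only: the function name `add_document`, the (document id, position) posting format, and the use of Python lists to stand in for sequential storage are all assumptions, not the patented method.

```python
def add_document(index, doc_id, text):
    """Expand the postings of every word in one new document.

    Each word keeps its own contiguous postings list, so a single
    document insertion grows many lists at once (the "multipoint"
    aspect) instead of appending to one global log.
    Returns the set of words whose postings grew.
    """
    touched = set()
    for position, word in enumerate(text.lower().split()):
        index.setdefault(word, []).append((doc_id, position))
        touched.add(word)
    return touched


index = {}
add_document(index, 1, "dynamic index")
# The index can be queried here, before any further documents arrive.
first_answer = list(index["index"])
touched = add_document(index, 2, "index file index")
```

The query between the two insertions reflects the single-phase idea: no separate optimization step is needed before postings can be read.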
In accordance with one aspect of the present invention, a method for allocating the blocks of an index file to the postings for words found in documents of a database is provided. The index file is provided with an initial block of a predetermined size, and the block is partitioned into blocks of successively decreasing size. A block is divided into n blocks of the next successive level. The blocks in each successive level have the same size. The sum of the sizes of the blocks in each successive level equals the size of the initial block. An information retrieval interface allocates to the postings for a first word a free block in the level that most closely matches the size of the postings for the first word in the index file. The free block is large enough to hold the postings for the first word in the index file.
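The level geometry described above can be worked through with a small sketch. It assumes that the initial block size is an exact power of n, that a fixed deepest level `max_level` exists, and that the "closest matching level" means the deepest level whose blocks are still large enough; the function names are hypothetical.

```python
def block_size(initial_size, n, level):
    """Size of one block at the given level (level 0 is the initial block)."""
    return initial_size // n ** level


def closest_level(initial_size, n, postings_size, max_level):
    """Deepest level whose block size can still hold the postings.

    Assumption: "closest matching" is read as the smallest block that
    is large enough, bounded by a hypothetical max_level.
    """
    if postings_size > initial_size:
        raise ValueError("postings do not fit in the initial block")
    level = 0
    while (level < max_level
           and block_size(initial_size, n, level + 1) >= postings_size):
        level += 1
    return level


# With a 1024-unit initial block split in two at each level, the block
# sizes are 1024, 512, 256, 128, ... and each level sums to 1024.
level = closest_level(1024, 2, 100, max_level=6)
```

Here postings of size 100 land in level 3, whose blocks hold 128 units, because the 64-unit blocks of level 4 would be too small.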
In accordance with another aspect of the present invention, a method for updating postings for words in an index file is provided. An information retrieval interface allocates blocks of the index file to the postings for words contained in the index file. The blocks are partitioned into levels of successively smaller blocks. The information retrieval interface updates postings for a word in a first block of the index file. The updated postings contain additional postings for the word in added documents of the database. The information retrieval interface searches a free block list for a second block that is free and can accommodate the updated postings for the word. The free block list contains information about whether or not a block is free. The information retrieval interface moves the postings for the word from the first block to the second block.
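The update-and-move step can be sketched as follows. This is a simplified illustration under several assumptions: block capacity is counted in postings, the free block list is a dictionary of booleans, the smallest fitting free block is preferred, and the postings stay in place when the first block still has room. The names `update_postings`, `block_sizes`, and `address` are hypothetical.

```python
def update_postings(index_file, free_list, block_sizes, address, word, added):
    """Add postings for a word, moving it to a larger free block if needed."""
    first = address[word]
    updated = index_file[first] + added
    if len(updated) <= block_sizes[first]:
        index_file[first] = updated  # still fits, no move required
        return first
    # Search the free block list for a free block that can hold the update.
    candidates = [block for block, is_free in free_list.items()
                  if is_free and block_sizes[block] >= len(updated)]
    if not candidates:
        raise MemoryError("no free block can hold the updated postings")
    second = min(candidates, key=lambda block: block_sizes[block])
    # Move the postings from the first block to the second block.
    index_file[second] = updated
    index_file[first] = []
    free_list[second] = False
    free_list[first] = True
    address[word] = second
    return second


block_sizes = {0: 2, 1: 2, 2: 4, 3: 8}   # capacity of each block, in postings
free_list = {0: False, 1: True, 2: True, 3: True}
index_file = {0: [(1, 0), (1, 5)], 1: [], 2: [], 3: []}
address = {"index": 0}
moved_to = update_postings(index_file, free_list, block_sizes, address,
                           "index", [(2, 3)])
```

After the call, the three postings for "index" sit together in block 2 and block 0 is returned to the free list, so the postings remain sequential.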
In accordance with a further aspect of the present invention, a method for allocating an index file containing postings for words found in documents of a database is provided. The index file is provided with blocks that are partitioned into levels of successively smaller blocks. The blocks in each successive level have the same size. The size of the postings for a word in the index file is calculated to determine the level in the block structure whose block size is closest to the size of the postings. A free block that can hold the postings for the word is searched for within that level, using a free block list containing information about the free blocks of the level. The free block in the level is allocated to the postings for the word.
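One way to combine the level calculation with a per-level free block list is sketched below. Splitting a larger free block when the target level has none free is borrowed from buddy-style allocators and is an assumption, not something the text above states; the class name `LeveledAllocator` and its parameters are hypothetical.

```python
class LeveledAllocator:
    """Per-level free lists over one initial block (illustrative sketch)."""

    def __init__(self, initial_size, n, max_level):
        self.initial_size = initial_size
        self.n = n
        self.max_level = max_level
        # Free block offsets per level; at first only the initial block is free.
        self.free = {level: [] for level in range(max_level + 1)}
        self.free[0].append(0)

    def block_size(self, level):
        return self.initial_size // self.n ** level

    def level_for(self, postings_size):
        """Deepest level whose blocks can still hold the postings."""
        if postings_size > self.initial_size:
            raise ValueError("postings do not fit in the initial block")
        level = 0
        while (level < self.max_level
               and self.block_size(level + 1) >= postings_size):
            level += 1
        return level

    def allocate(self, postings_size):
        """Return (level, offset) of the block given to the postings."""
        level = self.level_for(postings_size)
        source = level
        while source >= 0 and not self.free[source]:
            source -= 1  # assumption: fall back to a larger free block
        if source < 0:
            raise MemoryError("no free block large enough")
        offset = self.free[source].pop()
        while source < level:
            # Split: keep the first child, free the remaining n - 1 children.
            source += 1
            size = self.block_size(source)
            for i in range(self.n - 1, 0, -1):
                self.free[source].append(offset + i * size)
        return level, offset


allocator = LeveledAllocator(initial_size=1024, n=2, max_level=3)
first = allocator.allocate(100)
second = allocator.allocate(100)
third = allocator.allocate(200)
```

In this run the two 100-unit postings receive adjacent 128-unit blocks at level 3, and the 200-unit postings receive a 256-unit block at level 2.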
A single phase indexing technique of the present invention makes it possible to construct static databases in multiple batches and develop dynamic systems such as information filtering systems. A single phase indexing technique of the present invention enables information retrieval or filtering systems to respond to retrieval queries while in the process of indexing without reorganizing internal files. The present invention supports a collection of dynamically changing variable-length postings for words.


REFERENCES:
patent: 4991087 (1991-02-01), Burkowski et al.
patent: 5704060 (1997-12-01), Del Monte
patent: 5784699 (1998-07-01), McMahon et al.
patent: 5913209 (1999-06-01), Millett
patent: 6374340 (2002-04-01), Lanus et al.
Witten et al., “Indexing and Compressing Full Text Databases for CD-ROM”, Dec. 1990, Journal of Information Science.
