Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-06-24
2001-04-10
Vu, Kim (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C704S231000, C704S251000
Reexamination Certificate
active
06216123
ABSTRACT:
FIELD OF THE INVENTION
This invention relates generally to text indexing, and particularly to a full text index which quickly and efficiently processes search words.
BACKGROUND OF THE INVENTION
The ability to locate relevant documents from a large pool of documents is becoming increasingly desirable. Programs which provide this capability are commonly known as search engines. Search engines typically process a pool of documents and build an index of words. A user can enter a search request, or query, seeking a list of documents that contain certain words. The search engine processes the index and returns a list of documents that satisfy the request. Search engines are used frequently to determine which web sites on the Internet contain relevant content. Search engines are also used to access information from intranets, file servers, and databases. Given the vast amount of data that is electronically available, search engines are becoming increasingly important as a mechanism for finding relevant documents from a large pool of documents.
Efficiency is very important in search engine technology. While inefficient indexing and/or search processing may not be noticeable when the relevant pool of documents is relatively small, inefficiency will quickly lead to excessive index and search processing times when the pool of documents is relatively large. Efficiency is also an important consideration for other aspects of full text indexes, such as processing complex queries, or processing natural language queries. A search engine typically implements natural language searching by breaking a search request into multiple sub-queries. Consequently, if the searching algorithm is inefficient, response time can be seriously degraded.
A search request typically takes the form of one or more words separated by one or more operators, such as AND, OR, or NOT, and proximity restrictions, such as word A within 10 words of word B. The search engine determines which documents satisfy the request, and returns a list of such documents.
A large number of documents may fulfill the search request where the pool, or set, of indexed documents is large. To help the user determine which documents will most likely contain relevant content, many search engines provide a ‘relevance’ ranking for each document that fulfills the search request. The relevance ranking is an estimation provided by the search engine of the importance of the document in view of the particular search request. The ability to rank and present documents to a user in order of their relevance is becoming increasingly important to minimize the time a user must spend in determining which of the many documents that fulfill the search request are, in fact, relevant. Ranking documents by relevance adds additional complexity to the search engine, and presents another potential efficiency consideration. Ideally, the relevance determination will not add significantly to the overall response time of the search engine.
One of the best mechanisms for increasing the efficiency of a search engine is to minimize peripheral input/output (I/O) operations, and in-memory table accesses. A full text index is typically made up of several tables of information, including cross-reference information, and during a search request, many different tables are accessed to make pertinent determinations, including determining in what document a word is located. Full text indexes can be very large, and can take up hundreds of megabytes or more of space. Because of its size, an entire full text index typically will not fit in the memory of a computer, so a table index access will likely result in at least one I/O operation to disk, and, depending on the access methodology, can result in multiple I/Os. An I/O operation is an extremely time-consuming process. Moreover, a single search request may require hundreds of thousands of table accesses, depending on the commonality of the word. Since eliminating or reducing I/O operations can significantly reduce response time, it is beneficial to reduce table accesses. One mechanism for reducing table accesses would be to store word information in a manner that the information itself allows for document level determinations to be made without the need to access a separate document level table.
SUMMARY OF THE INVENTION
It is one object of the present invention to provide a method and system for quickly and efficiently processing search requests against a full text index.
It is another object of the present invention to provide a method and system for generating a full text index that enables very efficient document level access.
It is yet another object of the present invention to provide a method and system that enables document level determinations to be made from word level information.
It is yet another object of the present invention to provide a method and system for accessing a full text index that efficiently and quickly combines results of multiple sub-queries.
It is still another object of the present invention to provide a full text index that substantially reduces table access during search query processing.
It is still another object of the present invention to provide a method and system for making document relevance determinations quickly and efficiently.
Additional objects, advantages and novel features of the invention will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims. To achieve the foregoing and other objects in accordance with the present invention, the method of the present invention includes generating a full text index of a plurality of documents by associating a group of word numbers with the words of a document. Each document has an initial word number and a final word number associated with it. The group of word numbers has a predetermined group size that is greater than one, and the initial word number associated with any document is at least the predetermined group size distance from the initial word number associated with any other document. The full text index includes a word list having a plurality of word entries. Each word entry includes the word, and a list of word numbers identifying where in the documents the word is located. A cross-reference table is generated that contains an entry for each group of word numbers. Each entry contains a reference to the document with which the group of word numbers has been associated.
According to the present invention, the words in a search request are converted to respective word numbers. The full text index is accessed to obtain lists of the word numbers associated with the words of the search request. The full text index of the present invention greatly simplifies and reduces the overhead involved in the determination of whether multiple search words exist in the same document. This simplification is achieved, in part, through the use of a predetermined group size, wherein word numbers that exist in the same group are necessarily in the same document. An efficient hash algorithm can be used to determine if word numbers are members of the same group. A hashing algorithm eliminates the need to conduct a binary search through multiple tables. Thus, the use of a hash greatly reduces the need to access tables to determine document level information, greatly reducing the system overhead used by conventional systems to determine whether two words exist in the same document.
The cross-reference table maintains the relationship between documents and groups of word numbers. Preferably a cross-reference entry exists for each group of word numbers. If the hash algorithm indicates that two word numbers are from different groups, the cross-reference table can be accessed to determine if the two different groups of word numbers were assigned to the same document. The cross-reference table can also be
Millett Ronald P.
Pratt John P.
Robertson David O.
Tietjen Bruce
Dinsmore & Shohl
Ly Anh
Novell Inc.
Vu Kim
LandOfFree
Method and system for rapid retrieval in a full text... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and system for rapid retrieval in a full text..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and system for rapid retrieval in a full text... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2509894