Method for keyword proximity searching in a document database

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000

Reexamination Certificate

active

06681219

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to a method for keyword proximity searching in a document database, and more particularly, the present invention optimizes CPU-time and reduces I/O operations associated with keyword proximity searching of a document database.
BACKGROUND OF THE INVENTION
Keyword proximity searching is an important tool for browsing and selecting relevant documents from a large document database. Proximity searching deals with the location and the distance relationship among keywords in a document. For example, the following is a typical keyword proximity search query: given a set of n keywords, find the documents that have a collection of keywords that are within r words from each other. More generally, keyword proximity search may have the form: find the documents in the database that contain all the n search keywords appearing in the same paragraph, page, or within r words, paragraphs, or pages, from each other by implementing a plane sweep search.
Algorithms for answering proximity search queries for a collection of point objects are covered extensively in the literature. Many of the algorithms described in the literature cover the topic from a computational geometry perspective, where the main interest is to develop algorithms that optimize the execution time (i.e., CPU-time) in terms of the input parameters of the problem. On the other hand, when large collections of objects are involved the disk-I/O cost is significant. As a result, from a database perspective, in addition to optimizing CPU-time, the interest is also in reducing the disk-I/O cost for answering such queries.
SUMMARY OF THE INVENTION
The present invention presents a new algorithm for answering keyword proximity search where the words represent objects that are points in the one-dimensional space. The algorithm takes into consideration the fact that the number of words in a document database is typically very large. Hence, it tries to reduce the overhead induced from multiple-retrieval of objects from auxiliary storage. In contrast to other algorithms in the literature, the performance of this new algorithm is optimal. As will be explain below, the overall performance of the algorithm is proportional to the total sizes of the input inverted lists and the size of the reported output tuples.
Assume a given collection of n keywords c
1
. . . c
n
, where each keyword has associated with it a list of instances. Each instance in the list contains the location of the keyword in the document database. For example, let the keyword c
1
be “dog”, c
2
be “cat”, and c
2
be “rat” (i.e., n=3), and the instances correspond to the occurrences of the keywords in various locations of the documents in the database. For example, the instances of the keyword “cat” are <doc1, 15>, <doc1, 123>, <doc2, 25>, and <doc7, 115>which implies that the keyword “cat” appears in document doc1 at locations 15 and 123, in document doc2 at location 25, and in document doc7 at location 115, where doc1, doc2, and doc7 are unique identifies for documents 1, 2, and 7. It should also be noted that page information can be added as part of the location of the word in the document. This helps in extending the proximity search to refer to pages (e.g., find the documents that have the keywords dog and cat within the same page).
The instances of a given keyword are stored in a linear list, where each entry in the list identifies the relative location of the keyword instance as well as an identifier to retrieve the rest of the keyword's description. This is also referred to as an inverted list. Keyword instances are stored in inverted lists so that a given keyword can easily be found in one or more documents. Assume that the entries in a given list (that corresponds to the locations of instances of a given keyword) are sorted according to their location. This is a reasonable assumption if we consider that the lists are constructed by traversing the document and inserting, for each keyword that appears in the documents, the entry (doc-id,pos) at the end of the list for the keyword. As an example, the following is possible entry in the inverted list for the keyword “complex” which appears in documents doc
1
, doc
2
, and doc
4
: [“complex”]—(doc
1
, 6), (doc
1
, 20), (doc
1
, 50), (doc
2
, 30), (doc
4
, 100), (doc
4
, 130). In this case, the list will be sorted in ascending order of the positions of the keyword in the database. For the purposes of the present invention it does not matter whether the elements are sorted in either ascending or descending order.
In contrast to a conventional nested loop searching algorithm, the goal of the present invention is to reduce the number of times the lists are traversed. The keyword proximity search algorithm of the present invention is based on a plane-sweep approach and performs like a merge join algorithm while trying to take advantage of the nature of the proximity condition. This is especially important for large documents and large document databases where the number of instances in each keyword entry is large and the lists of instances cannot be assumed to fit in main memory.
Accordingly, a proximity search algorithm is proposed for answering proximity search queries in a document database. The present invention alternates between two modes of operation to find the matching output tuples. A plane-sweep mode is used to efficiently search the inverted lists, thereby excluding tuples that do not contribute to the output. Once an output tuple satisfying the proximity search query is detected, the plane-sweep mode is terminated. Because of the nature of the proximity search query, some tuples that are in the proximity of the output tuple also satisfy the proximity condition. Hence, the algorithm switches to a nested-loop mode. The nested-loop mode performs a local nested loop join to enumerate all possible combinations of the keyword instances that satisfy the proximity search query in the neighborhood of the output tuple detected by plane-sweep mode. Upon enumerating all of these output tuples, the algorithm switches back to the plane-sweep mode.
For a more complete understanding of the invention, its objects and advantages refer to the following specification and to the accompanying drawings.


REFERENCES:
patent: 5499360 (1996-03-01), Barbara et al.
patent: 5657450 (1997-08-01), Rao et al.
patent: 6212517 (2001-04-01), Sato et al.
patent: 6377945 (2002-04-01), Risvik

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method for keyword proximity searching in a document database does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method for keyword proximity searching in a document database, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method for keyword proximity searching in a document database will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3206845

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.