Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-05-09
2003-06-17
Choules, Jack (Department: 2177)
C707S793000
active
06581057
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to computer-assisted information storage and retrieval and, more particularly, to producing document summaries and document browsing aids.
2. Description of the Prior Art
As part of search results corresponding to a user query, for example, in an information retrieval system, a query-biased summary generation system provides a document summary that incorporates sentences, sentence fragments, or text spans that are relevant to the user query. The full text of the document must be available in order to create the query-biased summary. Usually, the summary includes the sentences having the greatest number of user query terms that appear most frequently. The summary can also include sentences that are closely related to the query by incorporating synonyms of the query terms into the criteria for the selection of the included sentences. With the current state of the art, the generation of a query-biased summary requires significant processing time.
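The sentence-selection idea described above can be illustrated with a minimal sketch. It assumes a naive punctuation-based sentence splitter and a plain count of query-term occurrences as the score; the function names (split_sentences, query_biased_summary) and the max_sentences parameter are hypothetical and are not taken from any prior-art system. Synonym expansion, which the text also mentions, is left out.

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def query_biased_summary(text, query, max_sentences=2):
    """Pick the sentences that contain the most query-term occurrences."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for position, sentence in enumerate(split_sentences(text)):
        words = re.findall(r"\w+", sentence.lower())
        score = sum(1 for word in words if word in query_terms)
        if score > 0:
            scored.append((score, position, sentence))
    # Highest score first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    # Present the chosen sentences in their original document order.
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]


if __name__ == "__main__":
    document = "Cats sleep a lot. Dogs chase cats in the park. The weather is mild."
    print(query_biased_summary(document, "cats park", max_sentences=1))
```

Note that this sketch needs the full text of the document at query time, which is the constraint the background section goes on to discuss.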
Current information retrieval systems and information management systems, such as web catalogs, search engines, and document indexes, do not use query-biased summaries and do not provide topical document summaries that are relevant to the user query. Instead, such systems typically present the first few sentences of a document as an indication of the content of that document. These first few sentences may be extracted from the document and stored as a summary of that document for later use in response to a user query. While this technique works well with news stories that use the inverted pyramid style of writing, where the most important facts are mentioned toward the beginning of the article, it does not work well with other text genres that typically do not use the inverted pyramid style.
As a result of the current state of the art, after results pages for a user query are displayed, the user may have to undertake the laborious process of visiting each website listed on the search results pages to determine whether the document listed is relevant. Many users do not have the time or patience to do this. Moreover, a user who leaves the web catalog to examine the mentioned documents for relevancy is more likely to be distracted and not return to the catalog.
A web catalog's revenue generation is primarily dependent on advertisements and, more specifically, on the number of advertisement exposures per second. Since the web catalog generates revenue by exposing a user to advertising, the web catalog generates more revenue when the user remains on the web catalog site for as long as possible. Thus, when a user leaves the catalog to examine the mentioned documents for relevancy and does not return to the web catalog, potential advertising revenues are lost.
A study by Tombros and Sanderson recently showed that query-biased summaries allow users to decide whether a document is relevant without having to read the document. Tombros, Anastasios and Sanderson, Mark, “Advantages of Query Biased Summaries in Information Retrieval,” Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Aug. 24-28, 1998, pages 2-10. In this study, users of a typical web catalog output referred to the full text of a document 23.7% of the time. In contrast, with query-biased summaries, users referred to a document only 1.37% of the time. These results led to the conclusion that a query-biased summary provided users with enough clues to judge a document's relevance to the query without the need to read the document itself.
Current web search engines do not provide query-biased document summaries for several reasons. The main reason is that computation time is extremely limited. Since revenue generation is dependent on advertising exposures, response time to a query is critical. Generating query-biased summaries as part of the retrieval process would add enough of a delay to decrease the revenue throughput of the web catalog. The added delay might additionally cause some users to switch to a competitor's faster web catalog. Moreover, web catalogs answer tens of millions of queries per day, and adding a second or two of computation time per query might necessitate the purchase of additional equipment to handle the increased demands on the system.
Since the state of the art for query-biased summarization requires that the full text of the document be available, legal restrictions may prevent web catalogs from producing query-relevant summaries at search time. Current copyright law may restrict the ability of web catalogs to maintain a copy of the full text of a document. Today, practitioners in the field generally believe that copyright law permits web catalogs to store only short excerpts of a document, not the entire document itself. It is also generally believed that web catalogs may retrieve a document's full text in order to index it, so long as the full text is discarded after generation of the index.
Lastly, the size of the web was estimated at three terabytes in late 1998 and to be growing at a rate of approximately 35% per year. Storing the full text of every web page so that query-biased summaries can be generated at search time would require a great deal of disk space and may be cost-prohibitive.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide a method and apparatus for rapidly producing document summaries and document browsing aids which do not require storing the full text of the documents.
Accordingly, we have developed a method and apparatus for rapidly producing document summaries and document browsing aids by, at index creation time, precomputing and caching query relevant information required for creating the summaries.
In the specification and claims, the words “cache” and “caching” mean to store data for reuse. For example, a disk cache is random-access memory that stores information retrieved from disk, keeping the most frequently accessed data in memory. Use of a disk cache saves time since it takes less time to retrieve information from memory than from disk. In this application, the word “cache” is used in a similar sense meaning that a precomputed summary or summaries are stored to avoid the need to compute them later when needed in response to a query.
In the specification and claims, the word “term” means single words, word n-grams, and/or phrases. An “n-gram” is a string of characters that may comprise all or part of a word.
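As an illustration of the two kinds of n-gram named in this definition, the following minimal sketch produces word n-grams from a phrase and character n-grams from a single word. The helper names word_ngrams and char_ngrams are hypothetical, and lowercasing plus whitespace splitting is an assumed tokenization, not something the specification prescribes.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(word, n):
    """Return every run of n consecutive characters in the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    print(word_ngrams("query biased summary", 2))
    print(char_ngrams("cache", 3))
```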
The present invention avoids the problems in the current art by extracting topical information for each document at index time and caching either the key information required to generate the summary efficiently at search time or the topical summary. This substantially reduces the computation time and storage requirements and removes the necessity to retain entire documents for producing query-biased summaries. Thus, it becomes feasible for web catalogs and other information retrieval systems to provide topical summaries in the search results pages.
The present invention splits the summary generation process into two parts: one for index time and the other for search time. When the full text of a document is retrieved for indexing (or in an equivalent separate summarization process), the first part generates and stores query relevant information that will be used by the second part to produce or select summaries efficiently at search time.
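The two-part split can be sketched as follows. This is only one possible reading under stated assumptions, not the method the specification claims: the "query relevant information" is taken to be, for each of a document's most frequent terms, the single sentence in which that term occurs most often. The names build_summary_cache, summarize_at_search_time, max_terms, and the in-memory dict used as the cache are all hypothetical.

```python
import re
from collections import Counter


def _sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def _words(text):
    return re.findall(r"\w+", text.lower())


def build_summary_cache(doc_id, full_text, cache, max_terms=50):
    """Index time: may be slow. Caches one representative sentence per kept term."""
    doc_counts = Counter(_words(full_text))
    kept_terms = {term for term, _ in doc_counts.most_common(max_terms)}
    best = {}  # term -> (occurrences in sentence, sentence)
    for sentence in _sentences(full_text):
        for term, count in Counter(_words(sentence)).items():
            if term in kept_terms and (term not in best or count > best[term][0]):
                best[term] = (count, sentence)
    # Only the selected sentences are kept; full_text itself is not stored.
    cache[doc_id] = {term: sentence for term, (_, sentence) in best.items()}


def summarize_at_search_time(doc_id, query, cache, max_sentences=2):
    """Search time: dictionary lookups only, no access to the full text."""
    per_term = cache.get(doc_id, {})
    picked = []
    for term in _words(query):
        sentence = per_term.get(term)
        if sentence is not None and sentence not in picked:
            picked.append(sentence)
    return picked[:max_sentences]


if __name__ == "__main__":
    cache = {}
    text = ("Summaries help users judge relevance. "
            "Caching summaries saves time at search time. "
            "Indexing runs offline.")
    build_summary_cache("d1", text, cache)
    print(summarize_at_search_time("d1", "caching time", cache))
```

In this sketch all sentence splitting and counting happens in build_summary_cache, so the search-time function reduces to a few lookups. A practical system would presumably also filter common function words before choosing which terms to keep, since raw frequency favors them.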
Since computation time is not as critical at index creation time, the first part of the invention does not need to be particularly time efficient; however, the second part must be extremely time efficient. Caching the information for the summary at index creation time allows the web search engine to generate the summary at search time without requiring the full text of the document or a large amount of computation resources. Thus, the present invention shifts the least time efficient aspects of the summary generation process from the second part (at search time) to the first part (at
Kantrowitz Mark
Mittal Vibhu O.
Witbrock Michael J.
Choules Jack
Justsystem Corporation
Webb Ziesenheim & Logsdon Orkin & Hanson, P.C.