Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-02-01
2003-12-16
Robinson, Greta (Department: 2177)
active
06665666
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to the field of querying and searching collections of text. More specifically, the invention relates to querying and searching collections of text in a networking environment.
BACKGROUND OF THE INVENTION
Text documents contain a great deal of factual information. For example, an encyclopedia contains many text articles consisting almost entirely of factual information. Newspaper articles contain many facts along with descriptions of newsworthy events. The World Wide Web contains millions of text documents, many of which contain at least a small amount of factual information.
Given this collection of factual information, we naturally desire the ability to answer questions based on this information using automatic computer programs. Two kinds of computer programs have previously been created to search factual information: database management systems and information retrieval systems. A database management system (DBMS) assumes that information is stored in a structured fashion, such that each data element has a known data type and a set of legal operations. For example, a relational database management system (RDBMS) provides the Structured Query Language, or SQL, which specifies a syntax and grammar for the formulation of queries against the database. SQL is based on a relational calculus that restricts queries to include only certain operations on certain data types, and certain combinations of those operations.
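By way of illustration (this example is not from the patent), the following sketch shows the kind of structured query a DBMS supports, using Python's built-in sqlite3 module; the country table, its columns, and the sample figures are all invented for the example.

```python
# Minimal sketch (hypothetical schema): answering a factual question
# against structured data with SQL, as an RDBMS would.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("China", 1_260_000_000), ("India", 1_000_000_000), ("USA", 280_000_000)],
)

# The relational calculus restricts the query to known operations on known
# data types: here, an aggregate over an INTEGER column.
(total,) = conn.execute("SELECT SUM(population) FROM country").fetchone()
print(f"Total population of listed countries: {total}")
```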
A relational database is tailored to applications where the factual information is available in a structured form. To address the factual information contained in free text documents, information retrieval (IR) systems were created. An information retrieval system indexes a collection of documents using textual features (e.g., words, noun phrases, named entities, etc.). The document collection can then be searched using either Boolean queries or natural language queries. A Boolean query consists of textual features and Boolean operators (e.g., and, or, not). To evaluate a Boolean query, an IR system returns the set of documents that satisfies the Boolean expression. A natural language query is a free form text query that describes the user's information need. Documents likely to satisfy this information need are then found using a retrieval model that matches the query with documents in the collection. Popular models include the probabilistic and vector space models, both of which use text feature occurrence statistics to match queries with documents. In all cases, an IR system only identifies entire documents that are likely to satisfy a user's information need.
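As a rough, hypothetical sketch of the vector space model mentioned above (toy documents and bare term-frequency weighting, not a production IR system), the following ranks whole documents by cosine similarity with the query, illustrating that an IR system returns documents rather than answers.

```python
# Simplified vector-space retrieval sketch: rank whole documents by
# cosine similarity between term-frequency vectors (toy data).
import math
from collections import Counter

docs = {
    "d1": "the population of the world is about six billion people",
    "d2": "the world cup final drew a large television audience",
    "d3": "world population growth slowed during the decade",
}

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = vectorize("what is the population of the world")
ranked = sorted(docs, key=lambda d: cosine(query, vectorize(docs[d])), reverse=True)
print(ranked)  # documents, not answers, come back: ['d1', 'd3', 'd2'] on this toy data
```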
Ideally, the user would be able to phrase a specific question, e.g., “What is the population of the world?”, and the computer program would respond with a specific answer, e.g., “6 billion”. Moreover, the computer program would produce these answers by analyzing the factual information available in the vast supply of text documents, examples of which were given previously. Thus, the problem at hand is how to automatically process free text documents and provide specific answers to questions based on the factual information contained in the analyzed documents.
STATEMENT OF PROBLEMS WITH THE PRIOR ART
When users have an information need, search engines are typically used to find the desired information in a large collection of documents. The user's query is treated as a bag of words, and these words are matched against the contents of the documents in the collection; the documents with the best matches, according to some scoring function, are returned at the top of a hit-list. Such an approach can be quite effective when one is looking for information about some topic. However, if one desires an answer to a question, a different approach has to be attempted, for the following reasons:
(1) Using a standard search engine approach, the user gets back documents rather than answers to a question. This then requires browsing the documents to see if they do indeed contain the desired answers (which they may not), which can be a time-consuming process.
(2) No attempt is made to even partially understand the question and to modify the processing accordingly. So, for example, if the question is “Where is XXX”, the word “where” will either be left intact and submitted to the search engine, which is a problem since any text giving the location of XXX is very unlikely to include the word “where”, or the word will be considered a stop-word and stripped out, leaving the search engine with no clue that a location is sought.
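The second problem can be made concrete with a small, purely illustrative sketch (the stop-word list and the sample question are invented): treating the question as a bag of words either keeps “where”, which is unlikely to appear in text stating a location, or strips it as a stop-word and discards the only hint that a location is sought.

```python
# Illustrative only: how bag-of-words query processing loses the question's intent.
STOP_WORDS = {"where", "is", "what", "the", "of", "a"}  # invented list for the example

def bag_of_words(query, strip_stop_words):
    tokens = query.lower().replace("?", "").split()
    if strip_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

question = "Where is the Taj Mahal?"
print(bag_of_words(question, strip_stop_words=False))  # ['where', 'is', 'the', 'taj', 'mahal']
print(bag_of_words(question, strip_stop_words=True))   # ['taj', 'mahal'] -- no clue a location is sought
```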
The above discussion describes the most commonly found situation. There are two approaches that have been used in an attempt to provide better service for the end-user.
The first of these does not directly use search engines at all, and is currently in use by AskJeeves (www.askjeeves.com). This approach uses a combination of databases of facts, semantic networks, ontologies of terms, and a way to match users' questions to this data in order to convert the user's question into one or more standard questions. Thus the user will ask a question, and the system will respond with a list of questions that the system can answer. These latter questions match the user's question in the sense that they share some keywords in common (a simple sketch of such keyword matching follows the list below). A mapping exists between these standard questions and reference material, which is usually in the form of topical Web pages. This mapping is built by generating for these pages one or more templates or annotations, which are matched against the user's questions. These templates may be either in natural-language or structured form. The four major problems with this approach are:
(1) Building and maintaining this structure is extremely labor-intensive and potentially error-prone, and is certainly subjective.
(2) When new textual material (such as news articles) comes into existence, it cannot automatically be incorporated into the “knowledge-base” of the system, but must wait until a human makes the appropriate links, if at all. This deficiency creates a time-lag at best and a permanent hole in the system's capability at worst. Furthermore, using this approach, only a pointer to some text where the answer may be found is given instead of the answer itself. For instance, asking “How old is President Clinton?” returns a set of links to documents containing information about President Clinton; however, there is no guarantee that any of these documents will contain the age of the President. Generating these templates automatically cannot be done accurately with the current state of the art in automatic text understanding.
(3) It can easily happen that there is no match between the question and pre-stored templates; in such cases these prior art systems default to standard (non-Question-Answering) methods of searching.
(4) There is no clear way to compute the degree of relevance between the question and the text that is returned, so it is not straightforward to determine how to rank-order these texts.
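As noted above, the following is a minimal, hypothetical sketch of matching a user's question to pre-stored standard questions by keyword overlap; the standard questions, the example.com page URLs, and the overlap threshold are invented and do not describe AskJeeves' actual implementation. The lack of a principled relevance score (problem (4)) shows up as the ad hoc overlap-count ranking.

```python
# Hypothetical sketch: keyword overlap between a user question and
# pre-stored "standard questions" that map to reference pages.
STANDARD_QUESTIONS = {
    "What is the population of a country?": "http://example.com/populations",
    "Where is a country located?": "http://example.com/maps",
    "Who is the president of a country?": "http://example.com/leaders",
}

def keywords(text):
    return set(text.lower().replace("?", "").split())

def match(user_question, threshold=1):
    q = keywords(user_question)
    hits = []
    for standard, page in STANDARD_QUESTIONS.items():
        overlap = len(q & keywords(standard))
        if overlap >= threshold:
            hits.append((overlap, standard, page))
    # Ranking by raw keyword overlap is ad hoc: there is no principled
    # measure of relevance between question and returned material.
    return sorted(hits, reverse=True)

print(match("What is the population of France?"))
```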
The second approach uses traditional search-engines with post-processing by linguistic algorithms, and is the default mechanism suggested and supported by the TREC-8 Question-Answering track. In this approach, a question is submitted to a traditional search engine and documents are returned in the standard manner. It is expected that many of these documents will be false hits, for reasons outlined earlier. Linguistic processing is then applied to these documents to detect one (or more) instances of text fragments that correspond to an answer to the question at hand. The thinking here is that it is too computationally expensive to apply sophisticated linguistic processing to a corpus that might be several gigabytes in size, but it is reasonable to apply such processing to a few dozen or even a few hundred documents that come back at the top of the hit-list. The problem with this approach, though, is that, again for reasons given earlier, even the top documents in the hit-list so generated might not contain the sought answer.
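The shape of this retrieve-then-extract pipeline can be sketched as follows; this is a generic illustration, not the TREC-8 systems themselves. The word-overlap search function stands in for a traditional search engine, and the regular-expression answer extractor stands in for the linguistic post-processing; the corpus, question, and pattern are all invented.

```python
# Generic retrieve-then-extract sketch: a search step produces a hit-list,
# and a post-processing step scans only the top documents for an answer.
import re

CORPUS = {
    "d1": "The population of the world reached about 6 billion in 1999.",
    "d2": "World leaders met to discuss population policy and trade.",
    "d3": "The stadium holds a large population of enthusiastic fans.",
}

def search(query, corpus, k=2):
    """Stand-in search engine: rank by word overlap, return the top-k document ids."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(corpus[d].lower().split())), reverse=True)
    return ranked[:k]

def extract_answer(doc_text):
    """Stand-in 'linguistic' step: look for a number following the word 'population'."""
    m = re.search(r"population[^.]*?(\d[\d.,]*\s*(?:billion|million)?)", doc_text, re.IGNORECASE)
    return m.group(1).strip() if m else None

question = "What is the population of the world?"
for doc_id in search(question, CORPUS):
    answer = extract_answer(CORPUS[doc_id])
    if answer:
        print(doc_id, "->", answer)  # e.g. d1 -> 6 billion on this toy corpus
        break
```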
Brown Eric William
Coden Anni R.
Prager John Martin
Radev Dragomir Radkov
Percello Louis J
Rayyan Susan
Robinson Greta
System, method and program product for answering questions...