Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-10-10
2004-01-13
Alam, Shahid Al (Department: 2172)
C707S793000, C707S793000, C704S009000
active
06678679
ABSTRACT:
Each and every document, including patents and publications, cited herein is incorporated herein by reference in its entirety as though recited in full.
FIELD OF THE INVENTION
The present invention relates generally to data processing apparatus and corresponding methods for the retrieval of data stored in a database or as computer files. More particularly, the present invention relates to methods and systems to facilitate refinement of queries intended to specify data to be retrieved from a target data collection.
BACKGROUND INFORMATION
A significant trend today is the rapid growth in the amount of information available in electronic form. For example, unprecedented amounts of textual and symbolic information are becoming available on intranets and on the Internet as a whole. Unfortunately, tools for locating information of interest in these large collections are quite limited.
Typically, when searching for information on an intranet or the Internet, a user creates a query that is intended to specify a particular information need. An information retrieval system then interprets this query and searches a target data collection to identify items in the collection that are relevant to the query. The retrieval system then retrieves these items, or abstracts thereof, and presents them to the user. In the process of presentation, it is desirable for the retrieved items, or abstracts thereof, to be ranked in the order of their applicability to the expressed information need of the user. Unfortunately, both formation of well-focused queries and informative ranking of retrieved items are quite difficult to do well.
One factor contributing to the difficulty of retrieving information of interest from target collections, such as large full-text databases, is the imprecise nature of human languages. The richness of human language is a strength in expressing ideas with full conceptual generality. In addition, all human languages incorporate significant elements of ambiguity. Both richness and ambiguity create problems from a retrieval perspective.
Approaches to text retrieval are confronted with the fact that multiple words may have similar meanings (synonymy) and a given word may have multiple meanings (polysemy). The typical English word, for example, has at least a half dozen close synonyms. In addition, there generally are a much larger number of broader and narrower terms that are related to any given word of interest. It is often infeasible for a user to anticipate all ways in which an author may have expressed a given concept. A user may, for example, consider using the word car in a query. An author of a document, however, may have used a different term such as automobile or horseless carriage. The author also may have used a more general term, such as vehicle, or narrower terms, such as Ford or Mustang. Failure to include all of these variants in a query will lead to incomplete retrievals.
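By way of illustration only, the car example above can be sketched as a simple query expansion step. The thesaurus table, the function name expand_query, and the OR-joined output syntax are assumptions made for this sketch; they are not taken from the specification or from any particular retrieval system.

```python
# Hypothetical hand-built thesaurus entry for the "car" example in the text.
RELATED_TERMS = {
    "car": {
        "synonyms": ["automobile", "horseless carriage"],
        "broader": ["vehicle"],
        "narrower": ["Ford", "Mustang"],
    }
}


def expand_query(term, thesaurus=RELATED_TERMS):
    """Join a term and its known variants into one OR query string."""
    entry = thesaurus.get(term, {})
    variants = [term]
    for kind in ("synonyms", "broader", "narrower"):
        variants.extend(entry.get(kind, []))
    # Quote multi-word variants so they are treated as phrases.
    return " OR ".join(f'"{v}"' if " " in v else v for v in variants)


print(expand_query("car"))
# car OR automobile OR "horseless carriage" OR vehicle OR Ford OR Mustang
```

A term with no thesaurus entry is returned unchanged, which mirrors the situation in which the user cannot anticipate the author's wording.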
Polysemy creates a complementary problem. Most words in most languages have multiple meanings. In English, for example, the word fire has several common meanings. It can be used as a noun to describe a combustion activity. It also can be used as a verb meaning to terminate employment or to launch an object. A particularly polysemous word, such as strike, has dozens of common meanings. For the 2000 most polysemous words in English, the typical verb has more than eight common senses, and the typical noun has more than five. Using such a word in a query can result in much extraneous material being retrieved. For example, a person using the word strike in a query with the intent of retrieving material on labor actions also will be presented with material on baseball, air strikes, striking of oil, people who strike up a conversation, etc.
In information retrieval, two metrics generally are applied in evaluating incompleteness and imprecision of retrievals. Recall is a measure of the completeness of retrieval operations. For any given query and any given collection of documents, recall is defined as the fraction of the relevant documents from the collection that are retrieved by the query. Precision is defined as the fraction of retrieved documents that are, in fact, relevant to the user's information need. Both metrics typically are expressed as percentages. Historically, text retrieval systems typically have operated at recall and precision levels in the neighborhood of 20 to 30 percent. As the size of full-text databases has grown, however, these numbers have declined.
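The two definitions above can be made concrete with a short sketch. The function name precision_recall and the document identifiers are hypothetical, and the counts are chosen only so that the results fall in the 20 to 30 percent range mentioned in the text.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: the query retrieves doc1..doc10; the collection holds
# 12 relevant documents, only 3 of which (doc1..doc3) were retrieved.
retrieved = [f"doc{i}" for i in range(1, 11)]
relevant = ["doc1", "doc2", "doc3"] + [f"doc{i}" for i in range(11, 20)]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
# precision = 30%, recall = 25%
```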
Information retrieval from the Internet offers a good example of the problems presented by the tension between precision and recall and by large target collection volume. The types of queries typically formulated by users of Internet search engines frequently result in identification of tens of thousands of web pages as potentially relevant (see FIG. 1). Upon examination, the vast majority of these pages typically turn out to be irrelevant, i.e., they contribute to low precision. That, itself, would be a lesser problem if the results were accurately ranked, i.e., if the most relevant web page was returned as the first result, the next most relevant as the second result, etc. Unfortunately, the quality of ranking as provided by current tools is typically less than optimal. If the search criteria are narrowed to increase precision, some relevant documents might be excluded, leading to lower recall.
Most text retrieval systems accept queries in one of two forms, as Boolean logical constructions or as natural language inputs. Boolean constructions involve words connected by Boolean logical operators (e.g., AND, OR, NOT). For example, in response to the query: bear AND NOT (teddy OR beanie), a retrieval system using Boolean queries would retrieve documents that discuss bears, but not those that discuss teddy bears or beanie baby bears. Natural language inputs can take the form of sentences that are produced by the user. Alternatively, documents or portions of documents may be used as queries.
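As a minimal sketch of how the Boolean query above might be evaluated, the following builds a small inverted index and applies set operations to it. The four sample documents and the helper names build_index and postings are assumptions made for illustration; they do not describe any particular retrieval system.

```python
# Hypothetical four-document collection.
docs = {
    "d1": "the grizzly bear catches salmon in the river",
    "d2": "a teddy bear is a popular toy",
    "d3": "collectors trade beanie baby bear toys",
    "d4": "salmon swim upstream in the river",
}


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


index = build_index(docs)


def postings(term):
    """Return the ids of documents containing the term (empty set if none)."""
    return index.get(term, set())


# bear AND NOT (teddy OR beanie):
# OR is set union, AND NOT is set difference.
result = postings("bear") - (postings("teddy") | postings("beanie"))
print(sorted(result))
# ['d1']
```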
In theory, for any given information need and target information collection, a Boolean query could potentially be constructed that could be used to retrieve relevant information from the target collection with 100% recall and 100% precision. For realistic queries and collections, however, the corresponding Boolean construct might be very large. Some people find it difficult or uneconomical to create complex Boolean queries. This is strongly demonstrated by the observation that, on average, queries used with Internet search engines consist of slightly over two terms. Similar averages are also seen for queries employed on large intranets in government and industry.
While some users are capable of forming non-trivial queries containing more than two terms, some users have learned that modest increases in initial query complexity bring relatively limited increases in the quality of results. Better results are typically obtained when the user reviews selected documents that are retrieved and then makes iterative modifications to the query. Even in this case, however, there are restrictions. First, the initial query often constrains the scope of subsequent iterations. Not knowing what relevant information was excluded by the initial query, the user has few, if any, clues available that indicate how to modify the query to include missed relevant information in subsequent retrievals. Second, it is often difficult to think through what query modifications would be desired in order to improve a given result. The amount of time and effort required to produce effective Boolean queries with current tools is greater than many users invest.
One technique shown to be of considerable value when directly applied to text retrieval is Latent Semantic Indexing (LSI). See S. T. Dumais, LSI meets TREC: A status report, THE FIRST TEXT RETRIEVAL CONFERENCE (TREC1), NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY SPECIAL PUBLICATION 500-207, pp. 137-152 (1993) [DUMAIS I]; S. T. Dumais, Latent Semantic Indexing (LSI) and TREC-2, THE SECOND TEXT RETRIEVAL CONFERENCE (TR
Al Alam Shahid
Kilpatrick & Stockton LLP
Science Applications International Corporation