Method and apparatus for automatic construction of faceted...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C707S793000, C704S009000

active

06519586

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to automated document searching, and in particular to the introduction of conceptual/terminological structure to a document set based on textual content.
BACKGROUND OF THE INVENTION
The exponential growth of the Internet has provided consumers with the ability to access vast quantities of information—so much, in fact, that guiding consumers to the information they desire is now an industry. Commercial “search engines” such as ALTAVISTA, accessible over the Internet, maintain massive databases of Internet-accessible documents and accept user queries to search these documents.
The search engine may maintain the documents in an unstructured form, in which case the user searches by “keyword.” Essentially, the search engine accepts one or more words that the user considers relevant to the topic of interest, and electronically identifies documents containing the entered words. Search sophistication can be increased by means of Boolean capability, which allows the user to concatenate search terms into strings in accordance with operators such as AND and OR. In practice, it is found that simple keyword queries, while easily composed, tend to underspecify the set of desired documents (retrieving large numbers of irrelevant documents). Such problems arise from the user's lack of knowledge of the subject matter giving rise to the information need, unfamiliarity with the underlying document collection and its content with respect to that need, and the difficulty of translating even a well-defined need into an effective linguistic formulation.
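As a rough illustration of the keyword and Boolean searching described above, the following sketch builds a tiny inverted index and evaluates AND and OR queries against it. The corpus, the function names, and the whitespace tokenization are assumptions made for this example; they are not taken from any particular search engine.

```python
from collections import defaultdict

# Toy corpus; the document ids and texts are invented for illustration.
DOCS = {
    1: "solar energy panels for home heating",
    2: "wind energy turbines",
    3: "home heating with natural gas",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_and(index, terms):
    """Return ids of documents that contain every term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, terms):
    """Return ids of documents that contain at least one term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.union(*sets) if sets else set()

index = build_index(DOCS)
broad = search_or(index, ["energy", "heating"])    # all three documents
narrow = search_and(index, ["energy", "heating"])  # only document 1
```

The OR query matches every document in this toy corpus while the AND query matches one, which hints at why short keyword queries tend to underspecify the desired set.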
Current search interfaces typically offer a query-refinement loop that allows the user to enter the initial search expression, evaluate the results returned, and then modify the query by addition of keywords. Evaluating search results can be a time- and energy-consuming task, however. In surveying a potentially long list of titles and document summaries, the user must not only evaluate the likely relevance of the retrieved documents, but also assess the likelihood that the database will eventually be able to satisfy the information need (or part of it); assess the degree to which the current query formulation has expressed the need; learn about the information space and the vocabulary used to describe the domain within this particular database; and ultimately decide on an appropriate query reformulation strategy to the extent necessary.
To help the user focus his or her search without this kind of extensive analysis, the documents may be organized according to content, allowing the user to browse through a category of documents or at least to confine a keyword search within such a category. “Clustering” techniques are frequently employed to categorize related documents within a document corpus. But generating the categories and placing the documents within them is an arduous task. Clustering can, for example, be accomplished manually, with each document being individually examined by a clerk who assigns it to the proper category. Naturally, this approach is prohibitive for commercial Internet search engines that store millions of documents.
Clustering can also be performed automatically. “Bottom-up” and “top-down” clustering techniques utilize algorithms that generate a hierarchical category structure and assign each document to one or more categories. These techniques are computationally demanding, however, and do not necessarily generate document categories that ultimately prove meaningful to users.
Another approach to providing users interactive feedback to assist searching is to display terminology “relevant” to the search. The difficulty here is twofold: first, determining which of the thousands of potentially related terms are likely to be most useful in this instance for query reformulation and, second, arranging those terms in some way that helps to elucidate the search space. A manually constructed thesaurus or a database of term-to-term correlations derived from statistical corpus analysis can be used to identify terms that are semantically or statistically related to terms in a user's query expression. Alternatively, a result list can be analyzed at run-time for frequently occurring terms or for phrases containing query terms. In most cases, the terms are simply presented as an unstructured list (perhaps ordered alphabetically or by frequency).
DESCRIPTION OF THE INVENTION
BRIEF SUMMARY OF THE INVENTION
The present invention facilitates searching by extracting, from a collection of documents within a corpus, terms representing key informational concepts (herein referred to as “facets” of the document collection). When the user performs a keyword or other conventional search, the facets pertaining to the documents retrieved by the search are returned to the user along with the documents (which are generally presented in summary form in a results list). The facets may be used directly to refine the search, but also serve to educate the user about the information content of the document corpus and the result list as these relate to the information need.
The invention constructs “faceted” representations of documents by identifying a set of lexical dimensions that roughly characterize concepts likely to have informational relevance. It is found that lexical items signifying key concepts within a domain often tend to co-occur with other useful concepts within certain syntactic contexts, such as noun phrases. Consequently, facets are chosen heuristically based on “lexical dispersion,” a measure of the number of different words with which a particular word co-occurs within such syntactic contexts. The greater the level of dispersion—i.e., the more different words with which the given word appears in the documents within the allowed syntactic context—the greater is the likelihood that the given word (along with the lexical constructs in which it occurs) will represent a useful conceptual category relevant to the query topic. The facets and their corresponding lexical constructs effectively provide a concise, structured summary of the contents of a result set as well as a set of candidate terms for iterative query reformulation.
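The dispersion measure described above can be sketched in a few lines. The example assumes the noun phrases have already been extracted and treats dispersion as the count of distinct other words that share a phrase with a given word; the phrase list and the function name `lexical_dispersion` are hypothetical, and the patent's actual procedure may differ in detail.

```python
from collections import defaultdict

# Hypothetical noun phrases, assumed to have been extracted beforehand.
PHRASES = [
    "solar energy", "wind energy", "nuclear energy", "energy policy",
    "wind turbine", "solar panel", "solar energy",
]

def lexical_dispersion(phrases):
    """Return {word: number of distinct other words it co-occurs with}."""
    neighbors = defaultdict(set)
    for phrase in phrases:
        words = set(phrase.lower().split())
        for word in words:
            # Every other word in the same phrase counts as a co-occurrence.
            neighbors[word] |= words - {word}
    return {word: len(others) for word, others in neighbors.items()}

dispersion = lexical_dispersion(PHRASES)
# "energy" appears with solar, wind, nuclear, and policy, so its dispersion is 4.
```

Because co-occurring words are held in sets, the repeated phrase "solar energy" does not inflate the count: dispersion measures variety of context, not raw frequency.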
Accordingly, in a first aspect, the invention comprises a method of selecting and organizing documents from a document corpus in response to a user-provided search expression. Preferably, the document corpus is first analyzed to identify potential facets; this is accomplished by searching the textual content of the documents for lexical constructs conforming to a selected syntactic pattern, such as a noun phrase. The lexical constructs, in turn, are examined at query time to derive dispersion rates for words within the constructs. The dispersion rates are assumed to indicate the conceptual relevance of the words to which they relate, and these words are ranked in accordance with their dispersion rates.
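The steps in this aspect (finding constructs that match a syntactic pattern, deriving dispersion rates, and ranking words) might be sketched as follows. The part-of-speech lexicon `TAGS`, the pattern "one or more adjectives or nouns followed by a noun", and the sample texts are all assumptions for illustration; a real system would rely on a proper tagger and the pattern actually selected.

```python
import re
from collections import defaultdict

# Hypothetical part-of-speech lexicon: A = adjective, N = noun.
TAGS = {
    "solar": "A", "nuclear": "A", "energy": "N", "wind": "N",
    "policy": "N", "turbine": "N",
}

# Assumed syntactic pattern: adjectives/nouns ending in a noun (two or more words).
NOUN_PHRASE = re.compile(r"[AN]+N")

def extract_constructs(text):
    """Return word sequences whose tag sequence matches NOUN_PHRASE."""
    words = re.findall(r"[a-z]+", text.lower())
    # One tag character per word, so string offsets equal word offsets.
    tags = "".join(TAGS.get(w, "x") for w in words)
    return [" ".join(words[m.start():m.end()]) for m in NOUN_PHRASE.finditer(tags)]

def rank_by_dispersion(constructs):
    """Return (word, dispersion) pairs, highest dispersion first."""
    neighbors = defaultdict(set)
    for construct in constructs:
        words = set(construct.split())
        for word in words:
            neighbors[word] |= words - {word}
    pairs = [(w, len(others)) for w, others in neighbors.items()]
    return sorted(pairs, key=lambda p: (-p[1], p[0]))

TEXTS = [
    "The new energy policy favors solar energy.",
    "A wind turbine converts wind energy.",
    "Nuclear energy is debated.",
]
constructs = [c for text in TEXTS for c in extract_constructs(text)]
ranked = rank_by_dispersion(constructs)
```

In this toy corpus "energy" ranks first because it combines with four different words, which matches the intuition that high-dispersion words mark useful conceptual categories.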
The user's conventional search is processed in the usual fashion, returning to the user a list of documents conforming to the search criteria. The user also receives a list of the facets contained in the retrieved documents. The facets, and the lexical constructs within which they appear, may be used for query reformulation in various ways. The user may, for example, recognize a particular construct as especially relevant to the information need and choose to see a list of documents containing this lexical construct. Alternatively, the user may choose to augment the original search expression with a selected word or lexical construct.
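The two reformulation options mentioned above can be illustrated with a small sketch: narrowing a result set to the documents that contain a chosen lexical construct, and augmenting the original search expression with a selected term. The result set, the function names, and the quoted-phrase AND syntax are hypothetical; actual query syntax depends on the search engine in use.

```python
# Hypothetical result set returned by an initial search for "energy".
RESULTS = {
    1: "solar energy panels reduce home heating bills",
    2: "wind energy output depends on turbine size",
    3: "solar energy subsidies and energy policy",
}

def documents_with_construct(results, construct):
    """Return sorted ids of result documents whose text contains the construct.

    Simple substring matching is used here as a simplification.
    """
    needle = construct.lower()
    return sorted(doc_id for doc_id, text in results.items()
                  if needle in text.lower())

def augment_query(query, term):
    """Append a term to the query with AND, quoting multi-word constructs."""
    quoted = f'"{term}"' if " " in term else term
    return f"{query} AND {quoted}"

solar_docs = documents_with_construct(RESULTS, "solar energy")
refined = augment_query("energy", "solar energy")
```

Here `solar_docs` holds the ids of the two documents mentioning "solar energy", and `refined` is the string `energy AND "solar energy"`, ready to be submitted as a narrower query.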


REFERENCES:
patent: 5787421 (1998-07-01), Nomiyama
patent: 5819260 (1998-10-01), Lu et al.
patent: 5913215 (1999-06-01), Rubinstein et al.
patent: 6169986 (2001-01-01), Bowman et al.
patent: 6212494 (2001-04-01), Boguraev
Anick et al., “Exploiting Clustering and Phrases for Context-Based Information Retrieval” Proceedings of SIGIR '97, pp. 314-323, 1997.
Bates, “How to Use Information Search Tactics Online” Online, pp. 47-54, May 1987.
Meadow et al., “Online Access to Knowledge: System Design” JASIS, vol. 40, pp. 86-98,
