Method and apparatus for identifying clauses having...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details

C707S793000

active

06295529

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention is directed to a system for determining a relationship (such as similarity in meaning) between two or more textual inputs. More specifically, the present invention is directed to a system which performs improved information retrieval-type tasks by identifying clauses in documents being searched having certain predetermined characteristics.
The present invention is useful in a wide variety of applications, such as many aspects of information retrieval including indexing, pre-query and post-query processing, document similarity/clustering, document summarization, natural language understanding, etc. However, the present invention will be described primarily in the context of information retrieval, for illustrative purposes only.
Generally, information retrieval is a process by which a user finds and retrieves information, relevant to the user, from a large store of information. In performing information retrieval, it is important to retrieve all of the information a user needs (i.e., it is important to be complete) and at the same time it is important to limit the irrelevant information that is retrieved for the user (i.e., it is important to be selective). These dimensions are often referred to in terms of recall (completeness) and precision (selectivity). In many information retrieval systems, it is important to achieve good performance across both the recall and precision dimensions.
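As a minimal sketch of how these two dimensions are commonly computed for a single query, the following assumes that both the retrieved documents and the truly relevant documents are available as collections of identifiers. The function name and the document identifiers are hypothetical and do not come from the patent.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers: four documents returned, three actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d5"]
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, so returning more documents could raise recall while lowering precision.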
In some current retrieval systems, the amount of information that can be queried and searched is very large. For example, some information retrieval systems are set up to search information on the Internet, on digital video discs, and in other computer databases in general. Such information retrieval systems are typically embodied as, for example, Internet search engines and library catalog search engines. Further, even within the operating system of a conventional desktop computer, certain types of information retrieval mechanisms are provided. For example, some operating systems provide a tool by which a user can search all files in a given database or on a computer system based upon certain terms input by the user.
Many information retrieval techniques are known. A user input query in such techniques is typically presented either as an explicit user-generated query or as an implicit query, such as when a user requests documents which are similar to a set of existing documents. Typical information retrieval systems search documents in a larger data store at either a single-word level or a term level. Each of the documents is assigned a relevance (or similarity) score, and the information retrieval system presents a certain subset of the documents searched to the user (typically that subset whose relevance score exceeds a given threshold).
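The score-then-threshold behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the scoring function (cosine similarity over raw word counts), the threshold value, and the sample documents are all hypothetical choices, not the scoring used by any particular system discussed here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, threshold=0.6):
    """Score every document against the query; keep those above the threshold.

    The threshold of 0.6 is an arbitrary illustrative value.
    """
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(text.lower().split())), name)
              for name, text in docs.items()]
    # Highest-scoring documents first.
    return sorted((pair for pair in scored if pair[0] > threshold), reverse=True)


# Hypothetical single-word-level document store.
docs = {
    "a": "solar panel efficiency data",
    "b": "stock market report",
    "c": "solar energy",
}
results = retrieve("solar panel", docs)
names = [name for _, name in results]
```

With these inputs only document "a" clears the threshold; document "c" shares one query word and scores about 0.5, and document "b" shares none.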
The rather poor precision of conventional statistical search engines stems from their assumption that words are independent variables, i.e., that words in any textual passage occur independently of each other. Independence in this context means that the conditional probability of any one word appearing in a document, given the presence of another word therein, is the same as its probability absent that other word; i.e., a document is treated as an unstructured collection of words or, simply put, “a bag of words”.
As one can readily appreciate, this assumption, with respect to any language, is grossly erroneous. Words that appear in a textual passage are simply not independent of each other. Rather, they are highly inter-dependent.
Keyword-based search engines totally ignore this fine-grained linguistic structure. For example, consider an illustrative query expressed in natural language: “How many hearts does an octopus have?” A statistical search engine, operating on the content words “hearts” and “octopus”, or morphological stems thereof, might well return or direct a user to a stored document that contains a recipe that has as its ingredients, and hence its content words: “artichoke hearts, squid, onion and octopus”. This engine, given matches in the two content words, may determine, based on statistical measures, that this document is an excellent match. In reality, the document is quite irrelevant to the query.
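The octopus example can be reproduced with a few lines of code. The sketch below assumes a crude content-word extractor (whitespace tokenization, punctuation stripping, and a small hand-written stopword list); the stopword list, the function name, and the wording of the two sample documents are hypothetical and chosen only to mirror the example.

```python
# Hypothetical stopword list, just large enough for this example.
STOPWORDS = {"how", "many", "does", "an", "have", "a", "the", "and", "of"}


def content_words(text):
    """Lowercase, strip punctuation, drop stopwords: a crude bag of words."""
    tokens = (w.strip('.,?!:;"').lower() for w in text.split())
    return {w for w in tokens if w and w not in STOPWORDS}


query = "How many hearts does an octopus have?"
recipe = "Ingredients: artichoke hearts, squid, onion and octopus."
answer = "An octopus has three hearts."

q = content_words(query)
recipe_overlap = q & content_words(recipe)
answer_overlap = q & content_words(answer)
```

Both documents match the same two content words, so a matcher that sees only word overlap has no basis for telling the recipe apart from the document that actually answers the question.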
The art also teaches various approaches for extracting elements of syntactic phrases which are indexed as terms in a conventional statistical vector-space model. One example of such an approach is taught in J. L. Fagan, “Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods”, Ph.D. Thesis, Cornell University, 1988, pp. 1-261. Another such syntactic based approach is described, in the context of using natural language processing for selecting appropriate terms for inclusion within search queries, in T. Strzalkowski, “Natural Language Information Retrieval: Tipster-2 Final Report”, Proceedings of Advances in Text Processing: Tipster Program Phase 2, Darpa, May 6-8 1996, Tysons Corners, Va., pp. 143-148; and T. Strzalkowski, “Natural Language Information Retrieval”, Information Processing and Management, Vol. 31, No. 3, 1995, pp. 397-417. A further syntactic-based approach of this sort is described in B. Katz, “Annotating the World Wide Web Using Natural Language”, Conference Proceedings of R.I.A.O. 97, Computer-Assisted Information Search on Internet, McGill University, Quebec, Canada, Jun. 25-27 1997, Vol. 1, pp. 135-155.
These syntactic approaches have yielded lackluster improvements, or have not been feasible to implement in the natural language processing systems available at the time. Therefore, the field has moved away from attempting to directly improve the precision and recall associated with the results of a query, and toward improvements in the user interface.
Another problem is also prevalent in some information retrieval systems. For example, where documents are indexed, such as in a typical statistical search engine, the index can be very large, depending upon the content set, and number of documents to be indexed. Large indices not only present storage capacity problems, but can also increase the amount of time required to execute a query against the index.
SUMMARY OF THE INVENTION
A system is utilized for determining a relationship between first and second textual inputs. The system identifies clauses in the first textual input having predetermined characteristics indicative of usefulness in determining the relationship. The relationship is then determined based on the clauses identified. The clauses can be eliminated from the first textual input, weighted in the first textual input, or simply annotated.
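The three treatments named above (eliminating, weighting, or annotating the identified clauses) can be sketched as follows. This is a sketch under several assumptions that are not stated in the summary: the clause splitter, the marker phrases used to flag a clause, the reading that flagged clauses are the less useful ones, and the weight value are all hypothetical stand-ins for whatever predetermined characteristics an actual embodiment would use.

```python
import re

# Hypothetical markers; the source does not say which characteristics are used.
MARKERS = ("for example", "as noted above", "in other words")


def split_clauses(text):
    """Crude clause splitter on periods and semicolons (an assumption)."""
    return [c.strip() for c in re.split(r"[.;]", text) if c.strip()]


def is_flagged(clause):
    """True when the clause begins with one of the hypothetical markers."""
    return clause.lower().startswith(MARKERS)


def process(text, mode="annotate", weight=0.1):
    """Return (clause, weight, flagged) triples under one of three modes.

    "eliminate" drops flagged clauses, "weight" down-weights them, and
    "annotate" keeps every clause at full weight but records the flag.
    """
    out = []
    for clause in split_clauses(text):
        flagged = is_flagged(clause)
        if flagged and mode == "eliminate":
            continue
        w = weight if (flagged and mode == "weight") else 1.0
        out.append((clause, w, flagged))
    return out


text = ("The octopus has three hearts. As noted above, squid are related; "
        "for example both are cephalopods.")
kept = [clause for clause, _, _ in process(text, "eliminate")]
weights = [w for _, w, _ in process(text, "weight")]
flags = [flagged for _, _, flagged in process(text)]
```

A downstream comparison of two textual inputs could then index only the kept clauses, multiply term scores by the clause weight, or consult the flag at query time, depending on the mode chosen.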
One embodiment of the invention includes a test methodology which is used in identifying the clauses having predetermined characteristics. The test methodology can be used across a wide variety of content sets, in order to customize the present invention for use with the various content sets.


REFERENCES:
patent: 3704345 (1972-11-01), Coker et al.
patent: 4994966 (1991-02-01), Hutchins
patent: 5845278 (1998-12-01), Kirsch et al.
patent: 5859972 (1999-01-01), Subramaniam et al.
patent: 5873081 (1999-02-01), Harel
patent: 5920854 (1999-07-01), Kirsch et al.
patent: 6018733 (2000-01-01), Subramaniam et al.
“A surface-based approach to identifying discourse markers and elementary textual units in unrestricted texts”, by Daniel Marcu, Information Sciences Institute, University of Southern California, 1998.
“The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts”, by Daniel Marcu, Dec. 1997, Department of Computer Science, University of Toronto, Toronto, Canada.
