Summarizing text documents by resolving co-referentiality...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C704S009000

active

06185592

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to a system and method for reviewing documents. More particularly, the present invention relates to characterizing documents in a manner that allows the user to quickly ascertain their contents.
BACKGROUND OF THE INVENTION
Documents obtained via an electronic medium (e.g., the Internet or on-line services such as AOL, CompuServe or other services) are often provided in such volume that it is important to be able to summarize them. Often it is desirable to quickly obtain a brief summary of a document (i.e., a few sentences or a paragraph in length) rather than reading it in its entirety. Most typically, such documents span several paragraphs to several pages in length. This invention concerns itself with this kind of document, hereinafter referred to as an average length document. Summarization of document content is clearly useful for assessing the contents of items such as news articles and press releases, where little a priori knowledge is available concerning what a document might be about; a summarization or abstraction facility is even more essential in the framework of emerging “push” technologies, where a user might have very little control over what documents arrive at the desktop for his/her attention.
Conventional summarization techniques for average length documents fall within two broad categories. One category is those techniques which rely on template instantiation and the other category is those techniques that rely on passage extraction.
Template Instantiation
A template is best thought of as a set of predefined categories for a particular domain. A template instantiation technique for content summarization seeks to instantiate these categories with values obtained from the body of a document, assuming that the document fits the expected domain. These types of techniques are utilized for documents that can be conveniently assigned to a well-defined domain and are known to belong to such a domain. Examples of such constrained domains are news stories about terrorist attacks or corporate mergers and acquisitions in the micro-electronics domain.
Template instantiation systems are specially designed to search for and identify predefined features in text: restricting documents to a domain whose characteristic features are known ahead of time allows a program to identify specific aspects of the story such as: ‘who was attacked’, ‘who was the perpetrator’, ‘was the acquisition friendly or hostile’, and so forth. A coherent summary can then be constructed by “fitting” the facts into a template. Unfortunately, these systems are by design limited to the particular subject domains they were engineered to cover because the systems, in effect, search for particular words and word patterns, and can function only if those patterns exist in the text and map onto the domain categories (see Ralph Grishman, “Information Extraction: Techniques and Challenges”, in M. T. Pazienza (Ed.), “Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology”, Springer, 1997, and references therein).
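As a rough illustration of template instantiation, the following Python sketch fills a small template for a corporate-acquisition domain using regular expressions. The slot names, the patterns, and the sample sentences are hypothetical and chosen only for illustration; they are not taken from any of the systems cited above.

```python
import re

# Hypothetical template for a "corporate acquisition" domain. Each slot is
# filled by the first regular-expression match found in the text.
VERB = r"(?:acquired|will acquire|agreed to acquire)"
NAME = r"([A-Z][\w&]*(?: [A-Z][\w&]*)*)"  # run of capitalized words

ACQUISITION_TEMPLATE = {
    "acquirer": re.compile(NAME + " " + VERB),
    "target": re.compile(VERB + " " + NAME),
    "attitude": re.compile(r"\b(friendly|hostile)\b"),
}


def instantiate(template, text):
    """Return a dict mapping each slot to its matched value, or None."""
    filled = {}
    for slot, pattern in template.items():
        match = pattern.search(text)
        filled[slot] = match.group(1) if match else None
    return filled


in_domain = "Acme Corp agreed to acquire Widget Works in a hostile takeover."
out_of_domain = "The committee met on Tuesday to discuss the budget."

print(instantiate(ACQUISITION_TEMPLATE, in_domain))
print(instantiate(ACQUISITION_TEMPLATE, out_of_domain))
```

For the in-domain sentence every slot is filled; for the out-of-domain sentence every slot stays empty, which reflects the limitation noted above: the patterns only work when the document belongs to the domain they were written for.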
Sometimes, a set of proper names and technical terms can be quite indicative of content. Phrasal matching techniques, developed for the purposes of template instantiation, are able to provide a list of the pertinent terms within a document. Such techniques have grown to become quite robust (see J. S. Justeson and S. M. Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text”, Journal of Natural Language Engineering, vol.1(1), 1995; see also “Coping with Unknown Lexicalizations”, in B. K. Boguraev and J. Pustejovsky (Eds.), “Corpus Processing for Lexical Acquisition”, MIT Press, 1996). If a document is small enough, then complete lists of proper names and technical terms can provide a relatively informative characterization of the document content. However, for longer documents the term list will be plagued by unnecessary and incorrect terms, ultimately defeating its representativeness as a content abstraction.
Accordingly, this type of summarization technique requires a front end analysis sensitive to a domain description, and capable of filling out domain-specific templates which will provide for accurate summarization of the document; thus it depends on knowing, a priori, the document's domain.
Passage Extraction
Passage extraction techniques do not depend on prior knowledge of the domain. They are based on identifying certain passages of text (typically sentences) as being most representative of the document. This type of technique typically uses a statistical approach to compute the “closeness” between a sentence and the document as a whole. Generally speaking, this closeness is determined by mapping individual sentences, as well as the entire document, onto a multidimensional vector space, and then performing mathematical calculations to determine how similar (by some appropriate metric) the sentence is to the text. As a rule, if a sentence has many words which appear repeatedly throughout the document, it will receive a relatively high score. The highest ranking sentence or sentences are then presented as a summary of the document.
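A minimal sketch of this kind of scoring, in Python, is given below. It assumes a plain bag-of-words vector for each sentence and for the whole document, and uses cosine similarity as the "appropriate metric"; both choices, as well as the function names and the sample sentences, are assumptions made for illustration rather than details of any particular deployed system.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(sentences):
    """Score each sentence against the whole document, highest score first."""
    doc_vec = Counter(word for s in sentences for word in tokenize(s))
    scored = [(cosine(Counter(tokenize(s)), doc_vec), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


sentences = [
    "The merger of the two banks was announced.",
    "The banks expect the merger to close soon.",
    "Rain is forecast.",
]
ranked = rank_sentences(sentences)
for score, sentence in ranked:
    print(f"{score:.3f}  {sentence}")
```

The two sentences that share the document's recurring words rank above the unrelated one. The sketch does no stop-word removal or term weighting, so frequent function words such as "the" contribute to the score.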
Such “summarization” programs, some of which are beginning to be deployed commercially, do not provide true summaries, in the sense of a summary being, e.g., an abstract capturing the essential, core content of a document. While more indicative of what a document is about than, for instance, a title alone, such a set of sentences under-represents the topics and themes possibly running through a document. A document may discuss several important topics. Unfortunately, in such documents, while a small selection of sentences typically conveys the information relating to one topic, it may fail to convey the existence of other topics in the document.
Accordingly, what is needed is a system and method for analyzing documents to a finer grain of topic identification and content characterization than when utilizing conventional techniques. In a preferred embodiment of the invention, the system and method should be able to analyze documents with multiple topics. The analysis would be used to produce summary-like abstractions of the documents. The system and method should be easy to implement and cost-effective. Furthermore, the content abstractions should contain relevant information from throughout the document, not just a selection of sentences that may miss significant topics. The present invention addresses these needs.
SUMMARY OF THE INVENTION
A method and system for characterizing the content of a document is disclosed. The method and system comprise identifying a plurality of discourse referents in the document, dividing the document into topically relevant document segments, and resolving co-referentiality among the discourse referents within, and across, the document segments. The method and system also comprise calculating salience values for the discourse referents based upon the resolving step, and determining topic stamps for the document segments based upon the salience values of the associated discourse referents. Finally, the method and system comprise providing summary-like abstractions, in the form of capsule overviews of each of the segments derived from its topic stamps. In so doing, a capsule overview is derived for the entire document, which will depict the core content of an average length article in a more accurate and representative manner than utilizing conventional techniques.
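The last two steps (salience calculation and topic stamp selection) can be sketched in Python as follows. This is only a toy illustration under strong assumptions: segmentation and co-reference resolution are taken as already done, so the input is a list of mentions already mapped to referent identifiers, and the role-based weights in `ROLE_WEIGHTS` are invented for the example, since this section does not state how salience values are computed.

```python
from collections import defaultdict

# Illustrative weights only; the text above does not specify salience factors.
ROLE_WEIGHTS = {"subject": 3, "object": 2, "other": 1}


def topic_stamps(mentions, top_n=2):
    """Pick the top_n most salient referents in each segment.

    mentions: iterable of (segment_id, referent_id, role) tuples, where
    co-referring expressions have already been mapped to one referent_id.
    """
    salience = defaultdict(lambda: defaultdict(int))
    for segment, referent, role in mentions:
        salience[segment][referent] += ROLE_WEIGHTS.get(role, 1)

    stamps = {}
    for segment, scores in salience.items():
        # Sort by descending salience, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        stamps[segment] = [referent for referent, _ in ranked[:top_n]]
    return stamps


# Hypothetical mentions from a two-segment document.
mentions = [
    (0, "Acme", "subject"),     # "Acme"
    (0, "Acme", "subject"),     # "the company", resolved to Acme
    (0, "widget", "object"),
    (0, "analysts", "other"),
    (1, "Globex", "subject"),
    (1, "Acme", "other"),
    (1, "Globex", "object"),    # "it", resolved to Globex
]
stamps = topic_stamps(mentions)
print(stamps)
```

Because pronouns and definite descriptions are counted toward the referent they resolve to, a referent mentioned repeatedly under different surface forms accumulates salience within its segment, and each segment receives its own stamps rather than one ranking for the whole document.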


