Reexamination Certificate
1999-07-12
2001-11-13
Edouard, Patrick N. (Department: 2644)
Data processing: speech signal processing, linguistics, language
Linguistics
Natural language
C707S793000
active
06317708
ABSTRACT:
COPYRIGHT NOTICE
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
BACKGROUND OF THE INVENTION
Extractive summarization is the process of selecting and extracting text spans, usually whole sentences, from a source document. The extracts are then arranged in some order (usually the order in which they appear in the source document) to form a summary. In this method, the quality of the summary depends on the scheme used to select the text spans from the source document. Most of the prior art uses a combination of lexical, frequency and syntactic cues to select whole sentences for inclusion in the summary. Consequently, the summaries cannot be shorter than the shortest text span selected and cannot combine concepts from different text spans in a simple phrase or statement. U.S. Pat. No. 5,638,543 discloses selecting sentences for an extractive summary by scoring them on the lexical items appearing in the sentences. U.S. Pat. No. 5,077,668 discloses an alternative sentence scoring scheme based upon markers of relevance such as hint words like “important”, “significant” and “crucial”. U.S. Pat. No. 5,491,760 works on bitmap images of a page to identify key sentences based on the visual appearance of hint words. U.S. Pat. Nos. 5,384,703 and 5,778,397 disclose selecting sentences scored on the inclusion of the most frequently used non-stop words in the entire text.
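By way of illustration only, the frequency-based style of extractive scoring described above might be sketched as follows. This is not the method of any of the cited patents; the stop-word list, the function names and the parameters `num_sentences` and `top_k` are assumptions introduced for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "it"}

def tokenize(text):
    """Lower-case the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def extractive_summary(text, num_sentences=1, top_k=5):
    """Return the highest-scoring whole sentences, in source order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count non-stop words over the entire text and keep the top_k of them.
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    frequent = {w for w, _ in counts.most_common(top_k)}
    # A sentence's score is the number of its tokens that are frequent words.
    scored = [(sum(1 for w in tokenize(s) if w in frequent), i)
              for i, s in enumerate(sentences)]
    # Highest score first; ties go to the earlier sentence.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:num_sentences]
    return [sentences[i] for i in sorted(i for _, i in best)]
```

Because whole sentences are returned unchanged, the output can never be shorter than the shortest selected sentence, which is the limitation noted above.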
In contrast to the large amount of work that has been undertaken in extractive summarization, there has been much less work on generative methods of summarization. A generative method of summarization selects words or phrases (not whole sentences) and generates a summary based upon the selected words or phrases. Early approaches to generative methods are discussed in the context of the FRUMP system. See DeJong, G. F., “An Overview of the FRUMP System”, Strategies for Natural Language Processing, (Lawrence Erlbaum Associates, Hillsdale, N.J. 1982). This system provides a set of templates for extracting information from news stories and presenting it in the form of a summary. Neither the selection of content nor the generation of the summary is learned by the system. The selection templates are handcrafted for a particular application domain. Other generative systems are known. However, none of these systems can: (a) learn rules, procedures, or templates for content selection and/or generation from a training set or (b) generate summaries that may be as short as a single noun phrase.
The method disclosed herein relates somewhat to the prior art for statistical modeling of natural language as applied to language translation. U.S. Pat. No. 5,510,981 describes a system that uses a translation model describing correspondences between sets of words in a source language and sets of words in a target language to achieve natural language translation. This system proceeds linearly through a document, producing a rendering in the target language of successive document text spans. It is not directed to operating on the entire document to produce a summary of the document.
SUMMARY OF THE INVENTION
As used herein, a “summary string” is a derivative representation of the source document which may, for example, comprise an abstract, key word summary, folder name, headline, file name or the like. Briefly, according to this invention, there is provided a computer method for generating a summary string from a source document of encoded text comprising the steps of:
a) comparing a training set of encoded text documents with manually generated summary strings associated therewith to learn probabilities that a given summary word or phrase will appear in summary strings given that a source word or phrase appears in an encoded text document; and
b) from the source document, generating a summary string containing a summary word, words, a phrase or phrases having the highest probabilities of appearing in a summary string based on the learned probabilities established in the previous step. Preferably, the summary string contains the most probable summary word, words, phrase or phrases for a preselected number of words in the summary string.
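Steps a) and b) can be illustrated with a minimal sketch, under simplifying assumptions that are not part of the disclosure: only single words are considered, a summary word is taken to be the same word as the source word, and the probability is estimated as a simple relative frequency. The function names and the `length` parameter are hypothetical.

```python
from collections import Counter

def learn_selection_probs(training_pairs):
    """Estimate P(word appears in summary | word appears in document).

    training_pairs is an iterable of (document_text, summary_text) tuples.
    """
    in_doc = Counter()   # number of training documents containing the word
    in_both = Counter()  # ... whose manual summary also contains the word
    for doc, summary in training_pairs:
        doc_words = set(doc.lower().split())
        summary_words = set(summary.lower().split())
        for word in doc_words:
            in_doc[word] += 1
            if word in summary_words:
                in_both[word] += 1
    return {word: in_both[word] / in_doc[word] for word in in_doc}

def select_summary_words(document, probs, length=3):
    """Pick the `length` most probable summary words from a new document."""
    unique = []
    for word in document.lower().split():
        if word not in unique:
            unique.append(word)
    # sorted() is stable, so ties keep their order of appearance in the document.
    ranked = sorted(unique, key=lambda word: -probs.get(word, 0.0))
    return ranked[:length]
```

Here `length` plays the role of the preselected number of words in the summary string. Words never seen in training receive probability zero in this sketch; a fuller treatment would smooth these estimates.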
In one embodiment, the training set of encoded manually generated summary strings is compared to learn the probability that a summary word or phrase appearing in a summary string will follow another summary word or phrase. Summary strings are generated containing the most probable sequence of words and/or phrases for a preselected number of words in the summary string.
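The word-sequence probabilities of this embodiment might be sketched as a bigram model estimated from the training summaries. The sketch below is an assumption-laden illustration rather than the disclosed implementation: it uses unsmoothed relative frequencies, a hypothetical start marker `<s>`, a hypothetical `floor` probability for unseen word pairs, and an exhaustive search over orderings that is practical only for a handful of candidate words.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

START = "<s>"  # hypothetical marker for the start of a summary

def learn_bigram_probs(summaries):
    """Estimate P(next word | previous word) from training summaries."""
    pair_counts = defaultdict(Counter)
    for summary in summaries:
        prev = START
        for word in summary.lower().split():
            pair_counts[prev][word] += 1
            prev = word
    return {prev: {w: c / sum(following.values()) for w, c in following.items()}
            for prev, following in pair_counts.items()}

def sequence_log_prob(words, bigrams, floor=1e-6):
    """Log-probability of a word sequence; unseen pairs get the floor value."""
    total, prev = 0.0, START
    for word in words:
        total += math.log(bigrams.get(prev, {}).get(word, floor))
        prev = word
    return total

def best_ordering(candidates, bigrams, length):
    """Most probable sequence of `length` distinct candidate words."""
    return max(permutations(candidates, length),
               key=lambda seq: sequence_log_prob(seq, bigrams))
```

In a combined system the candidate words would come from the content-selection probabilities, and the sequence score would be weighed together with them.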
In a preferred embodiment, the computer method, according to this invention, comprises comparing a training set of encoded text documents with manually generated summary strings associated therewith to learn the probabilities that a given summary word or phrase will appear in summary strings given that a source word or phrase appears in the encoded text, taking into account the context in which the source word or phrase appears in the encoded text documents. For example, the contexts that may be considered include titles, headings, standard paragraphs, fonts, bolding, and/or italicizing.
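As a hedged illustration of context-dependent probabilities, the sketch below keys each estimate on a (word, context) pair instead of on the word alone. The representation of a document as a list of (word, context) tuples and the context labels "title" and "body" are assumptions made for the example.

```python
from collections import Counter

def learn_context_probs(training_pairs):
    """Estimate P(word in summary | word appears in a given context).

    training_pairs is an iterable of (tagged_document, summary_text), where
    tagged_document is a list of (word, context) tuples.
    """
    seen = Counter()  # documents in which the word occurs in this context
    kept = Counter()  # ... and the word also occurs in the manual summary
    for tagged_doc, summary in training_pairs:
        summary_words = set(summary.lower().split())
        # Count each (word, context) pair at most once per document.
        for word, context in {(w.lower(), c) for w, c in tagged_doc}:
            seen[(word, context)] += 1
            if word in summary_words:
                kept[(word, context)] += 1
    return {key: kept[key] / seen[key] for key in seen}
```

With such a table, the same word can carry a higher summary probability when it occurs in a title than when it occurs in a standard paragraph.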
In yet another preferred embodiment, the computer method, according to this invention, further comprises learning multiple probabilities that a summary word or phrase will appear in a summary string given that a source word or phrase appears in the encoded text, taking into account the various usages of the word or phrase in the encoded text, for example, syntactic usages and semantic usages.
In a still further preferred embodiment, according to this invention, the step for comparing a training set of encoded manually generated summary strings takes into consideration external information in the form of queries, user models, past user interaction and other biases to optimize the form of the generated summary strings.
BRIEF DESCRIPTION OF THE DRAWING
Further features and other objects and advantages will become clear from the following detailed description made with reference to the drawing which is a schematic diagram illustrating the processing of text to produce summaries.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Referring now to the drawing, a collection of representative documents is assembled at 10 and corresponding manually generated summaries are assembled at 11. These comprise a training set. They are encoded for computer processing and stored in computer memory. They may be preprocessed to add syntactic and semantic tags.
The documents and summaries are processed in the translation model generator at 12 to build a translation model 13, which is a file containing the probabilities that a word found in a summary will be found in the document. The translation model generator constructs a statistical model describing the relationship between the text units or the annotated text units in documents and the text units or annotated text units used in the summaries of documents. The translation model is used to identify items in a source document 17 that can be used in summaries. These items may include words, parts of speech ascribed to words, semantic tags applied to words, phrases with syntactic tags, phrases with semantic tags, syntactic or semantic relationships established between words or phrases in the document, structural information obtained from the document, such as positions of words or phrases, mark-up information obtained from the document such as the existence of bold face or italics, or of headings or section numbers and so forth.
The summaries are processed by the language model generator 14 to produce a summary language model 15. The language model is a file containing the probabilities of each word or phrase found in the training set summaries following another word or phrase. The language model generator buil
Mittal Vibhu O.
Witbrock Michael J.
Edouard Patrick N.
Justsystem Corporation
Webb Ziesenheim & Logsdon Orkin & Hanson, P.C.