Data processing: presentation processing of document – operator i – Presentation processing of document – Layout
Reexamination Certificate
1998-10-21
2003-10-28
Herndon, Heather R. (Department: 2178)
C715S252000, C707S793000
active
06638317
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to an apparatus and method for summarizing machine-readable documents written in a natural language, etc. It is mainly intended to generate digests of rather long manuals, reports, etc., and to support the selection and reading of documents.
2. Description of the Related Art
Two main technologies are related to the present invention: generating a digest by extracting sentences using keywords in a document as clues, and detecting topic passages in the document. These conventional technologies are described below.
First, the digest generation technology is described. Roughly speaking, conventional digest generation uses one of two methods. The first method detects major parts in a document and generates a digest by extracting those parts. The major parts are usually extracted in units of logical elements such as sections, paragraphs, sentences, etc. In the description below, all of these units are referred to as “sentences”.
The second method prepares in advance patterns of information to be extracted for a digest, and generates a digest by extracting phrases and words in the document meeting the requirements of one of the patterns, or generates a digest by using sentences matching the pattern.
The first method is further classified into several methods according to the clue used to evaluate the importance of sentences. The following three methods are typical.
(1) A method of utilizing the use frequency and distribution of words in a document as clues.
(2) A method of utilizing the rhetorical structure and the position of sentences as clues.
(3) A method of evaluating the importance of sentences based on the sentence structure.
Method (1) first evaluates the importance of words (phrases) contained in a document, and then evaluates the importance of sentences according to how many keywords are contained in a sentence. Then, a digest is generated by selecting key sentences based on the evaluation result.
There are several well-known methods of evaluating the importance of words: a method of utilizing the use frequency of words in a document, a method of weighting the use frequency of words by the difference between their use frequency in the document and that in a more general collection of sentences, and a method of weighting the use frequency of words by the position in which they are used, for example, by assigning higher importance to a word in titles or headings.
Here, the target words are usually limited to independent words (particularly nouns) in the case of Japanese, and to content words in the case of English. Independent words and content words are both words with a substantial meaning, such as nouns, adjectives, verbs, etc., and are distinguished from words that play only a structural role, such as particles, prepositions, formal nouns, etc. Although the formal definition of an independent word in Japanese is a word which can by itself compose an independent clause, here the independent word is defined using the above distinction.
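As a rough illustration of method (1), the following Python sketch scores each sentence by how many high-frequency keywords it contains and returns the best-scoring sentences in document order. It is a minimal sketch under stated assumptions, not the procedure of any publication cited here: the stop-word list is a hypothetical stand-in for the content-word distinction described above, and the names content_words, digest, top_keywords and max_sentences are invented for this example.

```python
import re
from collections import Counter

# Hypothetical stop list standing in for "structural" words. A real system
# would more likely use part-of-speech information to keep content words.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or",
              "to", "is", "are", "it", "for", "by", "with"}


def content_words(text):
    """Return lowercase alphabetic tokens with the assumed structural words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def digest(sentences, top_keywords=3, max_sentences=1):
    """Pick the sentences containing the most keyword occurrences.

    Keywords are simply the most frequent content words in the whole
    document (plain use frequency, with no positional weighting).
    """
    freq = Counter(w for s in sentences for w in content_words(s))
    keywords = {w for w, _ in freq.most_common(top_keywords)}
    # Score = number of keyword occurrences in the sentence.
    scored = [(sum(1 for w in content_words(s) if w in keywords), i)
              for i, s in enumerate(sentences)]
    # Highest score first; earlier sentences win ties.
    best = sorted(scored, key=lambda p: (-p[0], p[1]))[:max_sentences]
    # Restore document order for the selected sentences.
    return [sentences[i] for i in sorted(i for _, i in best)]
```

Weighting by position (for example, counting heading words more heavily) could be added by multiplying each word count by a position-dependent factor before ranking.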
These digest generation methods include, for example, the following. In the Japanese Laid-open Public Patent Publication No. 6-259424 “Document Display Apparatus and Digest Generator Apparatus, and Digital Copying Apparatus” and a document by the inventor of that invention (Masayuki Kameda, “Extraction of Major Keywords and Key Sentences by Pseudo-Keyword Correlation Method”, in the Proceedings of the Second Annual Meeting of Association for Natural Language Processing, pp.97 to 100, March 1996), a digest is generated by extracting parts including many words appearing in the headings as important parts relating to the headings.
In the Japanese Laid-open Public Patent Publication No. 7-36896 “Method and Apparatus for Generating Digest”, major expression seeds are selected based on the complexity (length of a word, etc.) of an expression (word, etc.) used in a document, and a digest is generated by extracting sentences that include more seeds of high importance.
In the Japanese Laid-open Public Patent Publication No. 8-297677 “Method of Automatically Generating a Digest of Topics”, topical terms are detected based on the use frequency of words in a document, and a digest is generated by extracting sentences containing many major topical terms.
In the Japanese Laid-open Public Patent Publication No. 2-254566 “Automatic Digest Generator Apparatus”, words having a high use frequency are detected as keywords, and a digest is generated by extracting parts where the keywords first appear, parts containing many keywords, sentences at the beginning of automatically detected semantic paragraphs, etc.
Next, the method of detecting topic passages in a document is described below. Roughly speaking, there are the following two methods.
(1) A method based on the lexical cohesion of a topic due to words repeatedly used in a document
(2) A method of determining a rhetorical structure based on the coherence relation between sentences indicated by conjunctions, etc.
For method (1) based on the lexical cohesion, first, the Hearst method (Marti A. Hearst, “Multi-paragraph Segmentation of Expository Text”, in the Proceedings of the 32nd Annual Meeting of Association for Computational Linguistics, pp.9 to 16, 1994) is briefly described below.
This method (hereinafter called “Hearst method”) is one of those that automatically detect a break in a topic flow based on the linguistic phenomenon that an identical word is used repeatedly in related parts of text (lexical cohesion). The Hearst method first calculates the lexical similarity of every pair of adjacent blocks of text, which are set up before and after a certain position in a document, each with a fixed size of about one paragraph (approximately 120 words). The lexical similarity is calculated by a cosine measure as follows:
$$\mathrm{sim}(b_l, b_r) = \frac{\sum_t W_{t,b_l}\, W_{t,b_r}}{\sqrt{\sum_t W_{t,b_l}^2 \, \sum_t W_{t,b_r}^2}} \qquad (1)$$
where $b_l$ and $b_r$ indicate a left block (a block on the backward side of a document) and a right block (a block on the forward side of the document), respectively, and $W_{t,b_l}$ and $W_{t,b_r}$ indicate the use frequency of a word t in the left and right blocks, respectively. $\sum_t$ on the right-hand side of equation (1) is a summation over the different words t.
The more vocabulary the two blocks have in common, the greater the similarity score of equation (1) becomes (maximum 1). Conversely, if there is no common vocabulary, the similarity score takes its minimum value, 0. That is, a greater similarity score indicates a higher possibility that a common topic is handled in both blocks, while a smaller similarity score indicates a higher possibility that the point between the blocks is a topic boundary.
The Hearst method compares the value of equation (1) from the beginning of a document to the end at certain intervals (20 words), and recognizes a position having a minimal value as a topic boundary. At this time, the following adjustment is performed in order to ignore fine fluctuations of the similarity score. First, a part surrounding the point mp having a minimal value (hereinafter called a “minimal point”) is extracted so that the part includes both a part where the similarity score decreases monotonically on the left side of the minimal point and a part where the similarity score increases monotonically on the right side of the minimal point.
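The cosine measure of equation (1) can be sketched in Python as follows. This is a minimal illustration that assumes each block has already been tokenized into a list of words; the function name block_similarity and the choice to return 0.0 for an empty block are assumptions made for this example.

```python
import math
from collections import Counter


def block_similarity(left_words, right_words):
    """Cosine similarity between the word-frequency vectors of two blocks."""
    left, right = Counter(left_words), Counter(right_words)
    # Numerator: sum over shared words of (frequency in left * frequency in right).
    # Words missing from either block contribute zero, so they are skipped.
    dot = sum(left[t] * right[t] for t in left.keys() & right.keys())
    # Denominator: square root of the product of the summed squared frequencies.
    norm = math.sqrt(sum(c * c for c in left.values())
                     * sum(c * c for c in right.values()))
    # Assumption: an empty block yields similarity 0.0 instead of dividing by zero.
    return dot / norm if norm else 0.0
```

Two blocks with identical word frequencies score 1, and two blocks with no word in common score 0, matching the range described above.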
Then, based on the similarity scores $C_{lp}$, $C_{mp}$ and $C_{rp}$ at the start point lp, the minimal point mp, and the end point rp, respectively, of the extracted part, a value ds (depth score), which indicates the fluctuation steepness of the similarity score at the minimal point, is calculated as follows:
$$ds = (C_{lp} - C_{mp}) + (C_{rp} - C_{mp}) \qquad (2)$$
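The depth score of equation (2) can be sketched in Python as follows, assuming the similarity scores have already been computed at regular intervals and stored in a list. The function name depth_scores is invented for this example, and restricting the search to strict interior minima is a simplifying assumption, not something the description above specifies.

```python
def depth_scores(sims):
    """Return (index, ds) for each strict interior local minimum of sims.

    sims is a sequence of similarity scores taken at successive positions.
    """
    results = []
    for mp in range(1, len(sims) - 1):
        # Assumption: only strict local minima are treated as minimal points.
        if not (sims[mp] < sims[mp - 1] and sims[mp] < sims[mp + 1]):
            continue
        # Walk left while the score keeps rising, to find the start point lp
        # of the monotonically decreasing run that ends at mp.
        lp = mp
        while lp > 0 and sims[lp - 1] > sims[lp]:
            lp -= 1
        # Walk right while the score keeps rising, to find the end point rp
        # of the monotonically increasing run that starts at mp.
        rp = mp
        while rp < len(sims) - 1 and sims[rp + 1] > sims[rp]:
            rp += 1
        # Equation (2): ds = (C_lp - C_mp) + (C_rp - C_mp)
        ds = (sims[lp] - sims[mp]) + (sims[rp] - sims[mp])
        results.append((mp, ds))
    return results
```

For the scores [0.5, 0.75, 0.25, 0.5, 1.0, 0.75], the only minimal point is index 2, with lp at index 1 and rp at index 4, giving ds = (0.75 - 0.25) + (1.0 - 0.25) = 1.25.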
Then, only when ds exceeds a threshold h calculated as follows, is the minimal point recognized as a topic boundary.
h=C
Fujitsu Limited
Huynh Cong-Lac
Staas & Halsey , LLP