Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-01-13
2001-03-20
Feild, Joseph H. (Department: 2172)
C707S793000
active
06205456
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to an apparatus for summarizing an electronic document written in a natural language. It has been developed to support selecting and accessing documents from a large volume of retrieved documents, and accessing, restructuring (repeatedly using), and managing a large volume of accumulated documents.
Recently, documents have increasingly been stored on electronic media, and an explosively increasing number of documents are accessed and repeatedly used on computers through new document communications media such as the Internet and intranets. At the same time, technological development is accompanied by a larger volume and a larger variety of technical documents, thereby increasing the demand for accumulating and repeatedly using a large volume of documents.
With such a large volume of documents, the usefulness of each document should be quickly determined so that a document appropriate to the purpose can be selected. To attain this, it is necessary to display a list of documents together with information indicating the contents of the documents. Such information can be a title or an abstract of a document. However, the title may not actually represent the contents of the document, or an abstract may be missing. When a document is accessed online, the number of characters that can be displayed is limited. Therefore, an abstract may not be appropriately displayed because it contains too many characters. Thus, a technology for automatically generating an appropriate summary is in strong demand.
To use documents efficiently and repeatedly, a large volume of documents should be properly classified and arranged when accumulated. Here, an appropriate summary is required to quickly understand the contents of a new document to be classified, to obtain an outline of the classification so that the administrator of the accumulated documents can improve the classification system, and to inform a user unfamiliar with the classification system of the actual classification.
A feature of the present invention is that the document summarization apparatus adjusts the summarization result depending on the focused concept and the known concept of the user.
2. Description of the Related Art
There have been two major methods of generating the summary of a document in the conventional document summarization technology. The first method is to recognize and extract an important portion of a document (normally a logical element of the document such as a sentence, a paragraph, or a section, hereinafter referred to as a sentence), and generate a summary from it. The second method is to prepare a pattern of information to be extracted as a summary, and to make a summary by extracting words or phrases in the document that satisfy the conditions of the pattern, or by extracting sentences that match the pattern. Since the second method has little relation to the present invention, only the first method is described below.
The first method is further divided into a few submethods depending on what is used as the key for evaluating the importance of a sentence. Typical methods depend on:
1. occurrence and distribution of words in a document; and
2. coherence relation between sentences and position where a sentence appears.
(The importance of a sentence can also be evaluated by the syntax pattern of a sentence, but this method is omitted here because it hardly relates to the present invention.)
In method 1, that is, the method depending on the occurrence and distribution of words in a document, the importance of a word (phrase) contained in a document is normally determined first, and then the importance of a sentence is evaluated depending on the number of important words contained in the sentence. Then, important sentences are selected and a summary is generated. The importance of a word is calculated by using the occurrence of the word in a document, which can be weighted by taking into account the deviation of the occurrence of the word from its occurrence in a common document set, or the position where the word appears (for example, a word appearing in a title is regarded as an important word). Normally, a focused word is an independent word in Japanese (especially a noun), and a content word in English. An independent word and a content word refer to a word having a substantial meaning, such as a noun, adjective, or verb, that can be distinguished from syntactic words such as a preposition or an auxiliary. The formal definition of an independent word in Japanese is a word which can form part of an independent section in a sentence. This is a little different from the description above, but the purpose of limiting a focused word to an independent word is as described above.
For example, method 1 is described in the following documents.
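As a rough illustration of method 1, the following Python sketch weights each word by its occurrence count, boosts words that also appear in the title, and scores each sentence by the weights of the distinct content words it contains. It is only a minimal sketch under stated assumptions: the stop list, the `title_boost` factor, and all function names are hypothetical and are not taken from any of the cited publications.

```python
import re
from collections import Counter

# Assumed, illustrative stop list standing in for "syntactic words".
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "and", "to", "by", "for"}


def content_words(text):
    """Return lowercase alphabetic tokens that are not in the assumed stop list."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def word_importance(sentences, title="", title_boost=2.0):
    """Weight = occurrence count in the document, multiplied by
    title_boost (an assumed factor) when the word also appears in the title."""
    counts = Counter(w for s in sentences for w in content_words(s))
    title_words = set(content_words(title))
    return {w: c * (title_boost if w in title_words else 1.0) for w, c in counts.items()}


def rank_sentences(sentences, title="", top_n=1):
    """Select the top_n sentences by summed word weight, kept in document order."""
    weights = word_importance(sentences, title)
    scored = [(sum(weights[w] for w in set(content_words(s))), i)
              for i, s in enumerate(sentences)]
    best = sorted(scored, key=lambda t: (-t[0], t[1]))[:top_n]
    return [sentences[i] for _, i in sorted(best, key=lambda t: t[1])]
```

Calling `rank_sentences(sentences, title="...", top_n=2)` returns the two highest-scoring sentences in their original order. The deviation from a common document set mentioned above is deliberately left out to keep the sketch short.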
In the Japanese Laid-open Patent Publication (Tokkaihei) No. 06-259424 “Document Display Apparatus, Document Summarization Apparatus, and Digital Copy Apparatus” and the following document 1 by the same author, a summary is generated by extracting a portion containing a number of words contained in the title as an important portion related to the title.
Document 1: Masayuki Kameda, “Extraction of Important Keyword and Important Sentence by Pseudokeyword Correlation Method”, disclosed in the second annual meeting, Association for Natural Language Processing, pp. 97-100, March 1996.
In the Japanese Laid-open Patent Publication (Tokkaihei) No. 07-36896 “Document Summarization Method and Apparatus”, a seed for an important representation is selected based on the complexity (word length, etc.) of the representation (word, etc.) in a document, and a summary is generated by extracting a sentence containing a larger number of important seeds.
In the Japanese Laid-open Patent Publication (Tokkaihei) No. 08-297677 “Automatic Method of Generating Summary of Subject”, words of main subjects are recognized in order from the highest occurrence of a word in a document, and a summary is generated by extracting a sentence containing a larger number of important subject words.
In the Japanese Laid-open Patent Publication (Tokkaihei) No. 06-215049 “Document Summarization Apparatus”, a summary is generated by extracting a sentence or paragraph whose feature vector is similar to that of the entire document, applying a vector space model of the kind often used in determining the relevance between a retrieval result and a query sentence. A vector space model represents the features of a document and of a query sentence as feature vectors that indicate the existence or occurrence of words in the document and the query sentence, with a dimension (axis) assigned to each keyword or each meaning element of a word.
In method 2, that is, the method depending on the coherence relation between sentences and the position of a sentence, an important sentence is selected by determining the (relative) importance of the sentence based on the conjunction of sentences (also referred to as the coherence relation), such as ‘and’, ‘but’, and ‘then’, and on the position where the sentence appears in a document. This method is described in, for example, the Japanese Laid-open Patent Publication (Tokkaihei) No. 07-182373 “Document Information Retrieval Apparatus and Document Retrieval Result Display Method” and the following document 2 by the same applicant and document 3 by other applicants.
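The vector space idea can be sketched in Python as follows: each word is one dimension, each text becomes a vector of occurrence counts, and the sentence whose vector is closest to the vector of the whole document is extracted. This is a minimal sketch, not the method of the cited publication: cosine similarity is assumed here as the similarity measure, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Sparse feature vector: one dimension per word, value = occurrence count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_representative(sentences):
    """Return the sentence whose vector is most similar to the whole document's."""
    doc_vec = term_vector(" ".join(sentences))
    return max(sentences, key=lambda s: cosine(term_vector(s), doc_vec))
```

The same `cosine` function could compare a document vector with a query-sentence vector, which is the retrieval use of the model mentioned above.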
Document 2: Kazuo Sumita, Tetsuo Tomono, Kenji Ono, and Seiji Miike. “Automatic abstract generation based on document structure analysis and its evaluation as document retrieval presentation function”. Transactions of the Institute of Electronics, Information and Communication Engineers, Vol.J78-D-II, No. 3, pp.511-519, March 1995 (in Japanese).
Document 3: Kazuhide Yamamoto, Shigeru Masuyama, and Shozo Naito. “GREEN: An experimental system generating summary of Japanese editorials by combining multiple discourse characteristics”. IPSJ SIG Notes NL-99-3, Information Processing Society of Japan, January 1994 (in Japanese).
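The general shape of method 2 can be illustrated with the following Python sketch, which assigns each sentence a relative score from its leading connective and from its position in the document. Everything specific here is an assumption made for illustration: the weight table, the bonus for the first and last sentences, and the function name are hypothetical, and the cited works analyze document structure in far more detail than this.

```python
# Assumed, illustrative weights per leading connective; these values are
# hypothetical and are not taken from the cited publications.
CONNECTIVE_WEIGHT = {"but": 1.5, "then": 1.0, "and": 0.8}


def relative_importance(sentences):
    """Score each sentence from its leading connective and its position."""
    scores = []
    last = len(sentences) - 1
    for i, sentence in enumerate(sentences):
        words = sentence.split()
        first_word = words[0].lower().strip(",") if words else ""
        # Sentences without a listed connective get the neutral weight 1.0.
        score = CONNECTIVE_WEIGHT.get(first_word, 1.0)
        # Assumed position bonus for the opening and closing sentences.
        if i == 0 or i == last:
            score += 1.0
        scores.append(score)
    return scores
```

The highest-scoring sentences would then be selected for the summary, in the same way as in the occurrence-based method.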
Feild Joseph H.
Fujitsu Limited
Kindred Alford W.
Staas & Halsey , LLP