Keyword extraction apparatus, keyword extraction method, and...

Data processing: speech signal processing – linguistics – language – Linguistics – Translation machine

Reexamination Certificate

Details

C704S009000, C704S010000, C704S260000, C707S793000

Reexamination Certificate

active

06173251

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a keyword extraction apparatus, a keyword extraction method and a computer readable recording medium storing a keyword extraction program, which are used in a system for retrieving a document written in natural language to automatically extract keywords from the document beforehand for creating an index of the document in terms of keywords and, at the time of retrieval, to extract a keyword from an input sentence for retrieving the document through collation of the keyword.
2. Description of the Related Art
As a method of retrieving documents in electronic form, it has been hitherto known to previously assign keywords to a document in the form of an index and, at the time of retrieval, to search the document by collating a designated keyword with the keywords assigned to the document. This method has problems in that manually assigning keywords to a document requires a lot of time and labor, and the retrieval cannot work if the keywords assigned by a person who has engaged in creating the index differ from keywords designated by persons who are going to perform retrieval.
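The index-based retrieval described above can be sketched as follows. This is a minimal illustration only; the function names `build_index` and `retrieve` and the sample documents are hypothetical and do not come from the cited publication. It shows why retrieval fails when the designated keyword differs from the assigned one.

```python
from collections import defaultdict

def build_index(doc_keywords):
    """Map each assigned keyword to the set of documents it was assigned to."""
    index = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return index

def retrieve(index, designated_keyword):
    """Collate the designated keyword with the assigned keywords (exact match)."""
    return sorted(index.get(designated_keyword, set()))

# Hypothetical documents and manually assigned keywords.
index = build_index({
    "doc1": ["keyword extraction", "index"],
    "doc2": ["retrieval"],
})

hits = retrieve(index, "index")        # same wording as the indexer used
misses = retrieve(index, "indexing")   # different wording, so nothing is found
```

Because the collation is an exact match, a searcher who designates "indexing" instead of "index" retrieves nothing, which is the second problem the paragraph describes.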
For lessening time and labor required to assign keywords, methods of automatically extracting keywords from documents in electronic form have been proposed.
FIG. 64 is a block diagram showing a conventional keyword extraction system disclosed in, for example, Japanese Unexamined Patent Publication No. 8-30627. In FIG. 64, denoted by 6401 is a character type discriminating portion for discriminating types of individual characters in an input text and then transferring the discriminated types to character type storage means 6402. The character type storage means 6402 stores the types and corresponding positions of the individual characters in the input text which have been discriminated by the character type discriminating portion 6401. Denoted by 6403 is an effective-character-type character string cutting portion for cutting out all effective-character-type character strings, each of which extends for as long as characters of any of the four effective character types, i.e., katakana (the square form of the Japanese hiragana letters), kanji (Chinese characters), alphabets and numerals, continue, based on the information stored in the character type storage means 6402.
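The character type discrimination and the cutting of effective-character-type character strings might be sketched as below. This is an illustrative sketch, not the implementation of the cited publication: the function names, the simplified Unicode ranges, and the treatment of the long-vowel mark as katakana are all assumptions. The sample string is an assumed reconstruction consistent with the romanized example "oekaki mohdo".

```python
EFFECTIVE_TYPES = {"katakana", "kanji", "alphabet", "numeral"}

def char_type(ch):
    """Classify one character by Unicode range (simplified; an assumption)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x3096:
        return "hiragana"
    if 0x30A1 <= code <= 0x30FA or code == 0x30FC:  # 0x30FC: long-vowel mark
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isascii() and ch.isdigit():
        return "numeral"
    return "other"

def discriminate_types(text):
    """Return (position, type) pairs, as kept by the character type storage."""
    return [(pos, char_type(ch)) for pos, ch in enumerate(text)]

def cut_effective_strings(text):
    """Cut out maximal runs of effective-type characters as (start, string)."""
    runs, start = [], None
    for i, ch in enumerate(text):
        if char_type(ch) in EFFECTIVE_TYPES:
            if start is None:
                start = i          # a new run begins here
        elif start is not None:
            runs.append((start, text[start:i]))
            start = None
    if start is not None:          # run reaching the end of the text
        runs.append((start, text[start:]))
    return runs

sample = "お絵描きモード"  # assumed sample text
types = discriminate_types(sample)
runs = cut_effective_strings(sample)
```

A run may mix several effective types (for example kanji followed by numerals); splitting such a run at type changes is left to the boundary discrimination step.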
Denoted by 6406 is a character-type boundary discriminating portion for discriminating all boundary positions between different character types of all the effective-character-type character strings based on the information stored in the character type storage means 6402, and then transferring the discriminated positions to character-type segmentation point storage means 6407. The character-type segmentation point storage means 6407 stores every boundary position, at which the character type changes from one to another, discriminated by the character-type boundary discriminating portion 6406.
Denoted by 6409 is affix storage means for storing affixes of high frequency. 6410 is an affix discriminating portion for discriminating all affixes in a character string and then transferring the discriminated affixes to affix segmentation point storage means 6411. The affix segmentation point storage means 6411 stores, as affix segmentation points, positions before and behind all the affixes discriminated by the affix discriminating portion 6410.
Denoted by 6413 is basic word storage means for storing, as basic words, nouns of high frequency. 6414 is a basic-word discriminating portion for discriminating all basic words in a character string and then transferring the discriminated basic words to basic-word segmentation point storage means 6415. The basic-word segmentation point storage means 6415 stores, as basic-word segmentation points, positions before and behind all the basic words discriminated by the basic-word discriminating portion 6414.
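Affix discrimination and basic-word discrimination both record the positions before and behind each dictionary entry found in a character string, so one helper can illustrate both. The sketch below is an assumption-laden illustration: `segmentation_points` is a hypothetical name, and the sample affixes and basic words are invented, not taken from the cited publication.

```python
def segmentation_points(string, entries):
    """Return sorted positions before and behind every occurrence of an entry."""
    points = set()
    for entry in entries:
        if not entry:
            continue                      # ignore empty dictionary entries
        start = string.find(entry)
        while start != -1:
            points.add(start)             # position before the entry
            points.add(start + len(entry))  # position behind the entry
            start = string.find(entry, start + 1)
    return sorted(points)

affixes = {"化", "的"}       # hypothetical high-frequency affixes
basic_words = {"情報"}       # hypothetical high-frequency nouns

affix_points = segmentation_points("機械化", affixes)
basic_points = segmentation_points("情報検索", basic_words)
```

Searching again from `start + 1` lets overlapping occurrences be found; whether the original system handles overlaps this way is not stated in the text.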
Denoted by 6412 is a partial-character-string cutting portion for cutting out partial character strings based on the character-type segmentation points stored in the character-type segmentation point storage means 6407, the affix segmentation points stored in the affix segmentation point storage means 6411, or the basic-word segmentation points stored in the basic-word segmentation point storage means 6415.
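The text does not spell out exactly which partial character strings are cut from a given set of segmentation points. The sketch below assumes the simplest reading, namely that the string is split at every stored point; the function name `cut_partial_strings` and the sample data are hypothetical.

```python
def cut_partial_strings(string, points):
    """Split a string at every segmentation point that lies strictly inside it.

    Assumption: the three kinds of segmentation points are merged and the
    string is cut at each of them. A string with no interior point is
    returned whole.
    """
    interior = {p for p in points if 0 < p < len(string)}
    bounds = sorted({0, len(string)} | interior)
    return [string[a:b] for a, b in zip(bounds, bounds[1:])]

with_basic_word = cut_partial_strings("情報検索", [0, 2])
with_affix = cut_partial_strings("機械化", [2, 3])
no_points = cut_partial_strings("モード", [])
```

With no segmentation point the whole string is emitted unchanged, which agrees with the worked example in which the cut-out strings pass through this step intact.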
Denoted by 6404 is a noun discriminating portion which, when a character succeeding each of the effective-character-type character strings cut out by the effective-character-type character string cutting portion 6403 is hiragana, compares the hiragana with hiragana character strings stored in noun-succeeding-hiragana storage means 6405, and then deletes the effective-character-type character string when a head portion of the hiragana succeeding that effective-character-type character string does not match any of the hiragana character strings stored in the noun-succeeding-hiragana storage means 6405.
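The noun discrimination rule can be sketched as a predicate that decides whether a cut-out string survives. The names `is_hiragana` and `is_kept_as_noun`, the stored hiragana strings, and the sample texts are assumptions for illustration only.

```python
def is_hiragana(ch):
    """Simplified hiragana test by Unicode range (an assumption)."""
    return 0x3041 <= ord(ch) <= 0x3096

def is_kept_as_noun(text, start, string, noun_succeeding_hiragana):
    """Return False when the string should be deleted by the noun check."""
    end = start + len(string)
    # No succeeding character, or it is not hiragana: nothing to compare.
    if end >= len(text) or not is_hiragana(text[end]):
        return True
    # Collect the run of hiragana that follows the string.
    stop = end
    while stop < len(text) and is_hiragana(text[stop]):
        stop += 1
    following = text[end:stop]
    # Keep the string only if the head of that run matches a stored string.
    return any(following.startswith(h) for h in noun_succeeding_hiragana)

stored = {"の", "を", "が", "は", "に", "で"}  # hypothetical stored hiragana
sample = "お絵描きモード"                      # assumed sample text

kanji_kept = is_kept_as_noun(sample, 1, "絵描", stored)
katakana_kept = is_kept_as_noun(sample, 4, "モード", stored)
```

In this assumed sample the kanji run is followed by a hiragana character that is not in the stored set, so it is deleted, while the katakana run at the end of the text has no succeeding hiragana and is kept.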
Denoted by 6416 is a basic-word deleting portion for deleting the partial character string which matches with any of the basic words stored in the basic word storage means 6413.
Denoted by 6417 is a necessary keyword storage means for storing keyword character strings designated beforehand. 6418 is a necessary keyword cutting portion which, when character strings matching with the character strings stored in the necessary keyword storage means 6417 appear in a text, cuts out all those character strings and adds them to keywords.
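The last two stages, deleting basic words and adding necessary keywords, might be combined as in the sketch below. The function name `finalize_keywords`, the duplicate suppression, and the sample data are assumptions; the text only states that matching strings are cut out and added to the keywords.

```python
def finalize_keywords(text, candidates, basic_words, necessary_keywords):
    """Delete candidates that are basic words, then add necessary keywords."""
    # Basic-word deletion: drop any candidate equal to a stored basic word.
    keywords = [c for c in candidates if c not in basic_words]
    # Necessary keyword cutting: add each designated string found in the text.
    for keyword in necessary_keywords:
        if keyword in text and keyword not in keywords:  # dedup is an assumption
            keywords.append(keyword)
    return keywords

sample = "お絵描きモード"  # assumed sample text
result = finalize_keywords(
    sample,
    candidates=["モード"],
    basic_words={"情報"},            # hypothetical basic words
    necessary_keywords=["お絵描き"],  # hypothetical designated keyword
)
```

Passing the necessary keywords as a list keeps the output order deterministic, which is a convenience of this sketch rather than something the source requires.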
The operation of the conventional keyword extraction system will be described below. The description will be made on the case of entering a text “ (oekaki mohdo=painting mode)”, for example.
First, the character type discriminating portion 6401 discriminates types of individual characters in an input text, and the character type storage means 6402 stores the types and corresponding positions of the individual characters in such a way that the first character is hiragana, the second character is kanji, the third character is kanji, the fourth character is hiragana, and so on.
Next, the effective-character-type character string cutting portion 6403 cuts out “ ” and “ ”. Since there are no differences in character type within the partial character strings of “ ” and “ ”, character-type segmentation points are not stored in the character-type segmentation point storage means 6407. Also, since no affixes are included in the partial character strings of “ ” and “ ”, affix segmentation points are not stored in the affix segmentation point storage means 6411. Further, since no basic words are included in the partial character strings of “ ” and “ ”, basic-word segmentation points are not stored in the basic-word segmentation point storage means 6415.
Then, since “ ” and “ ” do not include any character-type segmentation point, affix segmentation point or basic-word segmentation point, the partial-character-string cutting portion 6412 eventually cuts out two partial character strings of “ ” and “ ”.
Subsequently, since hiragana “ ” succeeding “ ” is not registered in the noun-succeeding-hiragana storage means 6405, the noun discriminating portion 6404 deletes “ ”. On the other hand, since there is no hiragana succeeding “ ”, “ ” is not deleted in the noun discriminating portion 6404. The basic-word deleting portion 6416 then deletes the basic word which matches with any of those stored in the basic word storage means 6413. If “ ” is assumed here not to be a basic word, “ ” would not be deleted.
Next, the necessary keyword cutting portion 6418 cuts out “ ” from the text “ ” stored in the necessary keyword storage means 6417 and adds it to keywords. Finally, “ ” and “ ” are output.
When “ ” or “ ” is designated as a retrieval key at the time of retrieval, the document including the original text “ ” is retrieved.
In retrieval with the thus-constructed keyword extraction system disclosed in Japanese Unexamined Patent Publication No. 8-30627, the retrieval is hit only when the character string designated as a keyword

