Weighting method for use in information extraction and...

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Reexamination Certificate


Details

C704S001000, C707S793000

active

06240378

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to an information abstracting method, an information abstracting apparatus, and a weighting method, which can be used when extracting prescribed keywords from a plurality of character string data sets divided into prescribed units, such as data provided by teletext services, and also relates to a teletext broadcast receiving apparatus.
2. Description of the Related Art
In recent years, with the advent of the multimedia age, a large variety of information has come to be provided not only in the form of packaged media such as CD-ROMs but also through communications networks, commercial broadcasts, and the like. Such information includes textual information, provided by electronic books, teletext broadcasts, etc., in addition to video and voice information. Textual information is made up of character codes, such as the ASCII code and the JIS code, that can be readily processed by computer. For human beings, however, textual information poses problems: the amount of information that can be displayed at one time is small, and it takes longer to grasp the main points than with image information. These problems will become an important concern as the amount of information increases with the advance of the information society. One possible approach to these problems is to develop techniques for automatically interpreting the content of a document and rendering it into an easy-to-understand form. One such approach is the study of natural language processing in the research field of artificial intelligence. For practical implementation, however, many problems remain to be overcome, such as the need for large amounts of dictionary and grammatical information and the difficulty of reducing the probability of erroneously interpreting the content of text to a practical level, and so far there are few practical applications.
On the other hand, in recent years, receivers designed to receive teletext broadcasts that are transmitted as character codes over the air have been developed and made commercially available, and the volume of textual information provided to homes has been rapidly increasing. In teletext broadcasting, large numbers of programs are provided, and since the information is provided in the form of text, the user obtains it by reading text displayed on a television screen. This in turn presents a problem: to grasp the whole content, the user has to read a large number of characters, turn over the pages in sequence, and so on. Moreover, in the case of news or the like, the user usually has no previous knowledge of what information will be provided, and therefore is not aware of what information is of interest to him. It is therefore difficult for the user to extract only the information that he needs; as a result, the user has to select the necessary information by himself after looking through the whole content. This means that considerable time has to be spent in getting the necessary information, which has been a barrier to increasing the number of users who enjoy teletext broadcasts. There is therefore a growing need for an information abstracting facility that abstracts the main points of information in teletext and displays only those main points. Some teletext channels broadcast an abstract of major news, but quite a few problems remain: for example, the abstract itself consists of several pages, and the content, format, and length of an abstract differ from one broadcast station to another.
Among information abstracting techniques for document data that have so far been put to practical use is the keyword extraction technique. Intended for scientific papers and the like, this technique involves calculating the frequencies of occurrence of technical terms and the like used in a paper and selecting keywords of high frequency of occurrence to produce an abstract of the paper. The reason that such a technique has been put to practical use is that it is intended for documents, such as papers in specific fields, where the number of frequently used terms is more or less limited. For such fields, it is relatively easy to prepare a dictionary of terms to be extracted as keywords. The keywords automatically extracted using this technique are appended to each paper and used for the sorting out and indexing of the papers.
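The conventional technique described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `extract_keywords`, the regular-expression tokenizer, and the sample text and term dictionary are all assumptions made for illustration. The sketch counts only terms found in a prepared dictionary and returns those with the highest frequency of occurrence.

```python
import re
from collections import Counter


def extract_keywords(text, term_dictionary, top_n=3):
    """Return the top_n dictionary terms ranked by frequency of occurrence.

    Hypothetical helper: the tokenizer and ranking are illustrative
    assumptions, not the procedure of any particular prior-art system.
    """
    # Split the text into lowercase alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Count only tokens that appear in the prepared term dictionary.
    counts = Counter(t for t in tokens if t in term_dictionary)
    return counts.most_common(top_n)


# Invented sample data for a paper in a narrow technical field.
paper = (
    "The laser pulse excites the plasma. The plasma emits light, "
    "and the laser pulse is repeated. Plasma density rises."
)
terms = {"laser", "plasma", "pulse", "density"}

keywords = extract_keywords(paper, terms)
print(keywords)
```

The sketch depends entirely on the term dictionary being available in advance, which is the limitation the following paragraphs raise for news programs rich in proper nouns.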
However, if the above-described keyword extraction technique is applied to the abstracting of a teletext broadcast program, it simply extracts keywords of high frequency of occurrence from the keywords appearing in the program. The result is the extraction of many keywords relating to similar things, and an abstract constructed from such keywords will be redundant. Furthermore, in the case of teletext broadcasts, there often arises a need to extract topics common to a plurality of programs as timely information, besides an abstract of the content of a particular program. For example, when news programs are being broadcast on a plurality of channels, a need may arise to abstract information so that common topics can be extracted as the current trend from the news programs on the different channels. In such cases, a technique is necessary that distinguishes the keywords repeatedly appearing within the same program from the keywords appearing across the plurality of programs. Furthermore, if one attempts to apply the conventional natural language processing technique to news programs, for example, it is not possible to prepare an appropriate terminology dictionary in advance, since news programs tend to contain very many proper nouns. Accordingly, the prior art techniques cannot be applied as they are. Hence, there is a need for an information abstracting technique that can handle information trends and that does not require a terminology dictionary.
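The distinction drawn above, between a keyword repeated within one program and a keyword appearing across several programs, can be made concrete with a small sketch. This is only an illustration of the two kinds of count, not the weighting method of the invention; the function name `keyword_statistics`, the channel names, and the keyword lists are hypothetical.

```python
from collections import Counter


def keyword_statistics(programs):
    """Compute two kinds of keyword counts (illustrative assumption).

    programs maps a program name to its list of keywords.
    Returns (intra, inter):
      intra[name][kw] = occurrences of kw within that one program
      inter[kw]       = number of distinct programs containing kw
    """
    intra = {name: Counter(kws) for name, kws in programs.items()}
    inter = Counter()
    for counts in intra.values():
        # Each program contributes at most 1 per keyword.
        inter.update(counts.keys())
    return intra, inter


# Invented keyword lists for three news programs.
programs = {
    "channel_1": ["election", "election", "budget", "typhoon"],
    "channel_2": ["typhoon", "election", "baseball"],
    "channel_3": ["typhoon", "stocks", "stocks", "stocks"],
}

intra, inter = keyword_statistics(programs)
# Keywords shared by at least two programs are candidate common topics.
common = sorted(kw for kw, n in inter.items() if n >= 2)
print(common)
```

In this sample, "stocks" has the highest count within a single program but appears in only one program, whereas "typhoon" appears in all three. A plain frequency ranking would not separate these two cases.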
Moreover, when using keywords as an abstract of information, if keyword extraction is performed simply on the basis of the frequency of occurrence, there arises the problem that many keywords relating to similar things are extracted and relations between keywords are not clear, with the result that the information obtained is short on substance for the number of keywords extracted. For example, when keywords were actually extracted in order of frequency from seven teletext news programs being broadcast in the same time slot, the results shown in FIG. 15 were obtained, which shows the seven highest-use keywords. Parts (a) and (b) in FIG. 15 show the results of experiments conducted on news received at different times. As can be seen from these results, keyword extraction based simply on the frequency of occurrence has a problem as an information abstract. For example, in part (a), the first keyword and the sixth keyword both relate to the same topic, but this cannot be recognized from the simple listing of keywords shown in FIG. 15. A technique is therefore needed that avoids doubly extracting keywords having similar meanings and that explicitly indicates association between keywords in an information abstract. One approach to this need is to prepare dictionary information describing keyword meanings and keyword associations, as practiced in the conventional natural language processing technique; however, when practical problems are considered, preparing a large volume of dictionary information presents a problem in terms of cost, and the dictionary information to be prepared in advance must be reduced as much as possible. Furthermore, in the case of teletext news programs, preparing a dictionary in advance is itself difficult since a large number of proper nouns are used. It is therefore necessary to provide a technique that automatically deduces association between keywords without using a dictionary and that abstracts information by also taking association between keywords into account.
The above-mentioned techniques are necessary not only for teletext broadcasts but for character string data
