Reexamination Certificate
1998-01-09
2001-02-06
Tsang, Fan (Department: 2748)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S251000, C704S240000, C704S236000, C707S793000
Reexamination Certificate
active
06185531
ABSTRACT:
FIELD OF INVENTION
This invention generally relates to automatically relating a story to a number of topics. Herein "story" is defined to include any article of information that may be presented to the inventive method, usually from a stored file. More particularly, the present invention relates to automatically indexing and associating stories with a number of concurrent topics, ranging from broad to very detailed. The present invention is particularly well suited to be combined with voice-to-text converters, optical readers, and fax-encoded data, wherein the story (text) is automatically indexed to one or more topics or categories.
BACKGROUND OF INVENTION
Electronic news sources, including radio, television, cable, and the Internet, have vastly increased the amount of information available to the public. Government agencies and businesses depend on this information to make informed decisions. In order to access a particular type of information quickly, it is important that the information be indexed according to topic type or subject. Topics of interest may be broadly defined, for example, the U.S. President, or more narrowly defined, for example, the U.S. President's trip to Russia. Manual sorting methods require excessive time and provide only a limited number of topics with limited scope.
Automatic methods have been developed that transcribe, index, and relate each story to one or more topics. These techniques typically model a topic by counting the number of times each word is used within stories on a known topic. To classify a new story, the relative frequencies under each topic of all the words in that story are multiplied together, and the topic with the highest product is selected as the "correct" topic (a simplified sketch of this scoring scheme follows the citation below). A limitation of such methods is that most words in a story are not "about" the topic, but are just general words. In addition, real stories have several topics, and these prior art methods assume that each word is related to all the various topics. In particular, classifying all words in a story creates a limitation because a keyword (a word that is related to a topic) for one topic becomes, in effect, negative evidence when it is classified (albeit with low probability) under another legitimate topic. This is a particular limitation of the prior art techniques. The result is that these prior art techniques have limited ability to discriminate among the various topics, which in turn limits the accuracy with which stories can be indexed to any particular topic. One example of a prior art technique for automatic story indexing against subject topics is described in a paper entitled,
Application of Large Vocabulary Continuous Speech Recognition to Topic and Speaker Identification Using Telephone Speech,
by Larry Gillick, et al. from the Proceedings ICASSP-93, Vol. II, pages 471-474, 1993.
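For concreteness, the scoring scheme described above can be sketched as follows in Python. This is a minimal illustration, not the cited system's implementation; the function names, the smoothing floor, and the input format are assumptions made for the example.

import math
from collections import Counter

def train_topic_models(labeled_stories):
    # labeled_stories: iterable of (topic, list_of_words) pairs, one per
    # training story on a known topic (hypothetical input format).
    topic_word_counts = {}
    for topic, words in labeled_stories:
        counts = topic_word_counts.setdefault(topic, Counter())
        counts.update(words)
    return topic_word_counts

def classify(story_words, topic_word_counts, floor=1e-6):
    # Score every topic by the product of per-word relative frequencies,
    # summed in log space to avoid underflow, and return the best topic.
    # Note that a keyword for one topic still contributes a (typically very
    # negative) log term to every other topic's score -- the limitation
    # discussed above.
    best_topic, best_score = None, float("-inf")
    for topic, counts in topic_word_counts.items():
        total = sum(counts.values())
        score = 0.0
        for w in story_words:
            rel_freq = counts.get(w, 0) / total
            score += math.log(rel_freq if rel_freq > 0 else floor)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic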
It is an object of the present invention to provide a method that acknowledges that there are generally several topics within any given story, and that any particular word need not be related to all of those topics, while still providing for multiple topics and their related words.
It is another, related object of the present invention to recognize that many, if not most, words used in a story are not related to any topic but are used in a general sense.
An object of the present invention is to provide a method where there is reduced overlap between the various topics within any one story.
It is yet another object of the present invention to provide a method of improved accuracy of topic identification.
It is yet another object of the present invention to improve topic identification by automatically determining which keywords relate to which topics, and then using those keywords as positive evidence for their respective topics, but not as negative evidence for other topics.
SUMMARY OF THE INVENTION
The foregoing objects are met in a superior method for indexing and relating topics to a story that includes representing each topic by a probability that any word in a story is related to that particular topic. However, the present method recognizes that not all words in a particular story are related to a given topic or topics within that story; some words are not related to any topic. The present method provides a general topic category for those words, and hereinafter the term "subject topic" is used to distinguish the topics of interest from this general topic category. Usually, most of the words in a story fall into the general topic category, thereby reducing the number of words that are keywords related to any given subject topic. This reduction simplifies and enhances the efficiency of finding keywords related to the various subject topics. In a preferred embodiment, the subject topics for a story are modeled as states within a Hidden Markov Model, with probabilities that story words are related to the subject topics.
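One way to read this modeling choice is as a per-story mixture: each word in a story is produced either by one of that story's subject topics or by the general topic. The Python sketch below illustrates that reading only; the identifiers (word_given_topic, topic_prior) and the tiny probability floor are assumptions for illustration, not the patent's own formulation of the Hidden Markov Model.

import math

GENERAL = "General Language"

def story_log_likelihood(story_words, subject_topics, word_given_topic, topic_prior):
    # Log-likelihood of a story's words under a candidate set of subject
    # topics plus the catch-all General Language topic.  Each word is
    # explained as a weighted mixture over only the topics active in this
    # story, so a keyword raises the score through its own topic without
    # counting as negative evidence for the story's other topics.
    topics = list(subject_topics) + [GENERAL]
    z = sum(topic_prior[t] for t in topics)            # renormalize priors
    weights = {t: topic_prior[t] / z for t in topics}  # mixture weights
    loglik = 0.0
    for w in story_words:
        p = sum(weights[t] * word_given_topic[t].get(w, 1e-9) for t in topics)
        loglik += math.log(p)
    return loglik

Candidate sets of subject topics can then be compared by this score, together with their prior probabilities, to decide which topics to index a new story under.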
The inventive method requires a training set of stories, where each story has been associated with several subject topics by a human annotator, based on the content within the story. Some of the subject topics may be very broad, like “money” or “politics” while others may be very detailed, like “U.S. Foreign Aid to Mexico” or “Election Campaign in Japan”, and others may be names of specific people or companies or locations.
The inventive method compiles a list of (the union of) all topics used for all the training stories. The topic “General Language” is an added topic for every story. Then, by examining all the stories, words, and topics, the inventive method determines the probability that each word will be used for each topic, as well as the prior probabilities of each of the subject topics and sets of subject topics. This is accomplished using the Expectation-Maximization (EM) algorithm in an iterative fashion to maximize the likelihood of the words in the stories, given the subject topics.
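As a rough illustration of this estimation step, the following Python sketch performs one EM iteration under the per-story mixture reading used earlier. The identifiers and the smoothing floor are assumptions made for the example; in practice the step is repeated until the likelihood of the training stories stops improving.

from collections import defaultdict

def em_step(training_stories, word_given_topic, topic_prior):
    # training_stories: iterable of (list_of_words, annotated_subject_topics).
    # Re-estimates P(word | topic) and the topic priors from expected counts.
    word_counts = defaultdict(lambda: defaultdict(float))
    topic_counts = defaultdict(float)
    for words, subject_topics in training_stories:
        topics = list(subject_topics) + ["General Language"]
        for w in words:
            # E-step: posterior that each of this story's topics produced w.
            scores = {t: topic_prior[t] * word_given_topic[t].get(w, 1e-9)
                      for t in topics}
            z = sum(scores.values())
            for t, s in scores.items():
                gamma = s / z
                word_counts[t][w] += gamma
                topic_counts[t] += gamma
    # M-step: renormalize the expected counts into probabilities.
    new_word_given_topic = {t: {w: c / topic_counts[t] for w, c in wc.items()}
                            for t, wc in word_counts.items()}
    total = sum(topic_counts.values())
    new_topic_prior = {t: c / total for t, c in topic_counts.items()}
    return new_word_given_topic, new_topic_prior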
REFERENCES:
patent: 5625748 (1997-04-01), McDonough
patent: 5687364 (1997-11-01), Saund
patent: 5918236 (1999-06-01), Wical
Larry Gillick et al., "Application of Large Vocabulary Continuous Speech Recognition to Topic and Speaker Identification Using Telephone Speech," Proc. of 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Minneapolis, MN, pp. 471-474, Apr. 27-30, 1993.
R.C. Rose et al., “Techniques for Information Retrieval from Voice Messages,” Proc. ICASSP-91, Toronto, Canada, pp. 317-320, May 1991.
John Makhoul and Richard Schwartz, “State of the Art in Continuous Speech Recognition,” Proc. Natl. Acad. Sci. USA, vol. 92, pp. 9956-9963, Oct. 1995.
Barbara Peskin et al., "Improvements in Switchboard Recognition and Topic Identification," Proc. 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Atlanta, GA, May 7-10, 1996, pp. 303-306.
Imai Toru
Schwartz Richard M.
GTE Internetworking Incorporated
Sax Robert Louis
Suchyta Leonard Charles
Tsang Fan