Method and system for off-line detection of textual topical...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C707S793000, C704S009000

active

06529902

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to off-time (e.g., off-line) topic detection, and more particularly to the problem of detecting topical changes and identifying topics in texts off-line, for use in such practical applications as improving automatic speech recognition and machine translation.
2. Description of the Related Art
Conventional methods exist for the above-mentioned topic identification problem. However, prior to the present invention, no suitable method has been employed, even though such a method can be useful in various off-line tasks such as textual data mining, automatic speech recognition, machine translation, etc. This problem requires the solution of a segmentation task that is of independent interest.
There exist some methods for dealing with the problem of text segmentation. In general, the conventional approaches fall into two classes:
1) Content-based methods, which look at topical information such as n-grams or IR similarity measures; and
2) Structure or discourse-based methods, which attempt to find features that characterize story openings and closings.
Several approaches to segmentation are described in Jonathan P. Yamron, “Topic Detection and Tracking Segmentation Task”, Proceedings of the Topic Detection and Tracking Workshop, University of Maryland, October 1997. This paper describes a content-based approach that exploits the analogy to speech recognition, allowing segmentation to be treated as a Hidden Markov Model (HMM) process.
More precisely, in this approach the following concepts are used: 1) stories are interpreted as instances of hidden underlying topics; 2) the text stream is modeled as a sequence of these topics, in the same way that an acoustic stream is modeled as a sequence of words or phonemes; and 3) topics are modeled as simple unigram distributions.
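As a rough illustration of the third concept, the sketch below models each topic as a smoothed unigram distribution and scores a short story under every topic. The topic names, the toy training sentences, the add-one smoothing, and the helper names `train_unigram` and `log_likelihood` are all assumptions made for this example; they are not taken from the cited paper.

```python
import math
from collections import Counter

def train_unigram(texts, vocab, alpha=1.0):
    """Estimate an add-alpha smoothed unigram distribution over vocab."""
    counts = Counter(w for t in texts for w in t.split() if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def log_likelihood(words, model, floor=1e-6):
    """Log-probability of a word sequence under a unigram model."""
    return sum(math.log(model.get(w, floor)) for w in words)

# Toy training data (hypothetical topics and sentences).
topic_texts = {
    "finance": ["stocks fell as the bank raised rates", "the market and the bank"],
    "sports": ["the team won the game", "the coach praised the team"],
}
vocab = {w for texts in topic_texts.values() for t in texts for w in t.split()}
models = {name: train_unigram(texts, vocab) for name, texts in topic_texts.items()}

# Score one story under every topic and pick the most likely one.
story = "the bank raised rates".split()
scores = {name: log_likelihood(story, m) for name, m in models.items()}
best_topic = max(scores, key=scores.get)
print(best_topic, scores)
```

A full HMM segmenter would additionally model transitions between topics; this sketch covers only the per-topic unigram scoring.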
Several approaches for topic detection have been described in a workshop (e.g., see DARPA, Broadcast News Translation and Understanding Workshop, Feb. 8-11, 1998). Some of them (e.g., “Japanese Broadcast News Transcription and Topic Detection”, Furui, et al., in the same DARPA workshop) require all words in an article to be presented in order to identify the topic of the article. A typical approach for topic identification is to use key words for a topic and to count the frequencies of those key words (see, for example, Furui, et al., cited above).
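The key-word counting approach mentioned above can be sketched as follows. The key-word lists and the function name `keyword_topic` are hypothetical and serve only to show why the whole article must be available before a label can be assigned.

```python
from collections import Counter

# Hypothetical key-word lists, chosen only for illustration.
TOPIC_KEYWORDS = {
    "finance": {"bank", "stocks", "rates", "market"},
    "sports": {"team", "game", "coach", "score"},
}

def keyword_topic(text, topic_keywords=TOPIC_KEYWORDS):
    """Label a complete article by the topic whose key words occur most often."""
    counts = Counter(text.lower().split())
    hits = {topic: sum(counts[k] for k in kws) for topic, kws in topic_keywords.items()}
    best = max(hits, key=hits.get)
    # Return no label at all if none of the key words appear.
    return (best if hits[best] > 0 else None), hits

article = "The bank raised rates and the market fell while stocks slid"
label, hits = keyword_topic(article)
print(label, hits)
```

Words outside the key-word lists contribute nothing to the decision, which is the limitation discussed later in this section.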
Recently, a method for realtime topic detection that is based on a likelihood ratio was described in “Real time detection of textual topical changes and topic identification via likelihood based methods”, Kanevsky, et al., commonly-assigned U.S. patent application Ser. No. 09/124,075, filed on Jul. 29, 1998, incorporated herein by reference.
However, the above-mentioned methods have not been very successful in detection of the topical changes present in the data.
For example, model-based segmentation and metric-based segmentation rely on thresholding of measurements, which lacks stability and robustness. In addition, model-based segmentation does not generalize to unseen textual features. Textual segmentation via hierarchical clustering is problematic in that it is often difficult to determine the number of clusters of words to be used in the initial phase.
All of these methods lead to a relatively high segmentation error rate and, as a consequence, to confusing/confusable topic labeling. There are no descriptions of how confusability in topic identification could be resolved when topic labeling is needed for such application tasks as text mining, or for improving a language model in off-line automatic speech recognition decoding or machine translation.
Concerning known topical identification methods, one of their deficiencies is that they are not suitable for realtime tasks since they require all data to be presented.
Another deficiency is their reliance on several key words for topic detection. This makes realtime topic detection difficult since key words are not necessarily present at the onset of the topic. Thus, the sample must be processed to near its conclusion before topic detection becomes possible.
Yet another problem with “key words” is that a different topic affects not only the frequencies of key words but also the frequencies of other (non-key) words. Exclusive use of key words does not allow one to measure the contribution of other words in topic detection.
Concerning the “cumulative sum” (CUSUM)-based methods that are described in the above-mentioned U.S. patent application Ser. No. 09/124,075, since these methods are realtime-based, they use relatively short segments to produce probability scores to establish changes in a likelihood ratio. These methods also must use various stopping criteria in order to abandon a current line of segmentation and identification. This can also lead to detecting topic changes too late or too early.
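For readers unfamiliar with CUSUM, the following is a minimal sketch of the textbook one-sided CUSUM test applied to a word stream. It is not claimed to be the procedure of the referenced application; the two unigram models, the threshold value, and the function name `cusum_change_point` are assumptions made for illustration.

```python
import math

def cusum_change_point(words, p_old, p_new, threshold, floor=1e-6):
    """Textbook one-sided CUSUM on per-word log-likelihood ratios.

    Returns (alarm_index, estimated_change_index), or None if no alarm is raised.
    """
    s = 0.0
    last_reset = 0  # first index after the statistic was last at zero
    for i, w in enumerate(words):
        llr = math.log(p_new.get(w, floor)) - math.log(p_old.get(w, floor))
        s = max(0.0, s + llr)  # accumulate evidence, never dropping below zero
        if s == 0.0:
            last_reset = i + 1
        elif s > threshold:
            return i, last_reset
    return None

# Hypothetical two-word vocabulary with opposite preferences per topic.
p_old = {"a": 0.8, "b": 0.2}
p_new = {"a": 0.2, "b": 0.8}
words = ["a"] * 4 + ["b"] * 4

print(cusum_change_point(words, p_old, p_new, threshold=3.0))
```

In this toy run the alarm is raised a few words after the actual change, which illustrates the detection delay that a threshold introduces; a lower threshold shortens the delay at the cost of more false alarms.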
Another problem with existing methods is that they tend to be extremely computing-intensive, resulting in an extremely high burden on the supporting hardware.
SUMMARY OF THE INVENTION
In view of the foregoing and other problems, disadvantages, and drawbacks of the conventional methods, an object of this invention is to provide an off-line segmentation of textual data that uses change-point methods.
Another object of the present invention is to perform off-line topic identification of textual data.
Yet another object of the present invention is to provide improved language modeling for off-line automatic speech decoding and machine translation.
In a first aspect of the invention, a system (and method) for off-line detection of textual topical changes includes at least one central processing unit (CPU), at least one memory coupled to the at least one CPU, a network connectable to the at least one CPU, and a database, stored on the at least one memory, containing a plurality of textual data sets of topics. The CPU executes first and second processes in forward and reverse directions, respectively, for extracting a segment having a predetermined size from a text, computing likelihood scores of the text in the segment for each topic, computing likelihood ratios, comparing them to a threshold, and determining whether to declare a change point at the current last word in the window.
In a second aspect, a method of detecting topical changes in a textual segment includes evaluating text probabilities under each topic of a plurality of topics, and selecting a new topic when one of the text probabilities becomes larger than the others of the text probabilities, wherein the topic detection is performed off-line.
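The forward and reverse processes of the first aspect can be pictured with the simplified sketch below. It slides a fixed-size window over the text, scores the window under each topic, and declares a change at the window's last word when the log-likelihood ratio of the best rival topic to the current topic exceeds a threshold. The window size, threshold, unigram models, and the names `window_scan` and `forward_and_reverse` are assumptions for illustration, not the claimed implementation; at least two topics are required.

```python
import math

def window_scan(words, models, start_topic, window=4, threshold=2.0, floor=1e-6):
    """Scan left to right; return a list of (index_of_last_word_in_window, new_topic)."""
    def loglik(seg, model):
        return sum(math.log(model.get(w, floor)) for w in seg)

    current, changes = start_topic, []
    for end in range(window, len(words) + 1):
        seg = words[end - window:end]
        scores = {t: loglik(seg, m) for t, m in models.items()}
        rival = max((t for t in scores if t != current), key=scores.get)
        # Log-likelihood ratio of the best rival topic against the current topic.
        if scores[rival] - scores[current] > threshold:
            changes.append((end - 1, rival))
            current = rival
    return changes

def forward_and_reverse(words, models, first_topic, last_topic, **kw):
    """Run the scan in both directions; reverse indices are mapped back to the original text."""
    n = len(words)
    fwd = window_scan(words, models, first_topic, **kw)
    rev = [(n - 1 - i, t) for i, t in window_scan(words[::-1], models, last_topic, **kw)]
    return fwd, rev

# Hypothetical topics "A" and "B"; the true change is at index 6.
models = {"A": {"a": 0.8, "b": 0.2}, "B": {"a": 0.2, "b": 0.8}}
words = ["a"] * 6 + ["b"] * 6
fwd, rev = forward_and_reverse(words, models, first_topic="A", last_topic="B")
print(fwd, rev)
```

In this toy run the forward pass declares the change after the true boundary and the reverse pass declares it before, so the two passes bracket the boundary between them.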
In a third aspect, a storage medium is provided storing the inventive method.
The present invention solves the problem of detecting topical changes via application of “cumulative sum” (CUSUM)-based methods. The basic idea of the topic identification procedure is to evaluate text probabilities under each topic, and then to select a new topic when one of those probabilities becomes significantly larger than the others.
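The basic identification idea can be sketched as a running comparison of per-topic log-probabilities, with a topic selected once it leads the runner-up by a fixed margin. The margin, the toy unigram models, and the function name `identify_topic` are assumptions for this example; the source only says that one probability must become "significantly larger" than the others, without fixing how that is measured.

```python
import math

def identify_topic(words, models, margin=3.0, floor=1e-6):
    """Return (topic, words_consumed) once one topic leads the runner-up by `margin`
    in log-probability, or (None, len(words)) if no topic ever does."""
    totals = {t: 0.0 for t in models}
    for i, w in enumerate(words):
        for t, m in models.items():
            totals[t] += math.log(m.get(w, floor))
        ranked = sorted(totals, key=totals.get, reverse=True)
        if totals[ranked[0]] - totals[ranked[1]] > margin:
            return ranked[0], i + 1
    return None, len(words)

# Hypothetical unigram models for two topics.
topic_models = {
    "finance": {"the": 0.3, "bank": 0.3, "rates": 0.3, "team": 0.05, "game": 0.05},
    "sports": {"the": 0.3, "team": 0.3, "game": 0.3, "bank": 0.05, "rates": 0.05},
}
result = identify_topic("the bank the rates the team".split(), topic_models)
print(result)
```

Unlike key-word counting, every word contributes to the score here, including words such as "the" that are equally likely under both topics and therefore leave the margin unchanged.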
Since the topic detection is performed off-line, the inventive method can be enhanced by producing several different topic labelings using different labeling strategies. One such labeling strategy is to mark topics moving from the end of a text to its beginning. The special topic labeling is then chosen via evaluation of several pieces of evidence and factors that lead to different topic labelings. This special topic labeling can be applied to produce new topic scores that are needed for improving a language model in off-line automatic speech recognition decoding or machine translation.
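The text here does not spell out how the different labelings are weighed against each other. Purely as an illustration, the sketch below reconciles a forward labeling and an end-to-beginning labeling with one possible rule: where the two agree the label is kept, and where they disagree the span is split at the point that maximizes total log-likelihood, with the forward label before the split and the reverse label after it. This rule, the unigram models, and the name `reconcile` are assumptions, not the patented procedure.

```python
import math
from itertools import groupby

def reconcile(words, fwd_labels, rev_labels, models, floor=1e-6):
    """Merge two per-word labelings into one (assumed maximum-likelihood split rule)."""
    def loglik(seg, topic):
        return sum(math.log(models[topic].get(w, floor)) for w in seg)

    final, start = [], 0
    # Group consecutive positions that share the same (forward, reverse) label pair.
    for (f, r), group in groupby(zip(fwd_labels, rev_labels)):
        n = len(list(group))
        seg = words[start:start + n]
        if f == r:
            final.extend([f] * n)
        else:
            # Forward passes tend to switch late and reverse passes early, so the
            # boundary is sought inside the span where the two labelings disagree.
            k = max(range(n + 1), key=lambda k: loglik(seg[:k], f) + loglik(seg[k:], r))
            final.extend([f] * k + [r] * (n - k))
        start += n
    return final

# Hypothetical topics and labelings; the true change is at index 6.
models = {"A": {"a": 0.8, "b": 0.2}, "B": {"a": 0.2, "b": 0.8}}
words = ["a"] * 6 + ["b"] * 6
fwd_labels = ["A"] * 9 + ["B"] * 3   # forward pass switched late
rev_labels = ["A"] * 3 + ["B"] * 9   # reverse pass switched early
merged = reconcile(words, fwd_labels, rev_labels, models)
print(merged)
```

Other kinds of evidence, such as segment length or agreement among more than two labelings, could be folded into the same selection step.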
Thus, the present invention can perform off-time topic detection by performing several steps. That is, the steps include segmentation of textual data into “homogenous” segments and topic (event) identification of a current segment using d
