Apparatus and method for controlling rate of playback of audio data

Patent number: 06490553
Type: Reexamination Certificate (active)
Filed: 2001-02-12
Issued: 2002-12-03
Examiner: Knepper, David D. (Department: 2654)
Classification: Data processing: speech signal processing, linguistics, language – Speech signal processing – For storage or transmission
US Classes: C704S236000, C704S267000, C704S503000
ABSTRACT:
BACKGROUND OF THE INVENTION
Many challenges exist in the efficient production of closed captions or, more generally, time-aligned transcripts. Closed captions are textual transcriptions of the audio track of a television program, similar to the subtitles of a movie.
A closed caption (or CC) is typically a triplet: a sentence, a time value, and a duration. The time value determines when the caption is displayed on the screen, and the duration determines when it is removed. Closed captions are produced either off-line or on-line. Off-line closed captions are edited and precisely time-aligned by an operator so that they appear on the screen at the exact moment the words are spoken. On-line closed captions are generated live, for instance during television newscasts.
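For illustration, such a caption triplet can be represented as a simple record; this sketch and its field names are illustrative, not drawn from the patent:

```python
from dataclasses import dataclass

@dataclass
class ClosedCaption:
    """One caption triplet: the sentence, when to show it, how long to keep it."""
    sentence: str      # transcribed text to display
    time_value: float  # seconds into the program at which the caption appears
    duration: float    # seconds after which the caption is removed

# Show a caption 12.5 s into the program and remove it 3 s later.
cc = ClosedCaption("Good evening, and welcome.", time_value=12.5, duration=3.0)
print(f"[{cc.time_value:.1f}s +{cc.duration:.1f}s] {cc.sentence}")
```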
Captions can be displayed on the screen in different styles: pop-on, roll-up, or paint-on. Pop-on closed captions appear and disappear all at once; because they require precise timing, they are created after production of the program. Roll-up closed captions scroll up within a window of three or four lines. This style is typically used for live broadcasts, such as news, where an operator enters the caption content live on a stenotype keyboard. Paint-on captions are similar in style to pop-on captions, except that they are painted over the existing captions, one character at a time.
Captioning a video program is a time-consuming process that costs approximately $1,000 per hour of programming. That price covers the whole service: transcription, time alignment, and the text editing that makes the captions comfortable to read.
The number of closed-captioned programs has increased dramatically in the United States because of new federal laws:
The landmark Americans with Disabilities Act (ADA) of 1990, which makes broadcasts accessible to the deaf and hard-of-hearing;
FCC Order #97-279, which requires that 95% of all new broadcast programming be closed captioned by 2006; and
The TV Decoder Circuitry Act, which requires all televisions 13 inches or larger sold in the United States to have a built-in closed-caption decoder.
Legislation in several other countries likewise requires television programs to be captioned. In addition, digital video disks (DVDs) carry multi-lingual versions of a movie and often require subtitles in more than one language. Because of these changes in legislation and the new video formats, demand for captioning and subtitling has increased tremendously.
The current systems used to produce closed captions are fairly primitive. They focus mostly on formatting the text into captions, synchronizing the captions with the video, and encoding the final videotape. The text must be transcribed first, or at best imported from an existing file. Transcription is done in one of several ways: the typist can use a PC with a standard keyboard, or a stenotype keyboard such as those used by court reporters. At this point in the process, the timing information has been lost and must be rebuilt. The closed captions are then made from the transcription by splitting the text manually in a word processor. This segmentation may be based on punctuation or determined by the operator; either way, the breaks reflect nothing about how the text was actually spoken unless the operator listens to the tape while proceeding. The closed captions are then positioned on the screen and their style (italics, colors, uppercase, etc.) is defined; they may appear at different locations depending on what is already on the screen. Next, the captions are synchronized with the audio: the operator plays the video and hits a key as soon as the first word of each caption has been spoken. Finally, the captions are encoded onto the videotape using a caption encoder.
In summary, the current industry systems work as follows:
Import the transcription from a word processor, or input the text with a built-in word processor;
Break lines manually to delimit the closed captions;
Position the captions on the screen and define their style;
Time-mark the closed captions manually while the audio track plays;
Generate the final captioned videotape.
Thus, improvements are desired.
SUMMARY OF THE INVENTION
The parent invention provides an efficient system for producing off-line closed captions (i.e., time-aligned transcriptions of a source audio track). Generally, that process includes:
1. classifying the audio and selecting spoken parts only, generating non-spoken captions if required;
2. transcribing the spoken parts of the audio track by using an audio rate control method;
3. adding time marks to the transcription text using time of event keystrokes;
4. precisely re-aligning the transcription with the original audio track; and
5. segmenting transcription text into closed captions.
The present invention is directed to the audio rate control method of step 2 and, in particular, provides a method and apparatus for controlling the rate of playback of audio data. The rate of speech of the audio data is determined, preferably using speech recognition. The determined rate of speech is compared to a target rate, and based on the comparison the playback rate is adjusted, i.e., increased or decreased, to match the target rate.
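A minimal sketch of this compare-and-adjust step, assuming the speech rate is measured in words per minute (the function, its parameters, and the smoothing gain are hypothetical illustrations, not the patent's method):

```python
def adjust_playback_rate(playback_rate: float,
                         measured_wpm: float,
                         target_wpm: float,
                         gain: float = 0.5,
                         min_rate: float = 0.5,
                         max_rate: float = 2.0) -> float:
    """Return a new playback rate that moves the heard speech rate toward
    target_wpm. measured_wpm is the speech rate observed at the current
    playback rate (e.g., estimated by a speech recognizer)."""
    if measured_wpm <= 0:
        return playback_rate  # silence or no estimate: leave the rate alone
    # The rate that would make the heard speech match the target exactly.
    desired = playback_rate * (target_wpm / measured_wpm)
    # Apply only a fraction of the correction so that noisy rate
    # estimates do not make the playback speed oscillate.
    new_rate = playback_rate + gain * (desired - playback_rate)
    return max(min_rate, min(max_rate, new_rate))

# Speech is heard at 220 wpm but the transcriber types at 150 wpm:
# one smoothing step moves 1.0x playback to about 0.84x, and repeated
# calls converge toward roughly 0.68x.
print(adjust_playback_rate(1.0, measured_wpm=220.0, target_wpm=150.0))
```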
The target rate may be predefined, or it may be indicative of the rate of transcription by a transcriber.
The playback rate is adjusted without changing the pitch of the corresponding speech.
Time-domain or frequency-domain techniques may be employed to adjust the playback rate. The time-domain techniques may include interval sampling and/or silence removal.
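As an illustrative sketch of one such time-domain technique, silence removal, the following energy-based pass drops low-energy frames beyond a short grace period; the frame size and threshold are assumptions, not values from the patent:

```python
import numpy as np

def remove_silence(samples: np.ndarray, sample_rate: int,
                   frame_ms: float = 20.0, rms_threshold: float = 0.01,
                   keep_ms: float = 60.0) -> np.ndarray:
    """Drop low-energy frames past a short grace period. Assumes float
    samples normalized to [-1, 1]. Removing whole frames of silence
    shortens the audio without altering the pitch of the speech kept."""
    frame_len = int(sample_rate * frame_ms / 1000)
    keep_frames = int(keep_ms / frame_ms)  # length of pause to preserve
    kept, quiet_run = [], 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
        if rms < rms_threshold:
            quiet_run += 1
            if quiet_run > keep_frames:
                continue  # silence beyond the grace period is dropped
        else:
            quiet_run = 0
        kept.append(frame)
    return np.concatenate(kept) if kept else samples[:0]
```

Keeping the first 60 ms of each pause preserves natural phrasing; interval sampling or pitch-preserving time-scale modification (e.g., the synchronized overlap-add method cited below) would operate on the time-domain signal in a similar frame-by-frame fashion.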
REFERENCES:
patent: 3553372 (1971-01-01), Wright et al.
patent: 4841387 (1989-06-01), Rindfuss
patent: 4924387 (1990-05-01), Jeppesen
patent: 5564005 (1996-10-01), Weber et al.
patent: 5649060 (1997-07-01), Ellozy et al.
patent: RE35658 (1997-11-01), Jeppesen
patent: 5737725 (1998-04-01), Case
patent: 5748499 (1998-05-01), Trueblood
patent: 5793948 (1998-08-01), Asahi et al.
patent: 5828994 (1998-10-01), Covell et al.
patent: 5835667 (1998-11-01), Wactlar et al.
patent: 6023675 (2000-02-01), Bennett et al.
patent: 6076059 (2000-06-01), Glickman et al.
patent: 6161087 (2000-12-01), Wightman et al.
patent: 6181351 (2001-01-01), Merrill et al.
patent: 6185329 (2001-02-01), Zhang et al.
patent: 6205420 (2001-03-01), Takagi et al.
patent: 6260011 (2001-07-01), Heckerman et al.
patent: 6263507 (2001-07-01), Ahmad et al.
Covell et al., "MACH1: Nonuniform Time-Scale Modification of Speech," 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, May 1998, vol. 1, pp. 349-352.
Hejna, D.J., Jr., "Real-Time Time-Scale Modification of Speech via the Synchronized Overlap-Add Algorithm," unpublished master's thesis, Massachusetts Institute of Technology, 1990.
Hain, T., et al., "Segment Generation and Clustering in the HTK Broadcast News Transcription System," Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.
Siegler, M.A., et al., "Automatic Segmentation, Classification and Clustering of Broadcast News Audio," Proc. DARPA Speech Recognition Workshop, 1997.
Siegler, M.A., et al., "On the Effects of Speech Rate in Large Vocabulary Speech Recognition Systems," Proc. ICASSP, May 1995.
Campbell, W.N., "Extracting Speech-Rate Values from a Real-Speech Database," Proc. ICASSP, Apr. 1988.
Miller, G.A., et al., "The intelligibility of interrupted speech," Journal of the Acoustical Society of America, 22(2):167-173, 1950.
David, E.E., et al., "Note on pitch-synchronous processing of speech," Journal of the Acoustical Society of America, 28(7):1261-1266, 1956.
Neuberg, E.E., "Simple pitch-dependent algorithm for high quality speech rate changing," Journal of the Acoustical Society of America, 63(2):624-625, 1978.
Roucos, S., et al., "High quality time-scale modification for speech," Proc. of the International Conference on Acoustics, Speech and Signal Processing, pp. 493-496, IEEE, 1985.
Malah, D., "Time-domain algorithms for harmonic bandwidth reduction and time scaling of speech signals," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-27.
Inventors: Davis Pan; Jean-Manuel Van Thong
Assignee: Compaq Information Technologies Group, L.P.
Attorney: Hamilton, Brook, Smith & Reynolds, P.C.
Examiners: David D. Knepper; Martin Lerner