Scalable mixing for speech streaming

Data processing: speech signal processing – linguistics – language – synthesis

Reexamination Certificate


Details

Classifications: C370S267000, C379S202010

Type: Reexamination Certificate

Status: active

Patent number: 06230130

ABSTRACT:

FIELD OF THE INVENTION
The invention relates to a method and a system for processing audio, using mixing of multiple concurrent streams of audio data. The invention relates in particular, but not exclusively, to the mixing of multiple concurrent streams of speech data.
BACKGROUND ART
Artificial processing of speech typically uses a digital representation of the data because of its robustness against distortion. Digital processing further allows streaming of data. Streaming enables audio data, such as speech data, to be compressed on the fly so that real-time communication is possible, instead of requiring the user to wait for a file, or a portion of it, to download before it can be accessed. For an introduction to speech processing, see, e.g., Speech Coding and Synthesis, edited by W. B. Kleijn and K. K. Paliwal, Elsevier, 1995, especially pp. 1-47, incorporated herein by reference.
Mixing of speech streams is required at a receiver when multiple speech streams must be rendered and played out through a single audio device. Mixing of speech streams is also desired at an intermediate point in the transmission path (e.g., at a server in a client-server architecture) when multiple speech streams are available that are to be combined into a single stream or into a reduced number of streams for retransmission to a particular receiver or to a group of receivers.
Mixing of multiple streams at the receiver requires the decoded streams to be rendered to produce the signals that are to be played out of the loudspeakers. The rendering function for each stream is defined by the application, and can range from simple duplication for monophonic reproduction through a set of two loudspeakers, to a complicated transfer function for providing loudspeaker compensation and for spatial localization of each sound source.
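The rendering step described above can be pictured as mapping each decoded mono stream onto the loudspeaker channels. The following minimal sketch (names are illustrative, not taken from the patent) covers only the simplest case mentioned in the text, duplication with scalar per-channel gains; a full renderer would instead convolve the decoded signal with per-channel impulse responses for loudspeaker compensation and spatial localization.

```python
import numpy as np

def render(decoded, channel_gains):
    """Map one decoded mono frame to per-loudspeaker signals.

    Simplest renderer: scale the signal by a scalar gain per channel.
    A real renderer may convolve with per-channel impulse responses.
    """
    decoded = np.asarray(decoded, dtype=float)
    gains = np.asarray(channel_gains, dtype=float)
    return np.outer(gains, decoded)  # shape: (n_speakers, n_samples)

# Monophonic duplication through two loudspeakers: equal unit gains.
stereo = render([0.1, -0.2, 0.3], [1.0, 1.0])
```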
OBJECT OF THE INVENTION
It is an object of the invention to provide procedures for mixing multiple streams that reduce the processing power required with respect to existing schemes. It is another object to provide mixing procedures that reduce the bandwidth required with respect to existing schemes. It is yet another object to provide architectures that are scalable with respect to bandwidth and/or processing power.
SUMMARY OF THE INVENTION
To this end, the invention provides a method of audio processing. The method comprises mixing multiple concurrent audio streams. Each respective one of the streams comprises a respective sequence of frames. The method comprises the following steps. A subset of specific frames is selected from among the concurrent frames. Upon selection, the specific frames of the subset are decoded and rendered for producing specific signals. The specific signals are then mixed.
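The select-decode-render-mix pipeline of the method can be sketched as follows. This is an illustrative toy model, not the patented implementation: each encoded frame is represented as a hypothetical (energy parameter, samples) pair, rendering is reduced to a scalar gain, and only the k highest-energy frames are decoded.

```python
import numpy as np

# Hypothetical encoded frame: an (energy, samples) pair. Parametric
# codecs carry an energy parameter in the bitstream, so the ranking
# below does not require decoding the frame.

def select(frames, k):
    """Return indices of the k frames with the highest energy parameter."""
    return sorted(range(len(frames)), key=lambda i: -frames[i][0])[:k]

def decode(frame):
    """Toy decoder: the 'payload' is already a sample vector."""
    return np.asarray(frame[1], dtype=float)

def mix(frames, gains, k):
    """Select a subset of concurrent frames, decode and render only
    those, and sum the rendered signals into a single output."""
    out = None
    for i in select(frames, k):
        sig = gains[i] * decode(frames[i])  # trivial per-stream rendering
        out = sig if out is None else out + sig
    return out

streams = [(0.9, [1.0, 1.0]), (0.1, [5.0, 5.0]), (0.5, [2.0, 2.0])]
mixed = mix(streams, gains=[1.0, 1.0, 1.0], k=2)  # decodes streams 0 and 2
```

The loud middle stream (5.0, 5.0) is skipped because its energy parameter is lowest; in practice the parameter is read from the bitstream, so the skipped frame costs no decode cycles at all.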
Preferably, the selection criterion involves a quantity that is inherent in each of the concurrent frames. In parametric coding schemes, for example, a particular frame comprises a parameter representing the frame's energy content, or a parameter indicating whether the frame relates to voiced or unvoiced speech. Alternatively or supplementarily, a parameter indicates a pitch. Alternatively or supplementarily, the amplitudes can be retrieved and added together to create another measure. Based on these quantities, possibly after additional weighting, the concurrent frames are ranked according to importance, and the invention selects for decoding those frames that are the most important. To give a more specific example, the selection criterion may be a mathematical relationship between the energy content and a rendering gain. The rendering gain is explained as follows. The decoded streams are to be rendered to produce the signals as played out by loudspeakers. The rendering gain is a quantity that represents the effect of the rendering on the perceptual intensity of the signal source. The rendering gain can be set to anything desired by the application developer. For example, the rendering gain is set to the sum of the energy gains from the decoded signal to each of the loudspeaker signals when rendering white noise (i.e., the sum of the energies of the impulse responses of the renderer).
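The white-noise rendering gain from the example above, and a ranking that weights each frame's energy content by it, can be sketched as follows (function names are illustrative assumptions, not from the patent):

```python
import numpy as np

def rendering_gain(impulse_responses):
    """Sum of the energies of the renderer's per-loudspeaker impulse
    responses: the total energy gain from the decoded signal to the
    loudspeaker signals for a white-noise input."""
    return sum(float(np.sum(np.asarray(h, dtype=float) ** 2))
               for h in impulse_responses)

def rank_frames(frame_energies, renderer_irs):
    """Rank concurrent frames by energy content x rendering gain,
    most important first."""
    scores = [e * rendering_gain(irs)
              for e, irs in zip(frame_energies, renderer_irs)]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Stream 1 has twice the energy, but stream 0 is rendered much louder
# (unit-gain taps to both speakers vs. 0.5 gain), so it ranks first.
order = rank_frames([1.0, 2.0], [[[1.0], [1.0]], [[0.5], [0.5]]])
```

Note that a frame with high signal energy can still lose the ranking if the application renders that source quietly, which is exactly the perceptual weighting the criterion is after.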
The selection may involve respective priorities assigned to the respective streams by the user or by the application developer. The priorities are independent of perceptual considerations. The selection step then creates a subset based on the priorities alone or on a combination of rendered energy and priority.
A variety of ways can be used to define concurrence. For example, concurrence of the frames can be determined based on, e.g., time stamps. As another example, the concurrent frames are those frames that are present at the input of the selection step at the time the selecting is started. Buffering can be used to assist in the latter mode of operation to achieve temporal alignment.
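The time-stamp flavor of concurrence can be sketched with a small buffering helper (a hypothetical illustration, not the patented mechanism): frames are grouped by time stamp, and a group is released as a concurrent set once every stream has contributed. A production version would also release incomplete groups on a timeout, which is omitted here.

```python
from collections import defaultdict

class FrameAligner:
    """Buffer incoming frames and group them by time stamp.

    A group of frames is considered concurrent once all n_streams
    have contributed a frame with that time stamp; buffering thus
    provides the temporal alignment."""

    def __init__(self, n_streams):
        self.n_streams = n_streams
        self.buffers = defaultdict(dict)  # timestamp -> {stream_id: frame}

    def push(self, stream_id, timestamp, frame):
        """Add a frame; return the complete concurrent set, or None."""
        self.buffers[timestamp][stream_id] = frame
        if len(self.buffers[timestamp]) == self.n_streams:
            return self.buffers.pop(timestamp)
        return None

aligner = FrameAligner(n_streams=2)
first = aligner.push(0, 1, "frame-a")   # incomplete: returns None
both = aligner.push(1, 1, "frame-b")    # complete concurrent set
```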
Preferably, decoding is achieved through the use of sinusoidal decoder circuitry whose operation is based on overlap-add synthesis to limit audible artifacts. For sinusoidal coders see, e.g., U.S. Pat. Nos. 4,771,465 and 4,797,926, herewith incorporated by reference. For overlap-add synthesis, also in combination with sinusoidal decoders, see, e.g., U.S. Pat. No. 5,327,518, herewith incorporated by reference. Typically, obtaining the energy content of a frame is easier than decoding the entire frame. For example, a variety of coding schemes, e.g., linear-prediction coding and the aforesaid sinusoidal coding, involve the transmission of a parameter representative of the signal power or energy per frame, along with the content data; see, e.g., Kleijn and Paliwal, cited supra, Chapter 1, especially pp. 36 and 37, and aforesaid U.S. Pat. No. 4,771,465. Accordingly, the energy content of a frame is readily available in order to carry out the selection of the specific frames without the need for extra processing power.
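The overlap-add synthesis mentioned above can be illustrated in a few lines (a generic sketch, not the circuitry of the cited patents): each synthesized frame is windowed and summed into the output at hop-size offsets, so successive frames cross-fade instead of butting against each other, which limits audible artifacts at frame boundaries.

```python
import numpy as np

def overlap_add(frames, hop):
    """Overlap-add synthesis: window each frame and sum the windowed
    frames into the output at multiples of the hop size, so adjacent
    frames cross-fade across their overlap region."""
    n = len(frames[0])
    window = np.hanning(n)
    out = np.zeros(hop * (len(frames) - 1) + n)
    for k, frame in enumerate(frames):
        out[k * hop : k * hop + n] += window * np.asarray(frame, dtype=float)
    return out

# Three constant frames at 50% overlap: the Hann windows sum to a
# constant in the interior, i.e. the cross-fade is seamless.
out = overlap_add([np.ones(4)] * 3, hop=2)
```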


REFERENCES:
patent: 4771465 (1988-09-01), Bronson et al.
patent: 4797926 (1989-01-01), Bronson et al.
patent: 5327518 (1994-07-01), George et al.
patent: 5539741 (1996-07-01), Barraclough et al.
patent: 5619197 (1997-04-01), Nakamura
patent: 5646931 (1997-07-01), Terasaki
patent: 5659663 (1997-08-01), Lin
patent: 5703794 (1997-12-01), Heddle et al.
patent: 5890017 (1999-03-01), Tulkhoff et al.
patent: 5963153 (1999-10-01), Rosefield et al.
patent: 5986589 (1999-11-01), Rosefield et al.
patent: 6008838 (1999-12-01), Iizawa
patent: 6098317A (1994-04-01), None
patent: 7264570A (1995-10-01), None
patent: 8032950A (1996-02-01), None
patent: 9710674A1 (1997-03-01), None
Speech Coding and Synthesis, ed. W.B. Kleijn and K.K. Paliwal, Elsevier, 1995, especially pp. 1-47.


