Reexamination Certificate
1998-10-23
2001-02-13
Isen, Forester W. (Department: 2747)
Data processing: speech signal processing, linguistics, language
Linguistics
Natural language
C704S001000, C704S255000
Reexamination Certificate
active
06188976
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to building statistical language models that are pertinent to a specific domain or field.
Statistical language models are used heavily in speech recognition, natural language understanding and other language processing applications. Such language models are used by a computer to facilitate a language processing task, much as a human employs context to understand spoken language. For instance, a speech recognition program will use a language model to select among phonetically equivalent words such as “to”, “too” and “two” when creating a transcription.
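Purely as an illustration (not drawn from the patent itself), the following sketch shows how a toy bigram language model might prefer “to” over its homophones in a given left context; the probability table and word sequences are hypothetical.

```python
# Hypothetical sketch: choosing among the homophones "to", "too" and "two"
# by scoring each candidate word sequence under a toy bigram language model.
# All probabilities below are made up for illustration.

BIGRAM = {
    ("want", "to"): 0.60, ("want", "too"): 0.01, ("want", "two"): 0.02,
    ("to", "go"): 0.30,   ("too", "go"): 0.01,   ("two", "go"): 0.01,
}

def score(words, floor=1e-6):
    """Product of bigram probabilities over consecutive word pairs."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= BIGRAM.get((prev, cur), floor)
    return p

candidates = [["want", w, "go"] for w in ("to", "too", "two")]
print(max(candidates, key=score))  # ['want', 'to', 'go']
```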
Generally, it is impractical to construct a language model that covers an entire spoken language, including specialized and technical fields. Such a language model would require large memory storage and present a complex processing task. Hence, domain-specific language models have been developed which are tailored to a specific domain or field. For instance, a speech recognition program may be tailored specifically to medical writings, to legal writings, or to a user's spoken questions and commands during use of a particular Internet site (e.g., sports, travel), and so forth. The domain-specific language model approach conserves memory, reduces the complexity of the processing task, and reduces the word-error rate as compared to general (domain-unrestricted) language models.
Building a language model usually requires a large amount of training data, which is burdensome to obtain. By way of example, training data for the language model component of a speech recognition program geared for medical dictation may be obtained by manual, human transcription of large volumes of dictation recorded from doctors. Because this is so time-consuming, it is desirable to have a method for constructing a domain-specific language model that uses only a very small amount of training data.
A number of prior art techniques have attempted to resolve this problem by employing some form of class-based language modeling. In class-based language modeling, certain words are grouped into classes, depending on their meaning, usage or function. Examples of class-based modeling are disclosed in: Brown et al., “Class-Based N-Gram Models of Natural Language,” Computational Linguistics, Vol. 18, No. 4, pp. 467-479, 1992; and Farhat et al., “Clustering Words for Statistical Language Models Based on Contextual Word Similarity,” IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 180-183, Atlanta, May 1996. A minimal illustrative sketch of the class-based factorization appears after this paragraph.
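For illustration, the following is a hedged sketch of the class-based bigram factorization described by Brown et al., in which P(w2 | w1) is approximated by P(class(w2) | class(w1)) multiplied by P(w2 | class(w2)). The word-to-class mapping and probability values are hypothetical placeholders for quantities that would be estimated from a training corpus.

```python
# Hedged sketch of a class-based bigram: P(w2 | w1) is approximated by
# P(class(w2) | class(w1)) * P(w2 | class(w2)), following the factorization
# described in Brown et al. (1992). All mappings and values are hypothetical.

WORD_CLASS = {"monday": "DAY", "tuesday": "DAY", "paris": "CITY", "rome": "CITY"}
CLASS_BIGRAM = {("CITY", "DAY"): 0.4, ("DAY", "CITY"): 0.1}                     # P(class2 | class1)
WORD_GIVEN_CLASS = {"monday": 0.5, "tuesday": 0.5, "paris": 0.6, "rome": 0.4}   # P(word | its class)

def class_bigram_prob(w1, w2, floor=1e-6):
    """Approximate P(w2 | w1) using the class-based factorization."""
    c1, c2 = WORD_CLASS[w1], WORD_CLASS[w2]
    return CLASS_BIGRAM.get((c1, c2), floor) * WORD_GIVEN_CLASS.get(w2, floor)

print(class_bigram_prob("paris", "monday"))  # 0.2 = 0.4 * 0.5
```

Because the number of classes is much smaller than the vocabulary, far fewer parameters must be estimated, which is why class-based models need less training data.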
Other conventional methods allowing for a reduction in the requisite training data employ some form of mixture modeling and task adaptation. See, for example, Crespo et al., “Language Model Adaptation for Conversational Speech Recognition Using Automatically Tagged Pseudo-Morphological Classes,” IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2, pp. 823-826, Munich, April 1997; Iyer et al., “Using Out-of-Domain Data to Improve In-Domain Language Models,” IEEE Signal Processing Letters, Vol. 4, No. 8, pp. 221-223, August 1997; and Masataki et al., “Task Adaptation Using MAP Estimation in N-Gram Language Modeling,” IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2, pp. 783-786, Munich, April 1997.
Embodiments of the present invention to be described exhibit certain advantages over these prior art techniques as will become apparent hereafter.
SUMMARY OF THE DISCLOSURE
The present invention pertains to a method and apparatus for building a domain-specific language model for use in language processing applications, e.g., speech recognition. A reference language model is generated based on a relatively small seed corpus containing linguistic units relevant to the domain. An external corpus containing a large number of linguistic units is accessed. Using the reference language model, linguistic units of the external corpus which have a sufficient degree of relevance to the domain are extracted. The reference language model is then updated based on the seed corpus and the extracted linguistic units. The procedure can be repeated iteratively until the language model is of satisfactory quality.
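The following is a hedged sketch of one plausible reading of this procedure, not the patent's actual implementation: it assumes a simple unigram reference model, uses average log-probability under that model as the relevance measure, and fixes an arbitrary relevance threshold and iteration count, none of which are specified in the summary above.

```python
# Hedged sketch of the iterative procedure summarized above. Assumptions (not
# taken from the patent): a unigram reference model, average log-probability
# as the relevance measure, and a fixed threshold / iteration count.
import math
from collections import Counter

def train_unigram(sentences):
    """Estimate a unigram model (word -> probability) from whitespace-tokenized sentences."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def relevance(sentence, model, floor=1e-7):
    """Average log-probability of the sentence's words under the model."""
    words = sentence.split()
    return sum(math.log(model.get(w, floor)) for w in words) / max(len(words), 1)

def build_domain_lm(seed_corpus, external_corpus, threshold=-12.0, iterations=3):
    model = train_unigram(seed_corpus)                        # reference model from the seed corpus
    for _ in range(iterations):
        extracted = [s for s in external_corpus
                     if relevance(s, model) > threshold]      # keep sufficiently in-domain sentences
        model = train_unigram(seed_corpus + extracted)        # update the model and repeat
    return model

# Usage with toy corpora (illustrative only):
seed = ["the patient reports chest pain", "administer the medication twice daily"]
external = ["the patient was discharged today", "the stock market fell sharply"]
lm = build_domain_lm(seed, external)
```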
REFERENCES:
patent: 5444617 (1995-08-01), Merialdo
patent: 5613036 (1997-03-01), Strong
patent: 5640487 (1997-06-01), Lau et al.
patent: 5899973 (1999-05-01), Bandara et al.
Placeway, P., “The Estimation of Powerful Language Models From Small and Large Corpora,” IEEE 1993, pp. II-33-II-36.
Masataki et al., “Task Adaptation Using Map Estimation in N-Gram Language Modeling,” IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 783-786, Munich, Apr. 1997.
Crespo et al., “Language Model Adaptation for Conversational Speech Recognition Using Automatically Tagged Pseudo-Morphological Classes,” IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 823-826, Munich, Apr. 1997.
Farhat et al., “Clustering Words for Statistical Language Models Based on Contextual Word Similarity,” IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 180-183, Atlanta, May 1996.
Iyer et al., “Using Out-Of-Domain Data to Improve In-Domain Language Models,” IEEE Signal Processing Letters, vol. 4, No. 8, pp. 221-223, Aug. 1997.
Issar, S., “Estimation of Language Models for New Spoken Language Applications,” International Conference on Spoken Language Processing, vol. 2, pp. 869-872, Philadelphia, Oct. 1996.
Brown et al., “Class-Based n-gram Models of Natural Language,” Computational Linguistics, vol. 18, No. 4, pp. 467-479, 1992.
Gopalakrishnan Ponani S.
Printz Harry W.
Ramaswamy Ganesh N.
Edouard Patrick N.
F. Chau & Associates LLP
International Business Machines Corporation
Isen Forester W.