Reexamination Certificate
1998-10-23
2001-02-13
Isen, Forester W. (Department: 2747)
Data processing: speech signal processing, linguistics, language
Linguistics
Natural language
C704S001000, C704S255000
Reexamination Certificate
active
06188976
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to building statistical language models that are pertinent to a specific domain or field.
Statistical language models are used heavily in speech recognition, natural language understanding and other language processing applications. Such language models are used by a computer to facilitate a language processing task, much as a human employs context to understand spoken language. For instance, a speech recognition program will use a language model to select among phonetically equivalent words such as “to”, “too” and “two” when creating a transcription.
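Purely as an illustration (not drawn from the patent itself), the following sketch shows how a toy bigram language model might prefer “to” over its homophones in a given left context; the probability table and word sequences are hypothetical.

```python
# Hypothetical sketch: choosing among the homophones "to", "too" and "two"
# by scoring each candidate word sequence under a toy bigram language model.
# All probabilities below are made up for illustration.

BIGRAM = {
    ("want", "to"): 0.60, ("want", "too"): 0.01, ("want", "two"): 0.02,
    ("to", "go"): 0.30,   ("too", "go"): 0.01,   ("two", "go"): 0.01,
}

def score(words, floor=1e-6):
    """Product of bigram probabilities over consecutive word pairs."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= BIGRAM.get((prev, cur), floor)
    return p

candidates = [["want", w, "go"] for w in ("to", "too", "two")]
print(max(candidates, key=score))  # ['want', 'to', 'go']
```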
Generally, it is impractical to construct a language model that covers an entire spoken language, including specialized and technical fields. Such a language model would require large memory storage and present a complex processing task. Hence, domain-specific language models have been developed which are tailored to a specific domain or field. For instance, a speech recognition program may be tailored specifically to medical writings, to legal writings, or to a user's spoken questions and commands during use of a particular Internet site (e.g., sports, travel), and so forth. The domain-specific language model approach conserves memory, reduces the complexity of the processing task, and reduces the word-error rate as compared to general (domain-unrestricted) language models.
Building a language model usually requires a large amount of training data, which is burdensome to obtain. By way of example, training data for the language model component of a speech recognition program geared for medical dictation may be obtained by manual, human transcription of large volumes of dictation recorded from doctors. Because this is so time-consuming, it is desirable to have a method for constructing a domain-specific language model that uses only a very small amount of training data.
A number of prior art techniques have attempted to resolve this problem by employing some form of class-based language modeling. In class-based language modeling, certain words are grouped into classes, depending on their meaning, usage or function. Examples of class-based modeling are disclosed in: Brown et al., “Class-Based N-Gram Models of Natural Language,” Computational Linguistics, Vol. 18, No. 4, pp. 467-479, 1992; and Farhat et al., “Clustering Words for Statistical Language Models Based on Contextual Word Similarity,” IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 180-183, Atlanta, May 1996. A minimal illustrative sketch of the class-based factorization appears after this paragraph.
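For illustration, the following is a hedged sketch of the class-based bigram factorization described by Brown et al., in which P(w2 | w1) is approximated by P(class(w2) | class(w1)) multiplied by P(w2 | class(w2)). The word-to-class mapping and probability values are hypothetical placeholders for quantities that would be estimated from a training corpus.

```python
# Hedged sketch of a class-based bigram: P(w2 | w1) is approximated by
# P(class(w2) | class(w1)) * P(w2 | class(w2)), following the factorization
# described in Brown et al. (1992). All mappings and values are hypothetical.

WORD_CLASS = {"monday": "DAY", "tuesday": "DAY", "paris": "CITY", "rome": "CITY"}
CLASS_BIGRAM = {("CITY", "DAY"): 0.4, ("DAY", "CITY"): 0.1}                     # P(class2 | class1)
WORD_GIVEN_CLASS = {"monday": 0.5, "tuesday": 0.5, "paris": 0.6, "rome": 0.4}   # P(word | its class)

def class_bigram_prob(w1, w2, floor=1e-6):
    """Approximate P(w2 | w1) using the class-based factorization."""
    c1, c2 = WORD_CLASS[w1], WORD_CLASS[w2]
    return CLASS_BIGRAM.get((c1, c2), floor) * WORD_GIVEN_CLASS.get(w2, floor)

print(class_bigram_prob("paris", "monday"))  # 0.2 = 0.4 * 0.5
```

Because the number of classes is much smaller than the vocabulary, far fewer parameters must be estimated, which is why class-based models need less training data.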
Other conventional methods allowing for a reduction in the requisite training data employ some form of mixture modeling and task adaptation. See, for example, Crespo et al., “Language Model Adaptation for Conversational Speech Recognition Using Automatically Tagged Pseudo-Morphological Classes,” IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2, pp. 823-826, Munich, April 1997; Iyer et al., “Using Out-of-Domain Data to Improve In-Domain Language Models,” IEEE Signal Processing Letters, Vol. 4, No. 8, pp. 221-223, August 1997; and Masataki et al., “Task Adaptation Using MAP Estimation in N-Gram Language Modeling,” IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2, pp. 783-786, Munich, April 1997.
Embodiments of the present invention to be described exhibit certain advantages over these prior art techniques as will become apparent hereafter.
SUMMARY OF THE DISCLOSURE
The present invention pertains to a method and apparatus for building a domain-specific language model for use in language processing applications, e.g., speech recognition. A reference language model is generated based on a relatively small seed corpus containing linguistic units relevant to the domain. An external corpus containing a large number of linguistic units is accessed. Using the reference language model, linguistic units of the external corpus which have a sufficient degree of relevance to the domain are extracted. The reference language model is then updated based on the seed corpus and the extracted linguistic units. The procedure can be repeated iteratively until the language model is of satisfactory quality.
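The following is a hedged sketch of one plausible reading of this procedure, not the patent's actual implementation: it assumes a simple unigram reference model, uses average log-probability under that model as the relevance measure, and fixes an arbitrary relevance threshold and iteration count, none of which are specified in the summary above.

```python
# Hedged sketch of the iterative procedure summarized above. Assumptions (not
# taken from the patent): a unigram reference model, average log-probability
# as the relevance measure, and a fixed threshold / iteration count.
import math
from collections import Counter

def train_unigram(sentences):
    """Estimate a unigram model (word -> probability) from whitespace-tokenized sentences."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def relevance(sentence, model, floor=1e-7):
    """Average log-probability of the sentence's words under the model."""
    words = sentence.split()
    return sum(math.log(model.get(w, floor)) for w in words) / max(len(words), 1)

def build_domain_lm(seed_corpus, external_corpus, threshold=-12.0, iterations=3):
    model = train_unigram(seed_corpus)                        # reference model from the seed corpus
    for _ in range(iterations):
        extracted = [s for s in external_corpus
                     if relevance(s, model) > threshold]      # keep sufficiently in-domain sentences
        model = train_unigram(seed_corpus + extracted)        # update the model and repeat
    return model

# Usage with toy corpora (illustrative only):
seed = ["the patient reports chest pain", "administer the medication twice daily"]
external = ["the patient was discharged today", "the stock market fell sharply"]
lm = build_domain_lm(seed, external)
```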
REFERENCES:
patent: 5444617 (1995-08-01), Merialdo
patent: 5613036 (1997-03-01), Strong
patent: 5640487 (1997-06-01), Lau et al.
patent: 5899973 (1999-05-01), Bandara et al.
Placeway, P., “The Estimation of Powerful Language Models From Small and Large Corpora,” IEEE 1993, pp. II-33-II-36.
Masataki et al., “Task Adaptation Using Map Estimation in N-Gram Language Modeling,” IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 783-786, Munich, Apr. 1997.
Crespo et al., “Language Model Adaptation for Conversational Speech Recognition Using Automatically Tagged Pseudo-Morphological Classes,” IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 823-826, Munich, Apr. 1997.
Farhat et al., “Clustering Words for Statistical Language Models Based on Contextual Word Similarity,” IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 180-183, Atlanta, May 1996.
Iyer et al., “Using Out-Of-Domain Data to Improve In-Domain Language Models,” IEEE Signal Processing Letters, vol. 4, No. 8, pp. 221-223, Aug. 1997.
Issar, S., “Estimation of Language Models for New Spoken Language Applications,” International Conference on Spoken Language Processing, vol. 2, pp. 869-872, Philadelphia, Oct. 1996.
Brown et al., “Class-Based n-gram Models of Natural Language,” Computational Linguistics, vol. 18, No. 4, pp. 467-479, 1992.
Gopalakrishnan Ponani S.
Printz Harry W.
Ramaswamy Ganesh N.
Edouard Patrick N.
F. Chau & Associates LLP
International Business Machines Corporation
Isen Forester W.