Reexamination Certificate
2000-03-30
2002-12-03
Edouard, Patrick N. (Department: 2654)
Data processing: speech signal processing, linguistics, language
Linguistics
Dictionary building, modification, or prioritization
C707S793000
active
06490549
ABSTRACT:
FIELD OF THE INVENTION
The invention generally relates to natural language processing, and more particularly, to the automatic transformation of the orthography of a text stream, such as the proper capitalization of words in a stream of text, especially with respect to automatic speech recognition.
BACKGROUND ART
Capitalized word forms in English can be divided into two main types: those that are determined by where the term occurs (positional capitalizations) and those that are determined by what the term denotes (denotational capitalizations). In English, positional capitalization occurs, for example, at the beginning of a sentence or at the beginning of quoted speech. Denotational capitalization is, to a first approximation, dependent upon whether the term or expression is a proper name.
Positional capitalization is straightforward; the rules governing it are very clear. Splitting written text into sentences is a non-trivial task because of abbreviations and other phenomena. In the context of dictation and automatic speech recognition, by contrast, sentence splitting is very accurate because the user must dictate the sentence-ending punctuation, so simple pattern matching allows one to do positional capitalization with near-perfect accuracy.
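As an illustration of the pattern matching described above, the following is a minimal Python sketch, not the patented implementation. It assumes the dictated text arrives as a list of tokens in which sentence-ending punctuation is already an explicit token; the names SENTENCE_END and capitalize_positional are hypothetical, and capitalization at the start of quoted speech is left out for brevity.

```python
# Hypothetical set of sentence-ending tokens; in dictation the user speaks
# these explicitly, so they appear as separate tokens in the stream.
SENTENCE_END = {".", "?", "!"}

def capitalize_positional(tokens):
    """Capitalize the first alphabetic token of each sentence."""
    out = []
    capitalize_next = True  # the stream is assumed to start a sentence
    for tok in tokens:
        if tok in SENTENCE_END:
            out.append(tok)
            capitalize_next = True
        elif capitalize_next and tok[:1].isalpha():
            out.append(tok[0].upper() + tok[1:])
            capitalize_next = False
        else:
            out.append(tok)
    return out

print(capitalize_positional(["the", "dog", "barked", ".", "it", "ran", "."]))
```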
Denotational capitalization is much harder to do automatically. It can be viewed as the flip side of proper name recognition, an information extraction task for which the current state of the art reports about a 94% combined precision and recall over a restricted set of name types. In proper name recognition, the goal is to correctly determine which expressions refer to (the same) named entities in a text, using the words, their position and their capitalization. In denotational capitalization, the goal is instead to use an expression and its context to determine whether it is a proper name and therefore should be capitalized.
Existing speech recognition systems tend to make a large number of errors on capitalization: about 5-7% of dictated words, in English. Most of these errors are errors of denotational capitalization. The difficulty arises for terms which are both common nouns (or other uncapitalized words) and constituents of proper nouns, as in “Bill Gates” or “Black's Disease.”
SUMMARY OF THE INVENTION
Throughout the following description and claims, the term ‘tag’ is used to denote the properties that annotate a word or word phrase, including part of speech information. The term ‘feature’ is used in the maximum-entropy sense to mean the co-occurrence of certain items or properties.
A representative embodiment of the present invention includes a method of automatically rewriting the orthography of a stream of text words. If a word in the stream has an entry in an orthography rewrite lexicon, the word is automatically replaced with an orthographically rewritten form of the word from the orthography rewrite lexicon. In addition, selected words in the stream are compared to a plurality of features weighted by a maximum entropy-based algorithm, to automatically determine whether to rewrite the orthography of any of the selected words. Orthographic rewriting may include properly capitalizing and/or abbreviating words in the stream of text words.
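The lexicon step can be pictured as a plain dictionary lookup. The Python sketch below is illustrative only: the lexicon entries and the names REWRITE_LEXICON and rewrite_words are assumptions, not taken from the patent, and the sketch covers only the unconditional replacement of words that have an entry. Words without an entry pass through unchanged and would be left to the maximum entropy step.

```python
# Hypothetical orthography rewrite lexicon: recognized form -> rewritten form.
# Only words whose rewritten form is unambiguous are listed here.
REWRITE_LEXICON = {
    "boston": "Boston",
    "monday": "Monday",
    "etcetera": "etc.",
}

def rewrite_words(words, lexicon=REWRITE_LEXICON):
    """Replace each word that has a lexicon entry; leave other words as-is."""
    return [lexicon.get(w, w) for w in words]

print(rewrite_words(["see", "you", "in", "boston", "on", "monday"]))
```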
In a further embodiment, the method also includes, if a series of adjacent words in the stream has an entry in a phrase rewrite lexicon, replacing the series of adjacent words with a phrase form of the series of words from the phrase rewrite lexicon. Annotating linguistic tags may be associated with the orthographically rewritten form of the word. The method may also include providing linguistic tags to selected words in the stream, using context-sensitive rewrite rules to change the orthography of words in the stream based on their linguistic tags, and weighting the application of these rules in specific contexts according to maximum entropy weighting.
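A phrase rewrite lexicon can be sketched in the same spirit. The Python example below scans the stream for runs of adjacent words that have a phrase entry. Preferring the longest matching run is a design choice of this sketch rather than something the text specifies, and the entries and the names PHRASE_LEXICON and rewrite_phrases are hypothetical.

```python
# Hypothetical phrase rewrite lexicon: tuple of adjacent words -> phrase form.
PHRASE_LEXICON = {
    ("new", "york"): "New York",
    ("new", "york", "city"): "New York City",
    ("united", "states"): "United States",
}
MAX_PHRASE_LEN = max(len(key) for key in PHRASE_LEXICON)

def rewrite_phrases(words, lexicon=PHRASE_LEXICON, max_len=MAX_PHRASE_LEN):
    """Replace runs of adjacent words found in the lexicon (longest run first)."""
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate run first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            key = tuple(words[i:i + n])
            if key in lexicon:
                out.append(lexicon[key])
                i += n
                break
        else:
            # No phrase starts at this position; keep the single word.
            out.append(words[i])
            i += 1
    return out

print(rewrite_phrases(["i", "love", "new", "york", "city"]))
```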
At least one of the features may be a context-dependent probability distribution representing a likelihood of a given word in a given context being in a given orthographic form. In a further embodiment, the method determines, for each selected word, an orthographic rewrite probability representing a normalized product of the weighted features for that word; if the orthographic rewrite probability is greater than a selected threshold probability, that selected word is replaced with an orthographically rewritten form.
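The normalized product and threshold test can be illustrated with a small Python sketch. It assumes a two-outcome model (capitalized versus lowercase) in which each active feature contributes one multiplicative weight per outcome, in the usual maximum entropy form. The feature names, the weight values, the 0.5 default threshold and the function names are all made up for illustration; in practice the weights would come from training.

```python
# Hypothetical feature weights, keyed by (feature, outcome).
# A missing entry counts as a neutral weight of 1.0.
WEIGHTS = {
    ("prev=mr.", "cap"): 4.0,
    ("prev=mr.", "lower"): 0.25,
    ("word=bill", "cap"): 1.5,
    ("word=bill", "lower"): 1.0,
    ("prev=the", "cap"): 0.5,
    ("prev=the", "lower"): 2.0,
}

def rewrite_probability(active_features, weights=WEIGHTS):
    """Normalized product of the weights of the active features for 'cap'."""
    scores = {}
    for outcome in ("cap", "lower"):
        product = 1.0
        for feat in active_features:
            product *= weights.get((feat, outcome), 1.0)
        scores[outcome] = product
    # Normalize over both outcomes so the two probabilities sum to 1.
    return scores["cap"] / (scores["cap"] + scores["lower"])

def maybe_capitalize(word, active_features, threshold=0.5):
    """Capitalize the word only if the rewrite probability exceeds the threshold."""
    if rewrite_probability(active_features) > threshold:
        return word[:1].upper() + word[1:]
    return word

print(maybe_capitalize("bill", ["prev=mr.", "word=bill"]))
print(maybe_capitalize("bill", ["prev=the", "word=bill"]))
```

With these made-up weights, "bill" after "mr." gets a capitalization probability of 6 / 6.25 = 0.96 and is rewritten, while "bill" after "the" gets 0.75 / 2.75, roughly 0.27, and is left alone.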
REFERENCES:
patent: 5467425 (1995-11-01), Lau et al.
patent: 5819265 (1998-10-01), Ravin et al.
patent: 6167368 (2000-12-01), Wacholder
Reynar et al.; “A Maximum Entropy Approach to Identifying Sentence Boundaries”, Proced. of the 5th Conf. on Applied Natural Language, 1997.*
Borthwick et al.; “Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition” 1998.*
Berger, et al, “A Maximum Entropy Approach to Natural Language Processing”, Association for Computational Linguistics, 1996, pp. 1-36.
Chen, et al, “A Gaussian Prior for Smoothing Maximum Entropy Models”, Technical Report CMU-CS-99-108, Carnegie Mellon University, 1999.
Della Pietra, Stephen, et al, “Inducing Features of Random Fields”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, vol. 19, No. 4, pp. 380-393.
Adams Jeffrey P.
Ulicny Brian
Vasserman Alex
Vozila Paul
Bromberg & Sunstein LLP
Edouard Patrick N.
ScanSoft, Inc.
Automatic orthographic transformation of a text stream