System for Chinese tokenization and named entity recognition

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Reexamination Certificate

Details

C704S008000, C704S251000, C704S257000

active

06311152

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to the field of natural language processing, and in particular to systems for tokenizing and recognizing named entities in a text corpus of an ideographic language.
BACKGROUND
Natural language processing is an area of technology experiencing active research interest. In particular, significant activity has been undertaken in respect of the English language with positive results. However, little activity has been reported for ideographic languages such as Chinese. In an ideographic language, a word is made of one or more ideograms, where each ideogram is a symbol representing something such as an object or idea without expressing its sound(s).
The task of tokenizing ideographic languages such as Chinese and recognizing named entities (i.e., proper names) is more difficult than that for the English language for a number of reasons. Firstly, unlike English text, Chinese text has no boundaries between words. For example, a sentence is often a contiguous string of ideograms, where one or more ideograms may form a word, without spaces between “words”. Secondly, the uniformity of character strings in the Chinese writing system does not indicate proper names. In the English language, capitalization indicates proper names. The capitalized feature of proper names in English provides important information on the location and boundary of proper names in a text corpus.
Therefore, a need clearly exists for a system for tokenization and named-entity recognition of ideographic language.
SUMMARY
In accordance with a first aspect of the invention, there is disclosed a method of tokenization and named entity recognition of ideographic language. The method includes the steps of: generating a word lattice for a string of ideographic characters using finite state grammars and a system lexicon; generating segmented text by determining word boundaries in the string of ideographic characters using the word lattice dependent upon a contextual language model and one or more entity language models; and recognizing one or more named entities in the string of ideographic characters using the word lattice dependent upon the contextual language model and the one or more entity language models.
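The lattice-then-segment idea can be illustrated with a minimal sketch. It is not the patented method: the toy lexicon, its probabilities, `MAX_WORD_LEN`, `UNKNOWN_PROB`, and both function names are assumptions made for illustration, and a plain unigram score stands in for the contextual and entity language models named above.

```python
import math

# Hypothetical toy lexicon: word -> unigram probability (illustrative values only).
LEXICON = {
    "北京": 0.02,
    "大学": 0.02,
    "北京大学": 0.01,
    "北": 0.001,
    "京": 0.001,
    "大": 0.002,
    "学": 0.002,
    "生": 0.002,
    "学生": 0.015,
}
MAX_WORD_LEN = 4      # assumed upper bound on word length
UNKNOWN_PROB = 1e-8   # assumed fallback for characters missing from the lexicon


def build_lattice(text):
    """Return edges (start, end, word) for lexicon words and all single characters."""
    edges = []
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            word = text[start:end]
            if word in LEXICON or end == start + 1:
                edges.append((start, end, word))
    return edges


def best_segmentation(text):
    """Pick the highest-scoring path through the lattice by dynamic programming."""
    best = [(-math.inf, None)] * (len(text) + 1)
    best[0] = (0.0, None)
    # Sorting by start position guarantees best[start] is final before it is used.
    for start, end, word in sorted(build_lattice(text)):
        score = best[start][0] + math.log(LEXICON.get(word, UNKNOWN_PROB))
        if score > best[end][0]:
            best[end] = (score, (start, word))
    # Follow back-pointers from the end of the string to recover the words.
    words, pos = [], len(text)
    while pos > 0:
        start, word = best[pos][1]
        words.append(word)
        pos = start
    return words[::-1]
```

With these made-up probabilities, `best_segmentation("北京大学生")` prefers the path through the single lattice edge 北京大学 over the competing paths 北京 / 大学 / 生 and 北京 / 大 / 学生. A different scoring model could choose differently, which is why the scoring of lattice paths matters as much as the lattice itself.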
Preferably, the method further includes the step of combining the contextual language model and the one or more entity language models. The contextual language model and the one or more entity language models may each be class-based language models.
Preferably, the contextual language model and the one or more entity language models incorporate local and contextual linguistic information, respectively, for producing prioritized word and corresponding category sequences. The contextual language model and the one or more entity language models may be dependent upon an n-gram paradigm.
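One common way to combine class-based n-gram models is to approximate the probability of a word given its history by a class-transition term multiplied by a word-given-class term. The sketch below assumes that factorization with bigrams; the patent text here does not spell out its formulas, so the class names, the probability tables, `FLOOR`, and `sequence_log_prob` are all hypothetical.

```python
import math

# Hypothetical contextual model: P(class_i | class_{i-1}); values are illustrative.
CLASS_BIGRAM = {
    ("<s>", "PERSON"): 0.2,
    ("<s>", "WORD"): 0.8,
    ("PERSON", "WORD"): 0.9,
    ("WORD", "WORD"): 0.7,
    ("WORD", "PERSON"): 0.1,
}
# Hypothetical per-class models: P(word | class); values are illustrative.
WORD_GIVEN_CLASS = {
    ("王小明", "PERSON"): 0.001,
    ("王", "WORD"): 0.002,
    ("小", "WORD"): 0.01,
    ("明", "WORD"): 0.005,
    ("来", "WORD"): 0.02,
    ("了", "WORD"): 0.05,
}
FLOOR = 1e-9  # assumed floor probability for unseen events


def sequence_log_prob(tagged_words):
    """Sum log P(class | previous class) + log P(word | class) over the sequence."""
    prev = "<s>"
    total = 0.0
    for word, cls in tagged_words:
        total += math.log(CLASS_BIGRAM.get((prev, cls), FLOOR))
        total += math.log(WORD_GIVEN_CLASS.get((word, cls), FLOOR))
        prev = cls
    return total


# Two competing readings of the same character string.
as_name = [("王小明", "PERSON"), ("来", "WORD"), ("了", "WORD")]
as_chars = [("王", "WORD"), ("小", "WORD"), ("明", "WORD"), ("来", "WORD"), ("了", "WORD")]
```

Under these illustrative numbers, `sequence_log_prob(as_name)` is higher than `sequence_log_prob(as_chars)`, so the reading that treats 王小明 as a single PERSON entity would be ranked first. Scoring word and category jointly is what lets one search produce both the segmentation and the entity labels.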
Preferably, the lexicon includes single ideographic characters, words, and predetermined features of the characters and words. The lattice-generating step may include the step of generating one or more elements of the lattice using the lexicon.
Optionally, the finite state grammars are a dynamic and complementary extension of the lexicon for creating named entity hypotheses. The finite state grammars may run on the predetermined features contained in the lexicon to suggest possible entities, entity boundaries and entity categories.
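As a rough illustration of a finite state grammar running over lexicon features, the sketch below proposes person-name hypotheses with their boundaries and category. The feature names (`SURNAME`, `GIVEN`), the transition table, and `propose_person_names` are assumptions chosen for the example and are not taken from the patent.

```python
# Hypothetical lexicon features per character; the feature names are illustrative.
FEATURES = {
    "王": {"SURNAME"},
    "李": {"SURNAME"},
    "小": {"GIVEN"},
    "明": {"GIVEN"},
    "华": {"GIVEN"},
}

# Toy finite-state grammar: state -> {feature: next_state}.
# It accepts a SURNAME character followed by one or two GIVEN characters.
TRANSITIONS = {
    "START": {"SURNAME": "S1"},
    "S1": {"GIVEN": "G1"},
    "G1": {"GIVEN": "G2"},
}
ACCEPTING = {"G1", "G2"}


def propose_person_names(text):
    """Return (start, end, 'PERSON') hypotheses found by running the grammar."""
    hypotheses = []
    for start in range(len(text)):
        state = "START"
        for pos in range(start, len(text)):
            feats = FEATURES.get(text[pos], set())
            next_state = None
            for feat, target in TRANSITIONS.get(state, {}).items():
                if feat in feats:
                    next_state = target
                    break
            if next_state is None:
                break  # the grammar cannot continue from this character
            state = next_state
            if state in ACCEPTING:
                # Every accepting state yields a candidate span of the lattice.
                hypotheses.append((start, pos + 1, "PERSON"))
    return hypotheses
```

For the string 王小明来了 this yields two overlapping candidates, 王小 and 王小明. The grammar only suggests hypotheses; in this sketch they would be added to the lattice as extra edges and left for the language models to rank.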
In accordance with a second aspect of the invention, there is disclosed an apparatus for tokenization and named entity recognition of ideographic language, the apparatus including: a device for generating a word lattice for a string of ideographic characters using finite state grammars and a system lexicon; a device for generating segmented text by determining word boundaries in the string of ideographic characters using the word lattice dependent upon a contextual language model and one or more entity language models; and a device for recognizing one or more named entities in the string of ideographic characters using the word lattice dependent upon the contextual language model and the one or more entity language models.
In accordance with a third aspect of the invention, there is disclosed a computer program product having a computer readable medium having a computer program recorded therein for tokenization and named entity recognition of ideographic language. The computer program product includes: a module for generating a word lattice for a string of ideographic characters using finite state grammars and a system lexicon; a module for generating segmented text by determining word boundaries in the string of ideographic characters using the word lattice dependent upon a contextual language model and one or more entity language models; and a module for recognizing one or more named entities in the string of ideographic characters using the word lattice dependent upon the contextual language model and the one or more entity language models.


REFERENCES:
patent: 5109509 (1992-04-01), Katayama et al.
patent: 5212730 (1993-05-01), Wheatley et al.
patent: 5819265 (1998-10-01), Ravin et al.
patent: 5832480 (1998-11-01), Byrd et al.
patent: WO 97/40453 (1997-10-01), None
patent: WO 97/41680 (1997-11-01), None
Cucchiarelli et al., “Automatic Semantic Tagging of Unknown Proper Names”, Proceedings of COLING-ACL'98, pp. 286-292, 1998.
Borthwick et al., “Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition”, Proceedings of Sixth Workshop on Very Large Corpora, pp. 152-160, 1998.
Shuanghu et al., “Building class-based language models with contextual statistics”, Proceedings of ICASSP'98, pp. 173-176, Seattle, Washington, USA, 1998.
Bikel et al., “Nymble: a High-Performance Learning Name-Finder”, cmp-lg/9803003.
Nam et al., “A Local Grammar-based Approach to Recognising of Proper Names in Korean Text”, Proceedings of Fifth Workshop on Very Large Corpora, pp. 273-288, 1997.
