Proper name identification in Chinese

Image analysis – Pattern recognition – Ideographic characters

Reexamination Certificate

Details

C382S177000, C382S187000, C382S218000, C704S009000, C707S793000

active

06694055

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates generally to the field of natural language processing. More specifically, the present invention relates to word segmentation.
Word segmentation refers to the process of identifying individual words that make up an expression of language, such as in written text. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, speech recognition, information retrieval and performing natural language parsing and understanding.
Performing word segmentation of English text is rather straightforward, because spaces and punctuation marks generally delimit individual words in the text. However, in Chinese text, word boundaries are implicit rather than explicit. Consider the sentence in Table 1 below:
TABLE 1
Despite the lack of punctuation and spaces in the sentence, a reader of Chinese would recognize the sentence in Table 1 as being comprised of the words shown below:
TABLE 2
Wang    Kaiwen    come from    Nanjing
where "Wang Kaiwen" can be treated as a single word (i.e., a proper name).
As shown above, proper names are written in ordinary Chinese characters with no special markings such as capitalization in English or in other European languages. In addition, there are no spaces or blanks in the text to separate proper names from other words. Chinese names also use characters that can form parts of other words, or can function as other nouns, verbs or adjectives in a different context. As a result, proper names are “hidden” in Chinese text, which creates a serious problem for the processing of Chinese text. It has been estimated that proper names make up about 2% of average Chinese text, yet they cause at least 50% of the errors made by state-of-the-art segmentation systems. Therefore, an accurate and efficient approach to automatically performing segmentation with proper name recognition would have significant utility.
SUMMARY OF THE INVENTION
A first aspect of the present invention is a word segmentation method to identify proper names in input text. The method includes locating a sequence of single-characters in the input text not forming a part of a multiple-character word. The method further includes comparing the sequence of single-characters to a lexical knowledge base to identify if a first portion of the sequence corresponds to stored identifiable portions of a proper name, and comparing the sequence of single-characters to the lexical knowledge base to identify if a second portion of the sequence proximate the first portion includes characters known to comprise a second portion of a proper name.
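The first aspect can be illustrated with a minimal Python sketch. It is not the patented implementation: the greedy forward maximum matching segmenter, the tiny word lists, the function names and the two-character limit on given names are all assumptions made for illustration. The sample sentence is a reconstruction from the English glosses "Wang Kaiwen come from Nanjing", since the original characters are not reproduced here.

```python
# Toy lexical knowledge base; every entry is an illustrative assumption.
MULTI_CHAR_WORDS = {"来自", "南京"}          # ordinary multiple-character words
FAMILY_NAMES = {"王", "李", "欧阳"}          # stored "first portions" of names
GIVEN_NAME_CHARS = {"凯", "文", "明", "华"}  # characters seen in "second portions"


def segment(text, max_len=4):
    """Greedy forward maximum matching; unmatched characters stay single."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in MULTI_CHAR_WORDS:
                tokens.append(piece)
                i += n
                break
    return tokens


def single_char_runs(tokens):
    """Yield each maximal run of consecutive single-character tokens as a string."""
    run = []
    for tok in tokens + [""]:  # the empty sentinel closes a trailing run
        if len(tok) == 1:
            run.append(tok)
        elif run:
            yield "".join(run)
            run = []


def find_names(text, max_given=2):
    """Return candidate names: a family name followed by given-name characters."""
    families = sorted(FAMILY_NAMES, key=len, reverse=True)  # longest first
    names = []
    for run in single_char_runs(segment(text)):
        i = 0
        while i < len(run):
            family = next((f for f in families if run.startswith(f, i)), None)
            if family is None:
                i += 1
                continue
            start_given = i + len(family)
            end = start_given
            # Extend over adjacent characters known to occur in given names.
            while (end < len(run) and end - start_given < max_given
                   and run[end] in GIVEN_NAME_CHARS):
                end += 1
            if end > start_given:
                names.append(run[i:end])
                i = end
            else:
                i += 1
    return names


print(find_names("王凯文来自南京"))
```

In this sketch only characters left over after dictionary matching are considered, which mirrors the requirement that the sequence not form part of a multiple-character word.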
A second aspect of the present invention is a method to identify non-Chinese originated names contained in Chinese text. The method includes locating a sequence of three or more single-characters in input text not forming a part of a multiple-character word, and comparing the sequence of single-characters to a lexical knowledge base to identify if characters contained in the sequence correspond to characters used in non-Chinese originated names.
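A minimal Python sketch of the second aspect follows. It assumes the run of single-characters has already been located by a segmenter, and the transliteration character set and the function name are hypothetical; a real knowledge base would hold a far larger set.

```python
# Characters assumed, for illustration only, to be typical of transliterated names.
TRANSLITERATION_CHARS = set("克林顿斯尔德巴布特拉")


def find_foreign_names(single_char_run, min_len=3):
    """Return substrings of at least min_len characters drawn entirely
    from the transliteration character set."""
    names, start = [], None
    # A trailing NUL acts as a sentinel so a match at the end is flushed.
    for idx, ch in enumerate(single_char_run + "\0"):
        if ch in TRANSLITERATION_CHARS:
            if start is None:
                start = idx
        else:
            if start is not None and idx - start >= min_len:
                names.append(single_char_run[start:idx])
            start = None
    return names


print(find_foreign_names("克林顿说"))
```

The three-character minimum follows the text above; shorter runs are ignored even when every character is in the set.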
A third aspect of the present invention includes a method for creating a lexical knowledge base for identifying proper names in input text. The method includes comparing a list of full proper names to be identified and a list of known portions of the full proper names and removing from each of the proper names any known portions contained therein to obtain a list comprising remaining portions of the full proper names. Indications are stored in the lexical knowledge base for the list of full proper names, for the list of known portions of the full proper names, for the list of remaining portions of the full proper names and positional information of characters in each of the remaining portions of the full proper names.
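The construction step of the third aspect might be sketched in Python as follows. The sketch assumes that a "known portion" is a leading family name and that positional information means the zero-based index of a character within the remaining portion; the function name, dictionary keys and sample names are hypothetical.

```python
from collections import defaultdict


def build_name_knowledge_base(full_names, known_portions):
    """Strip the known leading portion from each full name and record
    where each remaining character occurs."""
    by_length = sorted(known_portions, key=len, reverse=True)  # longest match first
    remaining = set()
    positions = defaultdict(set)
    for name in full_names:
        prefix = next((p for p in by_length if name.startswith(p)), None)
        if prefix is None:
            continue  # no known portion found; nothing to strip
        rest = name[len(prefix):]
        if not rest:
            continue  # the name consisted only of the known portion
        remaining.add(rest)
        for pos, ch in enumerate(rest):
            positions[ch].add(pos)
    return {
        "full_names": set(full_names),
        "known_portions": set(known_portions),
        "remaining_portions": remaining,
        "char_positions": dict(positions),
    }


kb = build_name_knowledge_base(["王凯文", "李明"], ["王", "李"])
print(kb["remaining_portions"], kb["char_positions"])
```

Storing character positions lets a later lookup distinguish, for example, characters that tend to begin a given name from those that tend to end one.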
Instructions can be provided on a computer readable medium to implement any of the above-mentioned methods.
A fourth aspect of the present invention is a computer readable medium comprising a lexical knowledge base for use in identifying proper names in input text. The lexical knowledge base includes, for each of a plurality of words, an indication that the word corresponds to a first portion of a proper name, and for each of a plurality of characters, an indication that the character is a part of a second portion of a proper name.
A fifth aspect of the present invention is a computer readable medium comprising a lexical knowledge base for use in identifying non-Chinese originated names in Chinese text. The lexical knowledge base includes, for each of a plurality of characters, an indication that the character is a part of a non-Chinese originated name.
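One possible way to store the indications described in the fourth and fifth aspects is a set of bit flags per lexicon entry, sketched below in Python. The flag names, the sample entries and the helper function are assumptions; the text does not specify a storage format.

```python
from enum import Flag, auto


class NameAttr(Flag):
    """Hypothetical per-entry indications stored in the knowledge base."""
    NONE = 0
    FAMILY_NAME = auto()        # word is a first portion of a proper name
    GIVEN_NAME_CHAR = auto()    # character occurs in a second portion
    FOREIGN_NAME_CHAR = auto()  # character occurs in non-Chinese originated names


# Illustrative entries only; one entry may carry several indications.
LEXICON = {
    "王": NameAttr.FAMILY_NAME,
    "文": NameAttr.GIVEN_NAME_CHAR,
    "克": NameAttr.GIVEN_NAME_CHAR | NameAttr.FOREIGN_NAME_CHAR,
    "南京": NameAttr.NONE,
}


def has_attr(entry, attr):
    """True if the lexicon entry carries the given indication."""
    return bool(LEXICON.get(entry, NameAttr.NONE) & attr)


print(has_attr("克", NameAttr.FOREIGN_NAME_CHAR))
```

Flags keep each lookup to a single dictionary access while allowing one character to serve several roles.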


