Word segmentation in Chinese text

Image analysis – Image segmentation – Segmenting individual characters or words

Reexamination Certificate

Details

C382S185000

active

06640006

ABSTRACT:

TECHNICAL FIELD
The invention relates generally to the field of natural language processing, and, more specifically, to the field of word segmentation.
BACKGROUND OF THE INVENTION
Word segmentation refers to the process of identifying the individual words that make up an expression of language, such as text. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, and performing natural language parsing and understanding, all of which benefit from an identification of individual words.
Performing word segmentation of English text is rather straightforward, since spaces and punctuation marks generally delimit the individual words in the text. Consider the English sentence in Table 1 below.
TABLE 1
The motion was then tabled--that is, removed
indefinitely from consideration.
By identifying each contiguous sequence of spaces and/or punctuation marks as the end of the word preceding the sequence, the English sentence in Table 1 may be straightforwardly segmented as shown in Table 2 below.
TABLE 2
The  motion  was  then  tabled  --  that  is  ,  removed
indefinitely  from  consideration  .
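The delimiter rule described above can be sketched in a few lines of Python. This is an illustration only and not part of the patent text; the function names segment_english and words_only and the regular expression are assumptions chosen for the example.

```python
import re

def segment_english(text):
    """Split text into words and runs of punctuation, dropping whitespace."""
    # \w+ matches one word; [^\w\s]+ matches a run of punctuation such as "--".
    return re.findall(r"\w+|[^\w\s]+", text)

def words_only(text):
    """Keep only the word tokens, discarding punctuation runs."""
    return [tok for tok in segment_english(text) if tok[0].isalnum()]

sentence = ("The motion was then tabled--that is, removed "
            "indefinitely from consideration.")
print(words_only(sentence))
```

The same rule yields nothing useful for text without delimiters, since a whole sentence would come back as a single token.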
In Chinese text, word boundaries are implicit rather than explicit. Consider the sentence in Table 3 below, meaning “The committee discussed this problem yesterday afternoon in Buenos Aires.”
TABLE 3
Despite the absence of punctuation and spaces from the sentence, a reader of Chinese would recognize the sentence in Table 3 as consisting of the words separately underlined in Table 4 below.
TABLE 4
It can be seen from the examples above that Chinese word segmentation cannot be performed in the same manner as English word segmentation. An accurate and efficient approach to automatically performing Chinese segmentation would nonetheless have significant utility.
SUMMARY OF THE INVENTION
In accordance with the invention, a word segmentation software facility (“the facility”) provides word segmentation services for text in unsegmented languages such as Chinese by (1) evaluating the possible combinations of characters in an input sentence and discarding those unlikely to represent words in the input sentence, (2) looking up the remaining combinations of characters in a dictionary to determine whether they may constitute words, and (3) submitting the combinations of characters determined to be words to a natural language parser as alternative lexical records representing the input sentence. The parser generates a syntactic parse tree representing the syntactic structure of the input sentence, which contains only those lexical records representing the combinations of characters certified to be words in the input sentence. When submitting the lexical records to the parser, the facility weights the lexical records so that longer combinations of characters, which more commonly represent the correct segmentation of a sentence than shorter combinations of characters, are considered by the parser before shorter combinations of characters.
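The lookup and weighting steps can be illustrated with a minimal Python sketch. It is not the facility's implementation: the toy dictionary, the (start, word) record layout, and the max_len limit are all assumptions made for the example, and the pruning of step (1) is omitted, so every substring up to max_len characters is looked up.

```python
# Hypothetical toy dictionary; a real lexicon would be far larger.
TOY_DICTIONARY = {"昨天", "下午", "天下", "昨", "天", "下", "午"}

def lexical_records(sentence, dictionary, max_len=4):
    """Return (start, word) records for every substring found in the dictionary.

    Records are ordered so that longer words come first, mirroring the idea
    that longer combinations are offered to the parser before shorter ones.
    """
    records = []
    for start in range(len(sentence)):
        longest = min(max_len, len(sentence) - start)
        for length in range(1, longest + 1):
            chunk = sentence[start:start + length]
            if chunk in dictionary:
                records.append((start, chunk))
    # Sort by descending word length, then by position in the sentence.
    records.sort(key=lambda rec: (-len(rec[1]), rec[0]))
    return records

for start, word in lexical_records("昨天下午", TOY_DICTIONARY):
    print(start, word)
```

In this toy input the records for 昨天 and 天下 overlap, so both cannot appear in the same segmentation; in the approach described above, the parser makes that choice.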
In order to facilitate discarding combinations of characters unlikely to represent words in the input sentence, the facility adds to the dictionary, for each character occurring in the dictionary, (1) indications of all of the different combinations of word length and character position in which the word appears, and (2) indications of all of the characters that may follow this character when this character begins a word. The facility further adds (3) indications to multiple-character words of whether sub-words within the multiple-character words are viable and should be considered. In processing a sentence, the facility discards (1) combinations of characters in which any character is used in a word length/position combination not occurring in the dictionary, and (2) combinations of characters in which the second character is not listed as a possible second character of the first character. The facility further discards (3) combinations of characters occurring in a word for which sub-words are not to be considered.
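The three dictionary annotations and the corresponding discard rules can be sketched as follows. This is a simplified reading of the paragraph above, not the patented data layout: the function names, the dict-of-sets representation, and the toy word list are assumptions.

```python
def build_indexes(words):
    """Derive the per-character annotations from a word list."""
    length_pos = {}  # character -> set of (word length, position in word)
    followers = {}   # first character -> set of possible second characters
    for word in words:
        for pos, ch in enumerate(word):
            length_pos.setdefault(ch, set()).add((len(word), pos))
        if len(word) > 1:
            followers.setdefault(word[0], set()).add(word[1])
    return length_pos, followers

def may_be_word(chunk, length_pos, followers):
    """Cheap pre-check; False means the chunk need not be looked up."""
    n = len(chunk)
    # Rule (1): every character must occur at this position in a word of this length.
    for pos, ch in enumerate(chunk):
        if (n, pos) not in length_pos.get(ch, ()):
            return False
    # Rule (2): the second character must be a known follower of the first.
    if n > 1 and chunk[1] not in followers.get(chunk[0], ()):
        return False
    return True

def drop_subwords(records, no_subwords):
    """Rule (3): drop records lying strictly inside a word flagged 'no sub-words'."""
    blocked = [(s, s + len(w)) for s, w in records if w in no_subwords]
    kept = []
    for s, w in records:
        inside = any(b <= s and s + len(w) <= e and e - b > len(w)
                     for b, e in blocked)
        if not inside:
            kept.append((s, w))
    return kept

# Hypothetical toy word list.
words = ["昨天", "下午", "天下", "昨", "天", "下", "午"]
length_pos, followers = build_indexes(words)
print(may_be_word("昨天", length_pos, followers))
print(may_be_word("天午", length_pos, followers))
```

A chunk that passes may_be_word is only a candidate and still has to be confirmed by a dictionary lookup; the checks exist to skip lookups that cannot succeed.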
In this manner, the facility both minimizes the number of character combinations looked up in the dictionary and utilizes the syntactic context of the sentence to differentiate between alternative segmentation results that each consist of valid words.
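To see why syntactic context is needed, it helps to enumerate the alternative segmentations that consist entirely of valid words. The sketch below does this by simple recursion over a hypothetical toy dictionary; it stands in for neither the parser nor the facility, and its running time grows exponentially with sentence length.

```python
def all_segmentations(sentence, dictionary):
    """Return every way to split the sentence into dictionary words."""
    if not sentence:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(sentence) + 1):
        head = sentence[:end]
        if head in dictionary:
            for rest in all_segmentations(sentence[end:], dictionary):
                results.append([head] + rest)
    return results

# Hypothetical toy dictionary.
toy_dictionary = {"昨天", "下午", "天下", "昨", "天", "下", "午"}
# Fewest words first, which puts segmentations made of longer words at the front.
alternatives = sorted(all_segmentations("昨天下午", toy_dictionary), key=len)
for words in alternatives:
    print(" / ".join(words))
```

Every alternative printed is built from valid words, so the dictionary alone cannot choose among them; a preference for longer words orders the candidates, and a parser can reject those that do not yield a well-formed sentence.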


