Parameterized word segmentation of unsegmented text

Image analysis – Image segmentation – Segmenting individual characters or words

Reexamination Certificate

Details

C382S181000, C382S229000, C704S010000

active

06678409

ABSTRACT:

BACKGROUND OF THE INVENTION
Word segmentation refers to the process of identifying the individual words that make up an expression of language, such as text. Word segmentation is useful for, among other things, checking spelling and grammar, synthesizing speech from text, and performing natural language parsing and understanding, all of which benefit from an identification of individual words.
Performing word segmentation of English text is rather straightforward, since spaces and punctuation marks generally delimit the individual words in the text. Consider the English sentence: “The motion was then tabled—that is removed indefinitely from consideration.”
By identifying each contiguous sequence of spaces and/or punctuation marks as the end of the word preceding the sequence, this English sentence may be straightforwardly segmented as follows:
The | motion | was | then | tabled | that | is | removed | indefinitely | from | consideration
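The delimiter-based rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the patent: the function name `segment_english` is hypothetical, and the regular expression treats any run of non-word characters (spaces, punctuation, dashes) as a single word boundary.

```python
import re

def segment_english(text):
    """Split text at each contiguous run of spaces and/or punctuation marks."""
    # [\W_]+ matches one or more non-word characters (or underscores),
    # so "tabled\u2014that" and "consideration." are both split correctly.
    pieces = re.split(r"[\W_]+", text)
    # Drop empty strings produced by leading or trailing delimiters.
    return [word for word in pieces if word]

sentence = ("The motion was then tabled\u2014that is removed "
            "indefinitely from consideration.")
words = segment_english(sentence)
print(" | ".join(words))
```

The same approach fails on Chinese text, because there are no delimiters between words for the pattern to match.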
However, word segmentation is not always so straightforward. For example, in unsegmented languages, such as Chinese, a written sentence consists of a string of evenly spaced characters, with no marking between the words. This is because the Chinese language originated as a monosyllabic language, meaning that there was a separate Chinese character for each word. As the language developed, the requirement to add a new character for each new word became cumbersome. Thus, the language began to combine two or more existing characters to represent a new word, rather than developing a whole new character for it. Currently, the Chinese language has many polysyllabic words, which are commonly understood by those who speak the language.
However, due to the structure of Chinese words, there is not a commonly accepted standard for “wordhood” in Chinese. This problem is discussed at greater length in Duanmu, San (1997), Wordhood in Chinese, in J. Packard (ed.), New Approaches to Chinese Word Formation, Mouton de Gruyter. While native speakers of Chinese in most cases are able to agree on how to segment a string of characters into words, there are a substantial number of cases (perhaps 15-20% or more) where no standard agreement has been reached.
Not only do different people segment Chinese text differently, but it may also be desirable to segment the text differently for different applications. For example, in natural language processing applications such as information retrieval, it may be desirable to segment text in one way to improve precision and in a different way to improve recall.
Therefore, it has been very difficult, in the past, to provide a word segmentation component which meets the needs of individuals who do not agree on how unsegmented text should be segmented. This problem is exacerbated when one considers that the general word segmentation rules may desirably change from application to application.
SUMMARY OF THE INVENTION
The present invention segments a non-segmented input text. The input text is received and segmented based on parameter values associated with parameterized word formation rules.
In one illustrative embodiment, the input text is processed into a form which includes parameter indications, but which preserves the word-internal structure of the input text. Thus, the parameter values can be changed without entirely re-processing the input text.
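One way to picture the idea in the summary above is a small tree in which each multi-character word keeps its internal parts and is tagged with the name of the parameter that decides whether those parts stay merged. The following Python sketch is an assumption-laden illustration, not the patent's actual data structure: the class `Node`, the function `segment`, the parameter name `compound_noun`, and the example sentence are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    text: str                        # surface characters covered by this node
    param: Optional[str] = None      # hypothetical parameter governing this node
    children: List["Node"] = field(default_factory=list)

def segment(node: Node, params: Dict[str, bool]) -> List[str]:
    """Return the words under node for the given parameter values."""
    # A leaf, or a node whose parameter is switched on, is kept as one word.
    # A parameter missing from params defaults to True (keep merged).
    if not node.children or params.get(node.param, True):
        return [node.text]
    # Otherwise the preserved word-internal structure is exposed.
    words: List[str] = []
    for child in node.children:
        words.extend(segment(child, params))
    return words

def segment_text(nodes: List[Node], params: Dict[str, bool]) -> List[str]:
    return [w for node in nodes for w in segment(node, params)]

# Processed form of the input, built once (assumed to come from an earlier parse).
processed = [
    Node("我"),
    Node("在"),
    Node("北京大学", "compound_noun", [Node("北京"), Node("大学")]),
]

coarse = segment_text(processed, {"compound_noun": True})
fine = segment_text(processed, {"compound_noun": False})
print(coarse)
print(fine)
```

Because the tree is built once and only walked again when a parameter value changes, the two segmentations come from the same processed form without re-processing the input text.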


REFERENCES:
patent: 4850026 (1989-07-01), Jeng et al.
patent: 4887212 (1989-12-01), Zamora et al.
patent: 5029084 (1991-07-01), Morohasi et al.
patent: 5448474 (1995-09-01), Zamora
patent: 5454046 (1995-09-01), Carman, II
patent: 5473607 (1995-12-01), Hausman et al.
patent: 5651095 (1997-07-01), Ogden
patent: 5740549 (1998-04-01), Reilly et al.
patent: 5787197 (1998-07-01), Beigi et al.
patent: 5806021 (1998-09-01), Chen et al.
patent: 5850480 (1998-12-01), Scanion
patent: 5923778 (1999-07-01), Chen et al.
patent: 5933525 (1999-08-01), Makhoul et al.
patent: 5940532 (1999-08-01), Tanaka
patent: 6014615 (2000-01-01), Chen
patent: 6035268 (2000-03-01), Carus et al.
patent: 6173253 (2001-01-01), Abe et al.
patent: 6182029 (2001-01-01), Friedman
patent: 6363342 (2002-03-01), Shaw et al.
patent: 6374210 (2002-04-01), Chu
patent: 0 653 736 (1988-05-01), None
patent: 0 650 306 (1994-10-01), None
patent: WO 95/12955 (1995-05-01), None
patent: WO 97/17682 (1997-05-01), None
patent: WO 97/35402 (1997-09-01), None
Lua, et al “An application of information theory in Chinese word segmentation”, Computer processing of Chinese & Oriental Languages, pp. 1-9, 1994.*
Palmer, et al “Chinese word segmentation and information retrieval”, AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, pp. 1-6, 1997.*
Chen, et al. “Chinese text retrieval without using dictionary”, ACM, pp. 42-49, 1997.*
Chi, et al “Word segmentation and recognition for web document framework”, ACM, pp. 458-465, Jan. 1999.*
Ge, et al “Discovering Chinese words from un-segmented text”, ACM, pp. 271-272, Jan. 1999.*
Kuo, et al “A new method for the segmentation of mixed handprinted Chinese/English characters”, IEEE, pp. 810-813, 1993.*
Packard, Jerome L. (1998) New Approaches to Chinese Word Formation: Morphology, Phonology and the Lexicon in Modern and Ancient Chinese. Mouton de Gruyter, New York.
Ren, Xueliang (1981) Word Formation in Chinese. China Press of Social Sciences, Beijing.
Coates-Stephens “The Analysis and Acquisition of Proper Names for the Understanding of Free Text”, Computers and the Humanities, vol. 26, pp. 441-456, 1993.
Yhap, et al. “An On-Line Chinese Character Recognition System”, IBM J. Res. Develop. vol. 25, No. 3, pp. 187-189, May 1991.
Coates-Stephens “The Analysis and Acquisition of Proper Names for Robust Text Understanding”, Dept. of Computer Science, City University, London, England, Oct. 1992, pp. 1-8, 28-38, 113-133, and 200-206.
Chen et al., “Word Identification for Mandarin Chinese Sentences”, Proceedings of the 14th International Conference on Computational Linguistics (Coling '92), pp. 101-107, Nantes, France.
Chang et al., “A Multiple-Corpus Approach to Recognition of Proper Names in Chinese Texts”, Computer Processing of Chinese and Oriental Languages, Vol. 8, No. 1, Jun. 1994, pp. 75-85.
Yuan et al., “Splitting-Merging Model for Chinese Word Tokenization and Segmentation”, Department of Information Systems & Computer Sciences, National University of Singapore. No date.
Kok-Wee Gan et al., “A Statistically Emergent Approach for Language Processing: Application to Modeling Context Effects in Ambiguous Chinese Word Boundary Perception,” Computational Linguistics, vol. 22, No. 4, 1996, pp. 531-551.
Jin Guo, “Critical Tokenization and its Properties,” Computational Linguistics, vol. 23, No. 4, 1997, pp. 569-596.
Richard Sproat et al., “A Stochastic Finite-State Word-Segmentation Algorithm for Chinese,” Computational Linguistics, vol. 22, No. 3, 1996, pp. 376-404.
Zimin Wu et al., “Chinese Text Segmentation for Text Retrieval: Achievements and Problems,” Journal of the American Society for Information Science, vol. 44, No. 9, 1993, pp. 532-542.
