Reexamination Certificate
1998-02-04
2001-01-09
Thomas, Joseph (Department: 2747)
Data processing: speech signal processing, linguistics, language
Linguistics
Natural language
C704S010000, C707S793000
active
06173252
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates to sentence segmentation and, more specifically, to apparatus and methods for segmenting a Chinese sentence to detect errors in a Chinese text file.
2. Discussion of Related Prior Art
As computers become more powerful and prevalent, they are relied upon to perform an ever-increasing number of tasks. One such task is the detection of errors in a Chinese text file (hereinafter referred to as “Chinese error check”).
Errors in a Chinese text file are generally the result of the following: keyboard entry errors, primarily caused by the same or similar input code (e.g., coded by pronunciation or stroke information); commonly committed errors due to insufficient knowledge (e.g., many people may regard “[…]” as a correct word when, in fact, the correct word should be “[…]”); and grammatical errors (e.g., “[…]” should be “[…]”; this is the simplest one of its kind).
General approaches to error detection in a Chinese text file include the following three methods: lookup tables; a grammatical rule based method; and a statistical method. The first two methods have their shortcomings. With the first method, no matter how big the table is, only a small fraction of errors can be included. Moreover, many errors are context dependent, so attempting to identify them by a simple table comparison will likely result in wrongful identification. As for the second method, the complexity and irregularity of Chinese grammar mean that it can only serve as a supplement to another method. The third, statistical method, by contrast, is practical and in frequent use today.
In the third method, potential errors are detected based on statistical information pertaining to either the collocation of characters and words or the characters and words themselves. The information is derived from a corpus. Since there is no natural word boundary in Chinese text, it is necessary to implement sentence segmentation. To segment a sentence, a dictionary is necessary. Traditionally, segmentation has been done non-statistically, by matching a string of characters in a sentence with the longest word in a dictionary. However, this longest-match method does not and, in fact, is unable to treat ambiguities.
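The traditional longest-match approach can be sketched as follows. This is a minimal illustration with forward scanning, not code from the patent: the function name `longest_match_segment`, the `max_len` parameter, and the toy dictionary are all assumptions made for the example.

```python
def longest_match_segment(sentence, dictionary, max_len=4):
    """Forward longest-match segmentation (illustrative sketch).

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    units = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:
                units.append(candidate)
                i += size
                break
    return units


# Hypothetical toy dictionary, chosen only to expose an ambiguity.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
greedy_units = longest_match_segment("研究生命起源", toy_dictionary)
print(greedy_units)  # ['研究生', '命', '起源']
```

Because the scan commits to the longest word at each step, it picks 研究生 (graduate student) and strands 命 as a lone character, even though the reading 研究 / 生命 / 起源 (study the origin of life) uses only dictionary words. This is the kind of ambiguity the longest-match method cannot resolve.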
With the rapid development of computers, segmentation using statistical information about words is becoming increasingly popular. This method requires frequency information for each entry of the dictionary. The frequency information is a figure (hereinafter referred to as a “weight”) that represents the probability of a word appearing in the corpus. A method known as dynamic programming is used to determine the most probable segmentation based on the dictionary and the frequency information. The most probable segmentation is a partition such that the product of the weights of all its segmentation units is the largest among all possible ways of partitioning. It should be emphasized that the dynamic programming method is usually used in segmentation or part of speech tagging. Thus, all of the resulting segmentation units are entries of the dictionary in use.
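A minimal sketch of such a dynamic programming search is shown below. It is an illustration under stated assumptions rather than the patented implementation: the function name, the `max_len` limit, and every weight in `toy_weights` are hypothetical. Summing logarithms of weights is used in place of multiplying them; the two are equivalent for finding the maximum, and logarithms avoid numerical underflow on long sentences.

```python
import math


def most_probable_segmentation(sentence, weights, max_len=4):
    """Return the partition whose product of unit weights is largest.

    best[end] holds (best log-score of sentence[:end], start of last unit).
    Returns None if the sentence cannot be covered by dictionary entries.
    """
    n = len(sentence)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            unit = sentence[start:end]
            if unit in weights and best[start][0] > -math.inf:
                score = best[start][0] + math.log(weights[unit])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None
    # Walk the back-pointers from the end of the sentence to recover the units.
    units = []
    end = n
    while end > 0:
        start = best[end][1]
        units.append(sentence[start:end])
        end = start
    return units[::-1]


# Hypothetical weights; a real system would estimate them from a corpus.
toy_weights = {"研究": 0.02, "研究生": 0.005, "生命": 0.01, "起源": 0.004, "命": 0.0005}
dp_units = most_probable_segmentation("研究生命起源", toy_weights)
print(dp_units)  # ['研究', '生命', '起源']
```

With these toy weights the partition 研究 / 生命 / 起源 has product 0.02 × 0.01 × 0.004 = 8e-7, while 研究生 / 命 / 起源 has product 0.005 × 0.0005 × 0.004 = 1e-8, so the former is selected. Every returned unit is a dictionary entry, as the text notes.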
The prior art includes two different methods for detecting errors in a Chinese text file using the statistical approach. In the first method, the sentence to be checked is not segmented. Instead, bigram statistical information (the weights) of the Chinese characters is applied directly to the collocation of any two successive characters of the sentence. Any two successive characters having a bigram weight smaller than a predetermined threshold are regarded as a potential error. Otherwise, they are considered a legitimate collocation.
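The first prior-art method amounts to a sliding window over character pairs. The sketch below is illustrative only: the function name, the bigram weights, and the threshold value are assumptions, and unseen pairs are treated as having weight zero.

```python
def flag_low_bigrams(sentence, bigram_weights, threshold):
    """Flag successive character pairs whose bigram weight is below threshold.

    Returns a list of (position, pair) tuples for the potential errors.
    Pairs missing from the table are assumed to have weight 0.0.
    """
    flagged = []
    for i in range(len(sentence) - 1):
        pair = sentence[i:i + 2]
        if bigram_weights.get(pair, 0.0) < threshold:
            flagged.append((i, pair))
    return flagged


# Hypothetical bigram weights and threshold.
toy_bigrams = {"天天": 0.003, "天上": 0.001, "上班": 0.004}
suspects = flag_low_bigrams("天天上斑", toy_bigrams, threshold=0.0005)
print(suspects)  # [(2, '上斑')]
```

Here the mistyped pair 上斑 is absent from the table and is flagged, while the same call on the correctly written 天天上班 flags nothing.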
The second method consists of three main steps. First, a segmentation is implemented according to a given dictionary. The traditional longest match method with forward or backward scanning is usually adopted. Second, if predefined error libraries exist, neighboring segmentation units are recombined. A searching process will then determine if there are any matches with the entries of the predefined error libraries in the recombined units. Such matches will be regarded as potential errors. Third, for lone characters left out after such analysis (lone characters are those that stand alone in a resulting segmentation unit), a predefined threshold is applied. If the stand-alone weight of a lone character, derived from a corpus, is smaller than the threshold, the lone character will be regarded as a potential error.
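Steps two and three of this second method can be sketched as a post-processing pass over units that have already been segmented. The code is a rough illustration, not the prior-art implementation: the function name, the `max_span` limit on how many neighboring units are recombined, the error-library entry, and all weights are assumptions.

```python
def check_segmented_units(units, error_library, lone_weights, threshold, max_span=4):
    """Apply an error library and a lone-character threshold to segmented units.

    Step 2: recombine runs of 2..max_span neighboring units and look each
            recombined string up in the predefined error library.
    Step 3: for lone characters not covered by a library match, compare the
            stand-alone weight with the threshold (missing weight = 0.0).
    """
    findings = []
    covered = set()
    for start in range(len(units)):
        for stop in range(start + 2, min(start + max_span, len(units)) + 1):
            combined = "".join(units[start:stop])
            if combined in error_library:
                findings.append(("library", combined))
                covered.update(range(start, stop))
    for index, unit in enumerate(units):
        if index in covered or len(unit) != 1:
            continue
        if lone_weights.get(unit, 0.0) < threshold:
            findings.append(("lone", unit))
    return findings


# Hypothetical error library and stand-alone weights.
toy_errors = {"按步就班"}
toy_lone_weights = {"他": 0.01, "地": 0.02, "上": 0.01, "斑": 0.0001}

library_hit = check_segmented_units(
    ["他", "按", "步", "就", "班", "地", "工作"], toy_errors, toy_lone_weights, 0.001)
lone_hit = check_segmented_units(
    ["天天", "上", "斑"], toy_errors, toy_lone_weights, 0.001)
print(library_hit)  # [('library', '按步就班')]
print(lone_hit)     # [('lone', '斑')]
```

In the first call the four lone characters recombine into an error-library entry and are reported once as a library match; in the second call the lone character 斑 falls below the threshold and is reported on its own.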
In some research papers, the dynamic programming method was used to implement segmentation for Chinese sentences in terms of a regular dictionary with statistical information for each entry. However, this method is not suitable for the task of detecting errors in a Chinese text file. This is because the dynamic programming method is only used on “regular” words of the dictionary. Pre-defined errors (common errors committed by ordinary people), names, numbers, measure words, etc., are treated separately. The order in processing these different units may lead to distinct segmentation units. Classes can get entangled such that the leading or end character of a class not yet treated may be bound to other characters to form a unit of another class that is being treated. This entanglement results in erroneous segmentation, leading to a lower error detection rate and, more particularly, to a higher false alarm rate. For example, given the sentence: “[…]” (Li Da-Ming goes to work every day), the correct segmentation should be: “[…]”. However, according to the prior art, it would be segmented as follows: “[…]”. Since “[…]” is not a popular name, it may be spotted as a possible error. In particular, if this situation occurs with respect to a pre-defined error (that is, the predefined error is not segmented as a segmentation unit), the error may not be detected.
Thus, in implementing the statistical method, it would be highly advantageous to have all segmentation units determined uniformly in terms of statistical information derived from a corpus. In this way, all the classes (e.g., regular words, pre-defined errors, names, numbers, and measure words) would be treated on equal footing.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide a method for Chinese sentence segmentation that treats commonly committed error strings, names of people, places and organizations, numbers, and combinations of numbers and measure words as ordinary segmentation units along with words in a regular dictionary.
Another object of the present invention is to provide a Chinese error check system having the highest error detection rate while keeping the false alarm rate the lowest, relative to conventional systems utilized for similar applications.
In one aspect of the present invention, a method of segmenting a Chinese sentence comprises: defining a plurality of classes for segmentation, along with words in a regular dictionary; assigning weights to the classes, relative to that of the words in the regular dictionary; and selecting a segmentation output conformable to a certain condition by means of dynamic programming.
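One way to picture this aspect is a single dynamic programming pass in which dictionary words and special classes compete on the same weight scale. The sketch below is an assumption-laden illustration, not the claimed implementation: the selection condition is taken to be the largest product of weights, the only special class shown is a crude person-name matcher, and the surname list, class weight, dictionary weights, and example sentence (an assumed rendering of "Li Da-Ming goes to work every day") are all hypothetical.

```python
import math

# Hypothetical surname list and flat class weight.
SURNAMES = {"李"}
NAME_WEIGHT = 0.0005


def name_weight(unit):
    """Weight for units shaped like a surname plus 1-2 given-name characters."""
    if 2 <= len(unit) <= 3 and unit[0] in SURNAMES:
        return NAME_WEIGHT
    return None


def segment_with_classes(sentence, dictionary, class_matchers, max_len=4):
    """Pick the partition with the largest product of weights, letting
    dictionary words and special classes compete in one DP pass.

    Returns a list of (unit, label) pairs, or None if no partition exists.
    """
    n = len(sentence)
    best = [(-math.inf, None, None)] * (n + 1)  # (log-score, start, label)
    best[0] = (0.0, None, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start][0] == -math.inf:
                continue
            unit = sentence[start:end]
            candidates = []
            if unit in dictionary:
                candidates.append(("word", dictionary[unit]))
            for label, matcher in class_matchers.items():
                weight = matcher(unit)
                if weight is not None:
                    candidates.append((label, weight))
            for label, weight in candidates:
                score = best[start][0] + math.log(weight)
                if score > best[end][0]:
                    best[end] = (score, start, label)
    if best[n][0] == -math.inf:
        return None
    units, end = [], n
    while end > 0:
        _, start, label = best[end]
        units.append((sentence[start:end], label))
        end = start
    return units[::-1]


# Hypothetical dictionary weights.
toy_dictionary = {"李": 0.001, "大": 0.005, "明": 0.001, "天": 0.003,
                  "明天": 0.002, "天天": 0.001, "上班": 0.002}
unified = segment_with_classes("李大明天天上班", toy_dictionary, {"name": name_weight})
print(unified)  # [('李大明', 'name'), ('天天', 'word'), ('上班', 'word')]
```

With the name class in the same search, the three-character name is kept whole and 天天 and 上班 follow as ordinary words. Running the same function with an empty `class_matchers` and the same toy weights yields 李 / 大 / 明天 / 天 / 上班, where the name's last character is absorbed into 明天 (tomorrow), which is the kind of entanglement described in the background.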
In another aspect of the present invention, a Chinese error check system comprises: an input device for inputting a sentence to be checked; a regular dictionary storing device for storing regular words and their weights; a special segmentation classes storing device for storing special segmentation classes and their weights; a segmentation device for segmenting an inputted sentence by retrieving the contents of the regular dictionary storing device and the special segmentation class storing device and employing a dynamic programming method to select a most probable segmentation of the inputted sentence; a lone character bigram table storage device for storing the lone character bigram table, the table having the probability of Chinese character pairs being adjacent lone character pairs stored therein; and a segmentation results processing device operatively coupled to the lone character bigram table storage device an
Qiu Zhaoming
Yang Liping
F. Chau & Associates LLP
International Business Machines Corp.
Thomas Joseph