Automatic segmentation of a text

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

Reexamination Certificate

active

06374210

ABSTRACT:

The invention relates to a method of segmenting a connected text into words, including the steps of reading an input string representing the connected text; identifying at least one sequence of isolated words in the input string by comparing the input string to words in a dictionary; and outputting at least one of the identified word sequences.
The invention farther relates to a system for segmenting a connected text into words; the system including means for reading an input string representing the connected text; means for identifying at least one sequence of isolated words in the input string by comparing the input string to words in a dictionary; and means for outputting at least one of the identified word sequences.
Increasingly advanced natural language processing techniques are used in data processing systems, such as speech processing systems, handwriting/optical character recognition systems, automatic translation systems, or for spell/grammar checking in word processing systems. Such systems frequently use statistical information relating to individual words or word sequences. The statistical information is obtained by analyzing large text corpora. For the analysis, individual words need to be identified in the text. In many languages, including the western languages, words are separated by boundary markers, such as a space or other punctuation marks, making identification easy. However, many other languages do not have boundary markers between words. Examples of such languages are many Asian languages such as Chinese, Japanese, and Hangul. Such languages are sometimes referred as agglutinative languages. Typically, these languages are written using special characters (“ideographs”), which each represent one or more syllables and usually a concept or meaningful unit. A word consists of one or more of these characters. A reader of a text in such a language must identify the boundaries of those words to make sense of the text. For many applications only one sequence of words must be identified.
From U.S. Pat. No. 5,448,474 a method and system for isolation of Chinese words from connected Chinese text is known. In this system a dictionary lookup process is performed, where all substrings of a text are identified. For each character of the text, it is checked for each word of the dictionary whether the word matches the text starting at that position. As an example, for the text “software”, at position
0
(first character of the text) a match is found for the words “so”, “soft”, and “software”; at position
1
for the words “of” and “oft”; at position
4
for “war” and “ware”; at position
5
for “a” and “are”; and at position
6
for “re”. For each match an entry is created in a table. The entry comprises the matching word, the position of the text at which the match starts, and the length of the word. If at a position no matching word is found, an entry is made in the table containing the individual character. In this way all possible words and unmatched characters are added to the table. Next, the number of entries in the table is reduced, based on criteria such as that a word must start adjacent to the end of a preceding word and must end adjacent to the start of a next word. Since in this way overlapping words (which are not adjacent) will be removed, parts of the text may not be covered by identified words. A separate restoration process is performed to correct undesired deletion of overlapping words, based on a criterion to keep the longest matching overlapping word. Finally, again all words are removed which are not adjacent the end or beginning of the text or another non-deleted word. The final outcome may contain several possible word sequences. Information regarding the frequency of occurrence of words in general texts may be used to select one of the sequences. For instance, a sequence with a two-character Chinese word may be selected above a same sequence with the two characters represented by two single-character words, since two character words are more common than single character words.
The known isolation procedure is complex and requires a restoration process to correct wrong deletions.
It is an object of the invention to provide a method and system of the kind set forth, which is more efficient.
To meet the object of the invention, the method is characterized in that the step of identifying at least one word sequence includes building a tree structure representing word sequence(s) in the input string in an iterative manner by:
taking the input string as a working string;
for each word of a dictionary:
comparing the word with a beginning of the working string; and
if the word matches the beginning of the working string:
forming a node in the tree representing the word;
associating with the node a part of the input string which starts at a position immediately adjacent an end position of the word; and
forming a sub-tree, linked to the node, representing word sequence(s) in the part of the input string associated with the node by using the associated part as the working string.
By building a tree structure, analyzing the input string automatically results in identifying only words which are adjacent to a preceding word. All identified word sequences of which the last word ends at the end of the input string are in principle possible. In this way words which are not possible (in view of preceding words) are not considered as candidates. This reduces the amount of data to be processed. Moreover, complex procedures of deleting words and reintroducing overlaps are not required. Segmenting the sample string “software” according to the invention results in a logical tree structure with two main branches, one branch having a single node representing the word “software” and one branch having two linked nodes representing the words “soft” and “ware” respectively. Consequently only three entries are required instead of 10 in the prior art system.
In an embodiment according to the invention as described in the dependent claim 2, a plurality of new words are added with different lengths if a predetermined criterion is met. By adding an unknown sequence of characters not just as single character words to a data structure, it becomes possible to identify multi-character new words in a simple way. This makes the procedure suitable for languages, such as Japanese, wherein many single characters do not represent a word. Furthermore, it allows identifying a multi-character word as the preferred new word in which case the single character words do not need to be entered to the dictionary. In this way it is avoided that the dictionary gets ‘polluted’ with single character words. Having many single character entries in a dictionary reduces the chance of a correct segmentation of the text into words. As an example, the text “thisbook” might get segmented into the sequence of words “t”, “his”, and “book” if the single character “t” is in the dictionary.
In an embodiment according to the invention as described in the dependent claim 3 such criterion is a global decision based on whether or not a word sequence could be identified using the existing dictionary. If no sequence could be identified, new words are added. This test may be performed by first building a tree structure using only known words of the existing dictionary, and after the tree has been build verifying whether at least one path represents a word sequence which matches the entire input string. The verification can be very simple by during the building of the tree structure setting a parameter (end-of-string-reached) as soon as a first path through the tree has reached the end of the input string.
In an embodiment as defined in the dependent claim 4, the new words are added to one or more end nodes of paths whose corresponding word sequences do not match the entire input string. Such nodes may simply be located by following paths through the tree and verifying if the end node of the path corresponds to the end of the input string (i.e. the words match as well as the location in the string. This can be checked in

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Automatic segmentation of a text does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Automatic segmentation of a text, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Automatic segmentation of a text will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2845490

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.