Method and apparatus for automatic identification of word...

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Reexamination Certificate

Details

active

06185524

ABSTRACT:

TECHNICAL FIELD
The present invention relates to the identification of word boundaries in continuous text.
BACKGROUND ART
The identification of word boundaries in continuous text is used in several areas such as word processing, text processing, machine translation, fact extraction, and information retrieval. Prior art methods for identifying word boundaries have used various approaches including whole words; word-initial and word-final n-grams and their frequencies; or a hidden Markov model of n-grams, word boundaries and their frequencies.
The article J. Guo, “An Efficient and Complete Algorithm for Unambiguous Word Boundary Identification”, formerly found at http://sunzi.iss.nus.sg:1996/guojin/papers/acbci/acbci.html and as referenced in J. Guo, “A Comparative Experimental Study on English and Chinese Word Boundary Ambiguity,” Proceedings of the International Conference on Chinese Computing 96 (ICC 96) June 4-7, 1996 Singapore (National University of Singapore, Singapore), pp. 50-55, discloses a method which uses whole words implemented by an Aho-Corasick finite-state automaton. Another prior art method which uses a dictionary of whole words is U.S. Pat. No. 5,448,474, “Method for isolation of Chinese words from connected text”. The foregoing references are herein incorporated by reference. Methods that use whole words or entire vocabularies have three disadvantages. First, they require a large amount of storage space. Second, only words included in the dictionary can be identified. Third, they cannot rank competing word boundary candidates or establish the best word boundary among them.
Several methods have attempted to overcome the problems presented by using a dictionary of whole words. In U.S. Pat. No. 5,806,021, “Automatic Segmentation of Continuous Text Using Statistical Approaches,” Chen et al., a method is disclosed which uses two statistical methods. First, forward and backward matching is performed using a vocabulary with unigram frequencies. Then, a score is calculated using statistical language models. Another prior art method uses a combination of rules, statistics and a dictionary. (See U.S. Pat. No. 5,029,084, “Japanese Language Sentence Dividing Method and Apparatus”, Morohasi et al.) The foregoing references are herein incorporated by reference.
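To make the idea of forward and backward matching concrete, the following is a minimal sketch of greedy longest-match segmentation in both directions. It is not the method of the cited patent: the function names, the toy vocabulary, and the single-character fallback are assumptions made for illustration, and the unigram frequencies and language-model scoring mentioned above are omitted.

```python
def forward_match(text, vocab, max_len):
    """Greedy longest-match segmentation, scanning left to right."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumed fallback: an unknown single character is its own word.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_match(text, vocab, max_len):
    """Greedy longest-match segmentation, scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


# Hypothetical vocabulary chosen so the two directions disagree.
VOCAB = {"ab", "abc", "bcd", "cd"}
forward_result = forward_match("abcd", VOCAB, max_len=3)
backward_result = backward_match("abcd", VOCAB, max_len=3)
```

When the two directions disagree, as they do here, some further criterion is needed to choose between the competing segmentations.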
SUMMARY OF THE INVENTION
In accordance with an embodiment of the invention, a method for identifying word boundaries in continuous text comprises: (a) comparing the continuous text to a set of varying length strings to identify candidate word-initial boundaries and candidate word-final boundaries in the continuous text, each candidate word-initial boundary and candidate word-final boundary having an associated probability value; and (b) identifying each candidate word boundary in the continuous text by calculating a word boundary score for such candidate word boundary using the probability values associated with the candidate word-initial boundaries and candidate word-final boundaries identified in step (a). The set of varying length strings may include words. In a preferred embodiment, the set of varying length strings includes words and n-grams. In a further preferred embodiment for the English language, the words are one- and two-character words and the n-grams are trigrams.
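Steps (a) and (b) can be illustrated with a small sketch. Everything specific in it is an assumption: the probability tables are invented, only trigrams are used (the one- and two-character words of the preferred English embodiment are left out), and the score is taken to be the product of the word-final and word-initial probabilities with a fixed threshold. The text above does not state how the two values are combined.

```python
# Hypothetical tables: probability that a trigram begins / ends a word.
INITIAL = {"the": 0.9, "cat": 0.8, "sat": 0.7}
FINAL = {"the": 0.85, "cat": 0.9, "sat": 0.8}


def candidate_boundaries(text, initial, final, n=3):
    """Step (a) and (b): map each candidate boundary position to a score."""
    scores = {}
    for pos in range(1, len(text)):
        ends = text[max(0, pos - n):pos]   # string ending at the boundary
        begins = text[pos:pos + n]         # string beginning at the boundary
        p_final = final.get(ends, 0.0)
        p_initial = initial.get(begins, 0.0)
        score = p_final * p_initial        # assumed combination rule
        if score > 0:
            scores[pos] = score
    return scores


def segment(text, initial, final, threshold=0.5):
    """Cut the text at every candidate whose score reaches the threshold."""
    scores = candidate_boundaries(text, initial, final)
    cuts = [pos for pos, s in sorted(scores.items()) if s >= threshold]
    pieces, start = [], 0
    for cut in cuts + [len(text)]:
        pieces.append(text[start:cut])
        start = cut
    return pieces


scores = candidate_boundaries("thecatsat", INITIAL, FINAL)
pieces = segment("thecatsat", INITIAL, FINAL)
```

Because each candidate carries a numeric score, competing boundaries can be ranked, which is what the whole-word dictionary methods described in the background cannot do.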
In another embodiment, the probability value associated with a candidate word-initial boundary is the probability that the string, beginning with the candidate word-initial boundary, begins a word. The probability value associated with a candidate word-final boundary is the probability that the string, ending with the candidate word-final boundary, ends a word. In a further embodiment, the method further includes verifying each segment defined by the candidate word boundaries identified in step (b) against a string database.
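One plausible way to obtain such probability values, offered here only as an assumption, is to count occurrences in a corpus that has already been segmented into words: the probability that a trigram begins a word is the number of times it occurs word-initially divided by the number of times it occurs anywhere in the unsegmented text, and likewise for word-final position. The function name `estimate` and the toy corpus are hypothetical.

```python
from collections import Counter


def estimate(words, n=3):
    """Estimate word-initial and word-final probabilities for n-grams."""
    text = "".join(words)  # the same corpus with word boundaries removed
    total = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    starts = Counter(w[:n] for w in words if len(w) >= n)
    ends = Counter(w[-n:] for w in words if len(w) >= n)
    p_initial = {g: starts[g] / total[g] for g in starts}
    p_final = {g: ends[g] / total[g] for g in ends}
    return p_initial, p_final


# Toy segmented corpus (hypothetical).
p_initial, p_final = estimate(["the", "cat", "sat", "then", "the"])
```

In this toy corpus "the" occurs three times in the running text, always at the start of a word, but only twice at the end of a word, so its word-initial estimate is higher than its word-final estimate.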
In accordance with a further embodiment of the invention, a device for identifying word boundaries in continuous text comprises: a string comparator, to identify candidate word-initial boundaries and candidate word-final boundaries in the continuous text by comparing the continuous text to a set of varying length strings, each candidate word-initial boundary and candidate word-final boundary having an associated probability value; and a boundary checker, coupled to the string comparator, to identify each candidate word boundary in the continuous text by calculating a word boundary score for such candidate word boundary using the probability values associated with the candidate word-initial boundaries and candidate word-final boundaries identified by the string comparator. In a further embodiment, the device further comprises a string database and a chart parser, coupled to the boundary checker, to verify each segment defined by the candidate word boundaries identified by the boundary checker against the string database.
The set of varying length strings may include words. In a preferred embodiment, the set of varying length strings includes words and n-grams. In a further preferred embodiment for the English language, the words are one- and two-character words and the n-grams are trigrams. In another embodiment, the probability value associated with a candidate word-initial boundary is the probability that the string, beginning with the candidate word-initial boundary, begins a word. The probability value associated with a candidate word-final boundary is the probability that the string, ending with the candidate word-final boundary, ends a word.
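The verification of segments against a string database can be sketched as follows. This is a simplified stand-in for the chart parser named above, not a chart parser proper: it fills a table indexed by boundary position and keeps one segmentation per position. The function name `verify`, the set-based database, and the example boundaries are assumptions for illustration.

```python
def verify(text, boundaries, database):
    """Return a segmentation that cuts only at candidate boundaries and
    whose segments are all in the database, or None if there is none."""
    inner = {b for b in boundaries if 0 < b < len(text)}
    points = sorted(inner | {0, len(text)})
    reached = {0: []}  # position -> verified words leading up to it
    for end in points[1:]:
        for start in points:
            if start >= end:
                break
            segment = text[start:end]
            if start in reached and segment in database:
                reached[end] = reached[start] + [segment]
                break
    return reached.get(len(text))


# Position 5 is a spurious candidate that the database check rejects.
verified = verify("thecatsat", [3, 5, 6], {"the", "cat", "sat"})
rejected = verify("thecatsat", [3, 5, 6], {"the", "cat"})
```

A spurious candidate boundary is harmless as long as some path through the remaining boundaries yields only database entries. If no such path exists, the sketch returns `None`.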
In accordance with another further embodiment, a digital storage medium is encoded with instructions which, when loaded into a computer, may establish any of the devices previously discussed.


REFERENCES:
patent: 4750122 (1988-06-01), Kaji et al.
patent: 5029084 (1991-07-01), Morohasi et al.
patent: 5040218 (1991-08-01), Vitale et al.
patent: 5146405 (1992-09-01), Church
patent: 5448474 (1995-09-01), Zamora
patent: 5488719 (1996-01-01), Kaplan et al.
patent: 5721939 (1998-02-01), Kaplan
patent: 5806021 (1998-09-01), Chen et al.
patent: 5926784 (1999-07-01), Richardson et al.
patent: 5949961 (1999-09-01), Sharman
patent: 5960385 (1999-09-01), Skiena et al.
patent: 5999896 (1999-12-01), Richardson et al.
patent: 6035268 (2000-03-01), Carus et al.
Harry Tennant: “Case Study: The Chart Parser”, Natural Language Processing, Petrocelli book, New York/Princeton Press, pp. 75-101, 1981.
Bates, et al.: “Recognizing Substrings of LR(k) Languages in Linear Time”, ACM Transactions on Programming Languages and Systems, vol. 16, No. 3, pp. 1051-1077, May 1994.
EL Guedjo, P.O., et al.: “A Chart parser to Analyze Large Medical Corpora”, Proceedings of the 16th Annual Inter. Conf. of IEEE Eng. in Med. & Biol. Soc., vol. 2, pp. 1404-1405, Nov. 1994.
“Efficient String Matching: An Aid to Bibliographic Search”, Aho and Corasick, Bell Laboratories, Communications of the ACM, Jun. 1975, vol. 18, No. 6, pp. 333-340.
“The N-Best Algorithm: An Efficient and Exact Procedure for Finding The N Most Likely Sentence Hypotheses”, Schwartz et al., BBN Systems and Technologies Corp., 1990 IEEE.
“A Statistical Method for Finding Word Boundaries in Chinese Text”, Sproat and Shih, Computer Processing of Chinese & Oriental Languages, vol. 4, No. 4, Mar. 1990.
“A Stochastic Finite-State Word-Segmentation Algorithm For Chinese”, Sproat et al., 32nd Annual Meeting of the Association for Computational Linguistics (Jun. 27, 1994, Las Cruces, New Mexico) (1994).
“An Efficient and Complete Algorithm for Unambiguous Word Boundary Identification”, Jin, G., http://sunzi.iss.nus.sg:1996/guojin/papers/acbci/acbci.html.

