Patent
1995-08-03
1998-02-24
McElheny, Jr., Donald E.
395759, 395751, 395752, G06F 1720, G06F 1728
active
057219397
ABSTRACT:
An efficient method and apparatus for tokenizing natural language text minimizes required data storage and produces guaranteed incremental output. The identity relation on the text, Id(text), is composed with a tokenizer to create a finite state machine representing the possible tokenization paths. The tokenizer itself is in the form of a finite state transducer. The process is carried out in a breadth-first manner so that all possibilities are explored at each character position before progressing. Output is produced incrementally and occurs only when all paths collapse into one. Output may be delayed until a token boundary is reached. In this manner, the output is guaranteed and will not be retracted unless the text is globally ill-formed. Each time output is produced, storage space is freed for subsequent text processing.
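The breadth-first idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the toy transducer (`step`, with hypothetical states `START`, `WORD`, `NUM`, `AFTER_NUM`, `GROUP`), the `|` boundary marker, and the rule that "collapse" means exactly one live path remains are all assumptions made for illustration. The toy grammar treats a space between digits as part of a number ("10 000") and any other space as a token boundary, so a space after a digit is ambiguous until the next character is read.

```python
from typing import Iterator

# Hypothetical toy transducer (an assumption, not taken from the patent).
FINAL = {"START", "WORD", "NUM"}  # states in which the text may end

def step(state: str, ch: str) -> list[tuple[str, str]]:
    """Return every (output, next_state) arc for reading ch in state."""
    if ch.isalpha():
        return [(ch, "WORD")] if state in ("START", "WORD", "AFTER_NUM") else []
    if ch.isdigit():
        return [(ch, "NUM")] if state in ("START", "NUM", "GROUP") else []
    if ch == " ":
        if state == "WORD":
            return [("|", "START")]
        if state == "NUM":
            # Ambiguous: a token boundary, or a separator inside the number.
            return [("|", "AFTER_NUM"), (" ", "GROUP")]
    return []

def tokenize(text: str) -> Iterator[str]:
    """Yield output chunks as soon as only one tokenization path is alive."""
    paths = {("START", "")}  # live (state, pending output) pairs
    for pos, ch in enumerate(text):
        # Breadth-first: advance every live path over this one character.
        paths = {(nxt, pending + out)
                 for state, pending in paths
                 for out, nxt in step(state, ch)}
        if not paths:
            raise ValueError(f"no tokenization path survives position {pos}")
        if len(paths) == 1:
            # All paths have collapsed into one: the pending output is safe
            # to emit, and its storage can be released.
            ((state, pending),) = paths
            if pending:
                yield pending
            paths = {(state, "")}
    finals = [(s, p) for s, p in paths if s in FINAL]
    if len(finals) != 1:
        raise ValueError("text is ill-formed or ambiguous at its end")
    if finals[0][1]:
        yield finals[0][1]

print("".join(tokenize("10 000 apples")))
print(list(tokenize("10 a")))
```

In this sketch nothing is emitted for the space after "10" until the following character decides which path survives, so emitted chunks are never retracted. Delaying output until a token boundary, which the abstract mentions as an option, is not shown.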
REFERENCES:
patent: 5323316 (1994-06-01), Kadashevich et al.
patent: 5438511 (1995-08-01), Maxwell, III et al.
patent: 5477451 (1995-12-01), Brown et al.
patent: 5510981 (1996-04-01), Berger et al.
patent: 5594641 (1997-01-01), Kaplan et al.
Dialog File 88, Acc. No. 02265811: "Literate Programming: Weaving a Language-Independent Web," Van Wyk, Communications of the ACM, v. 32, No. 9, p. 1051 (5), Sep. 1989.
"Regular Models of Phonological Rule Systems", Ronald M. Kaplan et al., Computational Linguistics, vol. 20, No. 3.
"A Finite-State Architecture for Tokenization and Grapheme-to-Phoneme Conversion in Multilingual Text Analysis", Richard Sproat, From Text to Tags: Issues in Multilingual Language Analysis. Proceedings of the ACL SIGDAT Workshop. Mar. 27, 1995, Dublin, Ireland.
"A Stochastic Finite-State Word-Segmentation Algorithm For Chinese", Richard Sproat et al., 32nd Annual Meeting of the Association for Computational Linguistics. Jun. 27, 1994. Las Cruces, New Mexico.
McElheny Jr. Donald E.
Thomas Joseph
Xerox Corporation