Method and apparatus for tokenizing text

Patent

Details

395/759, 395/751, 395/752, G06F 17/20, G06F 17/28

Patent

active

057219397

ABSTRACT:
An efficient method and apparatus for tokenizing natural-language text minimizes the data storage required and produces guaranteed incremental output. The identity relation on the text, Id(text), is composed with a tokenizer to create a finite-state machine representing the possible tokenization paths. The tokenizer itself takes the form of a finite-state transducer. The composition is carried out breadth-first, so that all possibilities are explored at each character position before moving on to the next. Output is produced incrementally, and only when all paths collapse into one; it may additionally be delayed until a token boundary is reached. Output produced this way is guaranteed: it will not be retracted unless the text is globally ill-formed. Each time output is produced, storage space is freed for subsequent text processing.
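
The breadth-first scheme in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the six-state toy transducer, the "|" boundary marker, the capitalization heuristic for periods, and the names step and tokenize_incremental are all assumptions made for the example. The sketch also approximates "all paths collapse into one" by taking the longest common prefix of the pending outputs of the surviving paths, rather than by detecting a single shared state in the composed machine.

```python
from os.path import commonprefix  # longest common character prefix of a list of strings

# Hypothetical toy transducer (an assumption, not from the patent).
# States: B = at a token boundary, W = inside a word,
#         A / AS = after an abbreviation period / and a following space,
#         P / PS = after a sentence-final period / and a following space.
START = "B"
FINAL = {"B", "W", "P", "PS"}

def step(state, ch):
    """Return every (output, next_state) pair the toy transducer allows."""
    if ch.isalpha():
        if state in ("B", "W"):
            return [(ch, "W")]
        if state == "AS" and ch.islower():   # abbreviation: sentence continues
            return [(ch, "W")]
        if state == "PS" and ch.isupper():   # sentence end: new sentence starts
            return [(ch, "W")]
        return []
    if ch == " ":
        table = {"W": [("|", "B")], "B": [("", "B")],
                 "A": [("|", "AS")], "P": [("|", "PS")]}
        return table.get(state, [])
    if ch == "." and state == "W":
        # Ambiguous: keep the period attached, or split it off as its own token.
        return [(".", "A"), ("|.", "P")]
    return []

def tokenize_incremental(text):
    """Yield tokens as soon as every surviving path agrees on them."""
    configs = {(START, "")}                  # (state, output not yet emitted)
    for ch in text:
        # Breadth-first: advance every path by one character before moving on.
        nxt = {(dest, pending + out)
               for state, pending in configs
               for out, dest in step(state, ch)}
        if not nxt:
            raise ValueError("text is ill-formed for this tokenizer")
        common = commonprefix([pending for _, pending in nxt])
        cut = common.rfind("|") + 1          # delay output to a token boundary
        if cut:
            for token in common[:cut].split("|"):
                if token:
                    yield token
            # Emitted output is dropped, freeing storage for later text.
            nxt = {(state, pending[cut:]) for state, pending in nxt}
        configs = nxt
    finals = [pending for state, pending in configs if state in FINAL]
    if len(finals) != 1:
        raise ValueError("text does not end in a unique tokenization")
    for token in finals[0].split("|"):
        if token:
            yield token
```

For example, list(tokenize_incremental("see etc. and end.")) holds back "etc" while both readings of the period are alive and releases "etc." only once the lowercase "a" eliminates the sentence-final reading. Because the function is a generator, tokens already yielded are never revised; an ill-formed input raises ValueError instead.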

REFERENCES:
patent: 5323316 (1994-06-01), Kadashevich et al.
patent: 5438511 (1995-08-01), Maxwell, III et al.
patent: 5477451 (1995-12-01), Brown et al.
patent: 5510981 (1996-04-01), Berger et al.
patent: 5594641 (1997-01-01), Kaplan et al.
Dialog File 88, Acc. No. 02265811: "Literate Programming: Weaving a Language-Independent Web," Van Wyk, Communications of the ACM, v. 32, No. 9, p. 1051 (5), Sep. 1989.
"Regular Models of Phonological Rule Systems", Ronald M. Kaplan et al., Computational Linguistics, vol. 20, No. 3.
"A Finite-State Architecture for Tokenization and Grapheme-to-Phoneme Conversion in Multilingual Text Analysis", Richard Sproat, From Text to Tags: Issues in Multilingual Language Analysis. Proceedings of the ACL SIGDAT Workshop. Mar. 27, 1995, Dublin, Ireland.
"A Stochastic Finite-State Word-Segmentation Algorithm For Chinese", Richard Sproat et al., 32nd Annual Meeting of the Association for Computational Linguistics. Jun. 27, 1994. Las Cruces, New Mexico.
