Method and apparatus for learning the morphology of a...

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Reexamination Certificate

active

06405161

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates generally to a method and apparatus for learning the morphology of a natural language. More particularly, the present invention relates to the unsupervised acquisition, by automatic means such as a computer, of the morphology of languages, especially European languages.
The morphology of a language describes the structure of the words which form that language. The structure consists, generally, of stems or roots along with affixes such as prefixes and suffixes which modify the stem. Morphology involves collecting and organizing stems and associated affixes.
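As a purely illustrative sketch of what "collecting and organizing stems and associated affixes" can look like in practice, the Python fragment below groups a small word list by candidate stem and records the suffixes each stem occurs with. The function name `group_suffixes_by_stem`, the sample words, and the hand-supplied suffix list are all assumptions made for this example; in particular, supplying the suffixes by hand is a simplification and is not the unsupervised method to which the invention relates.

```python
from collections import defaultdict


def group_suffixes_by_stem(words, suffixes):
    """Map each candidate stem to the sorted list of suffixes it occurs with.

    Hypothetical helper for illustration only: the suffix inventory is
    given by hand here, whereas an unsupervised learner would have to
    discover it from the corpus.
    """
    word_set = set(words)
    stems = defaultdict(set)

    # Split every word at every known suffix it ends with.
    for word in word_set:
        for suffix in suffixes:
            if suffix and word.endswith(suffix) and len(word) > len(suffix):
                stems[word[: -len(suffix)]].add(suffix)

    # A stem that also occurs as a word on its own takes the empty suffix.
    for stem in stems:
        if stem in word_set:
            stems[stem].add("")

    # Keep only stems attested with at least two suffixes; this drops
    # accidental splits such as "s" + "ing" for "sing".
    return {stem: sorted(sfx) for stem, sfx in stems.items() if len(sfx) >= 2}


words = ["jump", "jumps", "jumped", "jumping", "walk", "walks", "walked", "sing"]
table = group_suffixes_by_stem(words, ["s", "ed", "ing"])
print(table)
```

In this toy example the word "sing" is not wrongly analyzed as stem "s" plus suffix "ing", because that candidate stem is attested with only one suffix. The threshold of two suffixes is an arbitrary choice for the sketch.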
Development of morphologies of languages is a desirable goal. Previously, a morphology for a language has been developed by a trained linguist working manually to identify the appropriate stems and affixes and other structural features of a language. Such a project requires several man-weeks or more to accomplish. A better solution would involve unsupervised learning by automatic means, such as a programmed general purpose computer operating only on an input of a large corpus of a language.
Developing an unsupervised learner using raw text as its sole input offers several attractive aspects, both theoretical and practical. At its most theoretical, unsupervised learning constitutes a (partial) linguistic theory, producing a completely explicit relationship between data and analysis of that data. A tradition of considerable age in linguistic theory sees the ultimate justification of an analysis A of any single language L as residing in our ability to show that analysis A derives from a particular linguistic theory LT, and that LT works properly across a range of languages (that is, not just for language L).
The development of a fully automated morphology-generating tool would be of considerable interest. Good morphologies of many European languages are yet to be developed. With the advent of considerable historical text available online (such as the ARTFL database of historical French), it is of great interest to develop morphologies of particular stages of a language, and automatic morphology-writing can simplify this process considerably where there are no native speakers available. Such a system can also be used as a stemmer in the context of an information retrieval system, identifying stems and multiplets of related stems.
A third motivation for developing such a system is that it can serve as a preparatory phase for an unsupervised grammar acquisition system. A significant proportion of the words in a large corpus can be assigned to categories, though the labels that are assigned by the morphological analysis are corpus-internal; nonetheless, the assignment of words into distinct morphologically motivated categories can be of great service to a syntax-acquisition device.
Development of language morphologies has value outside of purely linguistic endeavors. In many word processing programs, spell checking routines cannot properly recognize all forms of some words. As such routines are expanded to include new languages, the morphology of the new languages is necessary to efficiently implement the routine. The same is true of speech recognition routines.
There has been much less work done on automatic acquisition in the area of morphology than in the areas of syntactic analysis or part-of-speech tagging. Several researchers have explored the morphophonologies of natural language in the context of two-level systems in the style of the model developed by Kimmo Koskenniemi [Koskenniemi, Kimmo. Two-level Morphology: A General Computational Model for Word-form Recognition and Production, Publication no. 11, Department of General Linguistics, Helsinki, University of Helsinki (1983)], Lauri Karttunen [Karttunen, Lauri. Finite State Constraints. In John Goldsmith (ed.), The Last Phonological Rule, pp. 173-194, Chicago, University of Chicago Press (1993)], M. R. Brent [Brent, M. R. Minimal Generative Models: A Middle Ground Between Neurons and Triggers, in Proceedings of the 15th Annual Conference of the Cognitive Science Society, pp. 28-36, Hillsdale, N.J., Lawrence Erlbaum Associates] and others. Morphophonology is the study of the changes in the realization of a given morpheme that are dependent on the context in which it appears. This is for the most part the problem of the treatment of regular allomorphy (also known as automatic morphophonology), that is, the treatment of the principles that determine how two surface forms may be accounted for as realizations of a single string of characters that is subject to rules modifying those characters in particular contexts.
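To make the notion of regular allomorphy concrete, the following minimal Python sketch treats the English plural as a single suffix whose written form is modified by context-dependent rules. The function name `pluralize` and the three orthographic rules are assumptions chosen for this illustration; they are a simplified spelling-level approximation and are not drawn from the two-level systems cited above.

```python
import re


def pluralize(stem):
    """Attach the English plural suffix, adjusting characters by context.

    Illustrative only: three simplified spelling rules stand in for the
    kind of context-sensitive rules a morphophonological analysis states.
    """
    # After a sibilant spelling, the suffix surfaces as "-es".
    if re.search(r"(s|x|z|ch|sh)$", stem):
        return stem + "es"
    # A consonant followed by final "y" changes to "ie" before "-s".
    if re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    # Elsewhere the suffix surfaces unchanged as "-s".
    return stem + "s"


for stem in ["cat", "fox", "city", "day"]:
    print(stem, "->", pluralize(stem))
```

The surface forms "-s", "-es" and "-ies" are thus accounted for as realizations of one suffix, with the differences attributed to rules that apply in particular contexts.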
One closely related effort is that attributed to Andreev [Andreev, N. D. (editor) Statistiko-kombinatornoe modelirovanie iazykov, Moscow and Leningrad, Nauka (1965)] and discussed in Altmann and Lehfeldt [Altmann, Gabriel and Werner Lehfeldt, Einführung in die Quantitative Phonologie, Quantitative Linguistics vol. 7. Bochum, Studienverlag Dr. N. Brockmeyer (1980), esp. pp. 195ff], though their description is limited and does not facilitate comparison with the present approach. Dzeroski and Erjavec [Dzeroski, S. and T. Erjavec. Induction of Slovene nominal paradigms, in Nada Lavrac, Saso Dzeroski (eds.): Inductive Logic Programming, 7th International Workshop, ILP-97, Prague, Czech Republic, Sep. 17-20, 1997, Proceedings, Lecture Notes in Computer Science, Vol. 1297, Springer, pp. 17-20 (1997)] report on work that they have done on Slovene, a South Slavic language with a complex morphology, in the context of a similar project. Their goal essentially was to see if an inductive logic program could infer the principles of Slovene morphology to the point where it could correctly predict the nominative singular form of a word if it were given an oblique (non-nominative) form. Their project apparently shares with the present invention the requirement that the automatic learning algorithm be responsible for the decision as to which letters constitute the stem and which are part of the suffix(es), though the details offered by Dzeroski and Erjavec are sketchy as to how this is accomplished. The learning algorithm is presented with a labeled pair of words: a base form and an inflected form. It is not clear from their description whether the base form that they supply is a surface form from a particular point in the inflectional paradigm (the nominative singular), or a more articulated underlying representation in a generative linguistic sense; the former appears to be their policy.
Dzeroski and Erjavec's goal is the development of rules couched in traditional linguistic terms; the categories used for labeling and for analysis are decided upon ahead of time by the programmer (or, more specifically, the tagger of the corpus), and each individual word is identified with regard to what morphosyntactic features it bears. The form bolecina is marked, for example, as a feminine noun in the genitive singular.
In some respects, the problem in the present application is akin to the problem of word-breaking in Asian languages, a problem that has been tackled by a wide range of researchers. From the current perspective, the most interesting attempt is that provided by de Marcken [de Marcken, Carl. Unsupervised Language Acquisition, Ph.D. dissertation, MIT (1995)], which describes an unsupervised learning algorithm for the development of a lexicon that de Marcken applies to a written corpus of Chinese, as well as to written and spoken corpora of English (the English text has had the spaces between words removed). De Marcken employs a Minimum Description Length (MDL) framework [Rissanen, Jorma. Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Co. (1989)] and reports excellent results. The algorithm begins by taking all individual characters to be the baseline lexicon, and it successively adds items to the lexicon if the items will be useful
