Method for generating descriptors for the classification of text

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Patent

G06F 17/28

active

060385278

DESCRIPTION:

BRIEF SUMMARY
BACKGROUND OF THE INVENTION

The invention relates to a method for generating descriptors for the classification of natural language texts.
The classification of a text is its assignment to a specific text class and is an important preprocessing step for the automatic further processing of texts. A preceding classification is of considerable importance particularly for the automatic interpretation of texts, because it considerably limits the expenditure for the knowledge base that needs to be maintained (e.g., dictionary memory, syntactic and semantic structure definitions) and can greatly increase the recognition performance.
Text classification can be divided roughly into two steps, namely the extraction of descriptors and, based on this, the assignment to a class. The selection of the descriptors is of essential importance. The selection is a problem especially for natural language texts having a variety of word forms.
For texts in the English language, which has little morphological variation, the use of complete word forms or phrases is proposed in "Feature Selection and Feature Extraction for Text Categorization" by D. Lewis in Proc. of Speech and Natural Language Workshop 1992. For classification tasks in morphologically richer languages, word segments can be used as descriptors, e.g., by breaking the text down into n-grams as in "N-Gram-Based Text Categorization" by Cavnar/Trenkle in Proc. of Int. Symp. on Document Analysis and Information Retrieval 1994, or by a reduction to basic forms as in "Using IR Techniques for Text Classification in Document Analysis" by R. Hoch in Proc. of SIGIR, 1994.
While the n-gram breakdown results in a very large number of descriptors, the reduction to basic forms requires an expensive analysis for the preparation of the necessary knowledge base. The known procedures are also susceptible to errors in the examined texts, such as typing errors or recognition errors in character recognition or speech recognition.
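The growth in the number of descriptors under an n-gram breakdown can be illustrated with a minimal Python sketch. The function name `char_ngrams`, the underscore padding at word boundaries, and the sample text are assumptions made for illustration only; they are not taken from the cited publications.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of every word in `text`."""
    grams = []
    for word in text.lower().split():
        # Padding with "_" marks word boundaries (an assumption of this sketch).
        padded = f"_{word}_"
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


sample = "classification of natural language texts"
trigrams = char_ngrams(sample, 3)
print(len(sample.split()), "word forms ->", len(set(trigrams)), "distinct trigrams")
```

Even for this short sample, the distinct trigrams outnumber the word forms several times over, which is the effect the paragraph above refers to.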


SUMMARY OF THE INVENTION

It is the object of the present invention to propose a method for generating descriptors which, in a simple manner, generates a set of descriptors suitable for the classification on the basis of training texts.
This is achieved by a method for classifying a natural language text by means of descriptors, including the steps of: extracting word forms during a training phase on the basis of a plurality of training texts; breaking down the word forms occurring in the text in such a manner that longer word forms which comprise shorter word forms occurring in the text are broken down into the shorter word forms and, optionally, remaining word segments; and forming the descriptors from the word forms and word segments which are left after the breakdown. The dependent claims comprise advantageous features and modifications of the invention, which will become apparent from the following description.
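A single breakdown cycle of this kind might be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed method itself: the names `break_down` and `descriptors`, the minimum length `min_len` for a contained word form, and the rule of splitting around the longest contained word form are all choices made for this sketch.

```python
def break_down(word: str, vocabulary: set[str], min_len: int = 3) -> list[str]:
    """Split `word` around the longest shorter vocabulary entry it contains.

    Returns [word] unchanged if no shorter entry of at least `min_len`
    characters occurs inside it. `min_len` is an assumption of this sketch.
    """
    candidates = [v for v in vocabulary
                  if min_len <= len(v) < len(word) and v in word]
    if not candidates:
        return [word]
    # Longest candidate wins; equal lengths are resolved alphabetically
    # so that the result is deterministic.
    part = max(candidates, key=lambda v: (len(v), v))
    start = word.index(part)
    left, right = word[:start], word[start + len(part):]
    # Keep the contained word form and any non-empty remaining segments.
    return [seg for seg in (left, part, right) if seg]


def descriptors(word_forms: set[str], min_len: int = 3) -> set[str]:
    """Collect the word forms and segments left after one breakdown cycle."""
    result: set[str] = set()
    for w in word_forms:
        result.update(break_down(w, word_forms, min_len))
    return result


forms = {"word", "form", "forms", "wordforms"}
print(sorted(descriptors(forms)))
```

In this example "wordforms" is split into "word" and "forms", and "forms" into "form" and the remaining segment "s". Because "forms" is produced again by the split of "wordforms", it is still present after this single cycle.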
The special advantage of the invention is that no knowledge specifications or only simple knowledge specifications are needed and that the method can thus be used easily for new applications. An advantageous embodiment, for example, provides a morphologically based limitation with respect to the word segments developing during the breakdown as a simple knowledge specification. The method according to the invention also specifically considers significant spelling errors or recognition errors in relevant descriptors on the assumption that such errors will likewise occur in the training texts and in the texts which are to be classified later.
Preferably, the breakdown is carried out repeatedly, with the word segments remaining in one breakdown cycle being treated like word forms in the following breakdown cycle. The word forms and word segments obtained after the breakdown, optionally a multiple breakdown, may still contain different variants of simpler basic forms produced through inflection or affixes. By separating prefixes and suffixes (including inflectional forms), the variety of word form
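The repeated breakdown mentioned at the start of this paragraph can be sketched in Python as a loop that feeds the output of one cycle into the next until nothing changes. The helper names, the `min_len` threshold, the longest-match splitting rule, and the `max_cycles` guard are assumptions of this sketch, not details given in the text.

```python
def split_once(item: str, vocabulary: set[str], min_len: int = 3) -> list[str]:
    """Split `item` around the longest shorter vocabulary entry it contains."""
    candidates = [v for v in vocabulary
                  if min_len <= len(v) < len(item) and v in item]
    if not candidates:
        return [item]
    part = max(candidates, key=lambda v: (len(v), v))  # deterministic choice
    start = item.index(part)
    pieces = (item[:start], part, item[start + len(part):])
    return [p for p in pieces if p]


def breakdown_cycle(items: set[str], min_len: int = 3) -> set[str]:
    """One cycle: every item is split against the current set of items."""
    result: set[str] = set()
    for item in items:
        result.update(split_once(item, items, min_len))
    return result


def repeated_breakdown(word_forms: set[str], min_len: int = 3,
                       max_cycles: int = 10) -> set[str]:
    """Repeat cycles, treating leftover segments like word forms,
    until the set no longer changes or `max_cycles` is reached."""
    current = set(word_forms)
    for _ in range(max_cycles):
        following = breakdown_cycle(current, min_len)
        if following == current:
            break
        current = following
    return current


forms = {"word", "form", "forms", "wordforms"}
print(sorted(repeated_breakdown(forms)))
```

With these sample word forms, the first cycle still leaves "forms" in the set; the second cycle reduces it to "form" and "s", after which the set is stable.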

REFERENCES:
patent: 4771401 (1988-09-01), Kaufman et al.
patent: 5251129 (1993-10-01), Jacobs et al.
patent: 5331556 (1994-07-01), Black, Jr. et al.
patent: 5490061 (1996-02-01), Tolin et al.
patent: 5745602 (1998-04-01), Chen et al.
R. E. Kimbrell: "Searching for Text? Send an N-Gram!". In: Byte, vol. 13, No. 5, May 1988, pp. 297-312.
R. Hoch: "Using IR techniques for text classification in document analysis". In: SIGIR '94. Jul. 3-6, 1994, Dublin, Ireland, pp. 31-40.
W. Barth: "Volltextsuche mit sinnentsprechender Wortzerlegung". In: Wirtschaftsinformatik, Oct. 1990, Germany, vol. 32, No. 5, pp. 467-471.
K. Kotzias: "How to respond to different language particularities by indexing texts using automatic text analysis". In: Online Information 90, London, UK, Dec. 1990, pp. 62-68.
