Method of automatically classifying a text appearing in a...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06295543

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to a method for classifying text by significant words in the text.
2. Description of the Related Art
From the reference A. Dengel et al., ‘Office Maid—A System for Office Mail Analysis, Interpretation and Delivery’, Int. Workshop on Document Analysis Systems, a system is known by means of which, for example, business letter documents can be categorized and can then be forwarded, or stored selectively, in electronic form or paper form. For this purpose, the system contains a unit for segmenting the layout of the document, a unit for optical text recognition, a unit for address detection and a unit for contents analysis and categorization. For the segmentation of the document, a mixed bottom-up and top-down approach is used, the individual steps of which are
Recognition of the contiguous components,
Recognition of the text lines,
Recognition of the letter segments,
Recognition of the word segments, and
Recognition of the paragraph segments.
The optical text recognition is divided into three parts:
Letter recognition in combination with lexicon-based word verification,
Word recognition, with the classification from letters and word-based recognition.
The address recognition is performed by means of a unification-based parser which operates with an attributed context-free grammar for addresses. Accordingly, text parts correctly parsed in the sense of the address grammar are the addresses. The contents of the addresses are determined via character equations of the grammar. The method is described in the reference M. Malburg and A. Dengel, ‘Address Verification in Structured Documents for Automatic Mail Delivery’.
Information retrieval techniques for the automatic indexing of texts are used for the contents analysis and categorization. In detail, this takes place as follows:
Morphological analysis of the words
Elimination of stop words
Generation of word statistics
Calculation of the index term weight by means of formulas known from information retrieval such as, for example, inverse document frequency.
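The last two steps above can be sketched as follows. This is only a minimal illustration, assuming the plain inverse document frequency log(N/df) as the weight; the stop list, the whitespace tokenizer (standing in for a real morphological analysis), and the names `index_terms` and `idf_weights` are hypothetical and do not come from the cited system.

```python
import math
from collections import Counter

# Hypothetical stop list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of", "and", "to", "is"}

def index_terms(text):
    """Lowercase, split on whitespace, and drop stop words.

    This stands in for morphological analysis, which is not modeled here.
    """
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def idf_weights(documents):
    """Return {term: log(N / df)}, where df is the number of documents
    that contain the term and N is the total number of documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term at most once per document.
        doc_freq.update(set(index_terms(doc)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

docs = ["the invoice is overdue", "invoice and order", "order of the day"]
weights = idf_weights(docs)
```

Terms that occur in fewer documents receive a higher weight, so "overdue" (one document) outweighs "invoice" (two documents) in this toy collection.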
The index term weights calculated in this manner are then used for determining for all categories a three-level list of significant words which characterizes the respective category. As described in the reference A. Dengel et al., ‘Office Maid—A System for Office Mail Analysis, Interpretation and Delivery’, Int. Workshop on Document Analysis Systems, these lists are then manually revised after the training phase.
A new business letter is then categorized by comparing the index terms of this letter with the lists of the significant words for all categories. The weights of the index terms contained in the letter are multiplied by a constant depending on significance and are added together. Dividing this sum by the number of index terms in the letter then results in a probability for each class. The detailed calculations are found in the reference R. Hoch, ‘Using IR Techniques for Text Classification in Document Analysis’. The result of the contents analysis is then a list of hypotheses sorted according to probabilities.
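The scoring just described might look roughly like the sketch below. The three significance constants in `LEVEL_FACTOR` are invented for illustration, since the text does not state their values, and the function names and data layout are likewise assumptions rather than the cited system's interface.

```python
# Hypothetical constants for the three significance levels; the actual
# values are not given in the text.
LEVEL_FACTOR = {1: 1.0, 2: 0.5, 3: 0.25}

def category_score(term_weights, category_levels):
    """Score one category for one letter.

    term_weights:    {index term: weight} for the terms found in the letter.
    category_levels: {significant word: level 1..3} for the category.
    """
    if not term_weights:
        return 0.0
    total = sum(
        weight * LEVEL_FACTOR[category_levels[term]]
        for term, weight in term_weights.items()
        if term in category_levels
    )
    # Divide by the number of index terms in the letter.
    return total / len(term_weights)

def rank_categories(term_weights, categories):
    """Return (category, score) pairs sorted by descending score."""
    scores = {
        name: category_score(term_weights, levels)
        for name, levels in categories.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

letter = {"invoice": 0.8, "payment": 0.6, "hello": 0.2, "tomorrow": 0.4}
categories = {
    "billing": {"invoice": 1, "payment": 2},
    "greeting": {"hello": 3},
}
hypotheses = rank_categories(letter, categories)
```

The sorted list plays the role of the hypotheses ordered by probability; in this toy input "billing" ranks ahead of "greeting".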
SUMMARY OF THE INVENTION
The object of the present invention is to provide a method that improves the contents analysis of the text and thus the text classification. In this connection, it is assumed that the text of the document is already available as digital data which are then processed further.
This object is achieved by a method for the automatic classification, with the aid of a computer, of a text applied to a document after the text has been transformed into digital data. In this method, each text class is defined by significant words; the significant words and their significance to the text class are stored in a lexicon file for each text class; a text to be allocated is compared with all text classes and, for each text class, the fuzzy set of words occurring in both the text and the text class, together with its significance to the text class, is determined; the probability of the allocation of the text to the text class is determined from the fuzzy set of each text class and its significance to each text class; and the text class with the highest probability is selected and the text is allocated to this class.
Further developments of the invention are provided by further steps, wherein the text to be classified is morphologically analyzed in a morphological analyzer preceding the contents analysis, the morphologically analyzed text is supplied to a stochastic tagger in order to resolve lexical ambiguities, and the tagged text is used for text classification. Preferably, a relevance lexicon is generated for the classification of the text; for this purpose, a set of training texts is used, the classes of which are known; the frequencies of the classes, of words and of words in the respective classes are counted from this set; an empirical correlation between a word and class is calculated by means of these frequencies; this correlation is calculated for all words and all classes and the result of the calculation is stored in a file as a relevance of a word to a class, which file is used as a relevance file or a relevance lexicon.
In one embodiment, the correlation (or relevance) between a word and a class is established in accordance with the following formula:
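$$\mathrm{rlv}(w \text{ in } c) := r(w,c) = \frac{N \cdot \sum wc - \sum w \cdot \sum c}{\sqrt{\left(N \cdot \sum w^2 - \left(\sum w\right)^2\right) \cdot \left(N \cdot \sum c^2 - \left(\sum c\right)^2\right)}}$$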
where:
N = number of training texts,
Σwc = number of training texts of class c with word w,
Σw = number of training texts with word w,
Σc = number of training texts of class c.
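As a hedged illustration, the relevance of a word to a class can be computed from these four counts as an empirical (Pearson) correlation. The sketch assumes that w and c are 0/1 indicators per training text ("text contains w", "text belongs to c"), so that the sum of squares equals the plain sum; the function names and the nested-dictionary layout of the lexicon are hypothetical.

```python
import math
from collections import Counter

def relevance(n_texts, n_word_and_class, n_word, n_class):
    """Empirical correlation between 'text contains w' and 'text is in c'.

    Assumes 0/1 indicators, so the sum of w squared equals the sum of w,
    and likewise for c.
    """
    numerator = n_texts * n_word_and_class - n_word * n_class
    spread_w = n_texts * n_word - n_word ** 2
    spread_c = n_texts * n_class - n_class ** 2
    denominator = math.sqrt(spread_w * spread_c)
    # A word or class present in all (or no) texts carries no information.
    return numerator / denominator if denominator > 0 else 0.0

def build_relevance_lexicon(training):
    """training: list of (iterable of words, class label) pairs.

    Returns {class: {word: relevance}} for all words and all classes.
    """
    n_texts = len(training)
    class_count = Counter(label for _, label in training)
    word_count = Counter()
    pair_count = Counter()
    for words, label in training:
        for w in set(words):  # count each word once per text
            word_count[w] += 1
            pair_count[(w, label)] += 1
    return {
        c: {
            w: relevance(n_texts, pair_count[(w, c)], word_count[w], class_count[c])
            for w in word_count
        }
        for c in class_count
    }

training = [
    ({"fever", "cough"}, "flu"),
    ({"fever"}, "flu"),
    ({"rash"}, "allergy"),
    ({"cough", "rash"}, "allergy"),
]
lexicon = build_relevance_lexicon(training)
```

In this toy training set "fever" occurs in exactly the "flu" texts and is therefore perfectly correlated with that class, while "cough" is spread evenly over both classes and has zero relevance.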
One embodiment provides that only correlations greater than a selected value r-max are taken into consideration; this value is established at a significance level of, e.g., 0.001. In such an embodiment, the text to be examined and the relevance lexicon are used to determine, for each class, the fuzzy set of significant words and its relevance to that class. From the fuzzy set per class and its relevance to each class, the probability of the fuzzy set of relevant words is calculated. The class with the maximum probability is then determined from the probabilities per class, and the text is allocated to this class.
In this example, the probability is calculated in accordance with the formula
$$\mathrm{prob}(A) := \sum_{x} \mu_A(x) \cdot p(x),$$
where μ_A is the membership function which specifies the extent to which the fuzzy set is allocated to a class, and which just corresponds to the correlation measure according to the above formula.
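A minimal sketch of this classification step follows. It takes the membership value of a word to be its relevance to the class and, as an assumption not stated in this passage, takes p(x) to be the relative frequency of word x in the text to be classified. The function name `classify`, the lexicon layout {class: {word: relevance}}, and the example values are hypothetical.

```python
from collections import Counter

def classify(text_words, relevance_lexicon, r_max=0.0):
    """Return (best class, {class: probability of its fuzzy set}).

    relevance_lexicon: {class: {word: relevance}}; only relevances
    greater than r_max are taken into consideration.
    """
    counts = Counter(text_words)
    total = sum(counts.values())
    probabilities = {}
    for cls, relevances in relevance_lexicon.items():
        # Sum over words x of membership(x) * p(x), with p(x) assumed to be
        # the relative frequency of x in the text.
        probabilities[cls] = sum(
            relevances[word] * count / total
            for word, count in counts.items()
            if relevances.get(word, 0.0) > r_max
        )
    best = max(probabilities, key=probabilities.get)
    return best, probabilities

# Hypothetical relevance values for two classes.
example_lexicon = {
    "flu": {"fever": 0.9, "cough": 0.5},
    "allergy": {"rash": 0.8, "cough": 0.3},
}
best_class, class_probs = classify(
    ["fever", "cough", "cough", "headache"], example_lexicon, r_max=0.4
)
```

With the threshold at 0.4, the weak relevance of "cough" to "allergy" is ignored, so only "flu" accumulates a nonzero value and is selected.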
The present method may be used for automatic diagnosis from medical findings, in which the medical findings are considered to be the text and an illness is considered to be a class, in which method in a training phase the knowledge required for the classification is automatically learned from a set of findings the diagnosis of which is known, and a new finding is classified in accordance with the technique of fuzzy sets.
A case of application of the method is the automatic diagnosis from medical findings. If a medical finding is considered to be a text and an illness is considered to be a class, the problem of automatic diagnosis can be solved by means of the method of text classification. It is a considerable advantage of the method that it learns the knowledge needed for the classification automatically and unsupervised from a set of findings the diagnosis of which is known. There is no additional effort required by the doctor who only needs to write down the finding as usual. The learning takes place from the findings already in existence. After the training phase, a finding is then classified with the aid of the learned knowledge source and techniques of fuzzy sets. The class allocated to the findings corresponds to the illness diagnosed.
It is initially assumed that the tex
