Text classifying parameter generator and a text classifier...

Data processing: presentation processing of document – operator i – Presentation processing of document – Layout

Reexamination Certificate

Details

C707S793000, C707S793000, C707S793000, C707S793000, C707S793000, C705S002000, C705S003000, C702S179000

active

06704905

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention generally relates to a text classifier for classifying a given text into one or more of a set of predetermined categories and, more specifically, to a method and system for generating and training (or optimizing) parameters for use in such a text classifier.
2. Description of the Prior Art
Text data stored in computer-based systems are increasing in amount and variety day by day. Such stored natural language text data include academic theses, patent documents, news articles, etc. In order for the stored text data to be effectively utilized as information, it is necessary to classify each item of the stored text data into an appropriate category or categories. Various types of text classifiers have been proposed for this purpose.
The present invention relates to a text classification technique, inter alia, of the type that uses a vector space. Vector space-based text classification techniques are disclosed in, for example:
U.S. Pat. No. 5,671,333 issued Sep. 23, 1997 to J. A. Catlett et al., entitled “Training apparatus and methods”;
U.S. Pat. No. 6,192,360 issued Feb. 20, 2001 to S. T. Dumais et al., entitled “Methods and apparatus for classifying text and for building a text”, which introduces a variety of classification techniques including the theory and operation of Support Vector Machines;
Japanese patent unexamined publication No. 11-053394 (1999), by N. Nomura, entitled “Device and method for document processing and storage medium storing”; and
Japanese patent unexamined publication No. 2000-194723 (2000), by K. Mitobe et al., entitled “Similarity display device, storage medium stored with similarity display program, document processor, storage medium stored with document processing program and document processing method”.
All of the references cited above are incorporated herein by reference.
In vector space-based text classifiers, an M-dimensional vector space is spanned by a basis comprising a set of vectors V1, V2, . . . , VM corresponding to the M words W1, W2, . . . , WM that constitute a dictionary. An object or text to be classified is expressed as a point in the vector space. That is, a text or document to be classified is expressed as a feature vector (or document vector) which is a linear combination of the basis (V1, V2, . . . , VM). Each component of the feature vector of a given text is expressed by using the frequency of occurrence, in the given text, of the word associated with that component. Each of the categories in the category set into which an object text is classified is expressed by a reference vector defined for the category. Again, each reference vector is expressed as a linear combination of the basis (V1, V2, . . . , VM). The degree of closeness of a given text to a class or category is calculated by finding an inner product of the feature vector of the given text and the reference vector for the category, or by finding a distance between the two vectors. Whether the given text belongs to the category or not is determined on the basis of the calculated degree of closeness.
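The following Python fragment is a minimal sketch of the vector space model described above, not code from the patent. The dictionary, the sample text, the reference vector and all function names are hypothetical, and whitespace tokenization is assumed for simplicity.

```python
from collections import Counter
import math

def feature_vector(text, dictionary):
    """Count how often each dictionary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]

def inner_product(u, v):
    """Closeness as an inner product: larger means closer."""
    return sum(a * b for a, b in zip(u, v))

def euclidean_distance(u, v):
    """Closeness as a distance: smaller means closer."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical dictionary W1..WM (M = 3) and reference vector for one category.
dictionary = ["patent", "vector", "news"]
reference = [1.0, 2.0, 0.0]

doc = feature_vector("vector space patent vector", dictionary)
print(doc, inner_product(doc, reference), euclidean_distance(doc, reference))
```

Words that are not in the dictionary (here "space") simply do not contribute to the feature vector.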
The dimension of the feature vectors may be reduced by applying a lower-rank approximation, through the singular value decomposition, to a document-word matrix obtained by arranging the feature vectors of the documents in a set of documents to be classified. Each component of such a dimension-reduced feature vector for an object document reflects not the frequency of a word itself but the extent to which the object document relates to a set of (weighted) words. In this case, mathematical operations such as distance calculations, inner product calculations and so on are possible in the same manner as in the case of the original vector space.
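The dimension reduction can be illustrated with the following sketch, which is an assumption-laden example rather than the method of any cited reference. To stay within the Python standard library it approximates the leading singular vectors by power iteration with deflation instead of calling a full singular value decomposition routine; the matrix and all function names are hypothetical.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def top_singular_triplet(matrix, iterations=200):
    """Estimate the largest singular value and its vectors by power iteration."""
    rows, cols = len(matrix), len(matrix[0])
    transposed = [list(col) for col in zip(*matrix)]
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    v = [rng.random() for _ in range(cols)]
    for _ in range(iterations):
        w = matvec(transposed, matvec(matrix, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # nothing left to extract
            return 0.0, [0.0] * rows, [0.0] * cols
        v = [x / norm for x in w]
    av = matvec(matrix, v)
    sigma = math.sqrt(sum(x * x for x in av))
    if sigma == 0.0:
        return 0.0, [0.0] * rows, [0.0] * cols
    return sigma, [x / sigma for x in av], v

def reduce_dimension(doc_word, k):
    """Project each document row onto the top-k right singular vectors."""
    residual = [row[:] for row in doc_word]
    basis = []
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual)
        basis.append(v)
        # Deflate: subtract the rank-one component that was just found.
        residual = [[residual[i][j] - sigma * u[i] * v[j]
                     for j in range(len(v))] for i in range(len(u))]
    return [[sum(a * b for a, b in zip(row, v)) for v in basis]
            for row in doc_word]

# Hypothetical document-word matrix: 3 documents (rows) by 3 words (columns).
doc_word = [[2.0, 2.0, 0.0],
            [1.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
reduced = reduce_dimension(doc_word, 2)
```

Each column of the reduced matrix corresponds to a weighted set of words (one right singular vector), so inner products and distances can be computed on the reduced rows just as on the original feature vectors. The sign of each reduced coordinate depends on the starting vector and carries no meaning of its own.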
The decision of a vector space-based classifier on whether a document belongs to a particular category depends on the reference vectors associated with the respective categories and on the threshold of the degree of closeness within which the document is classified into the particular category. The components of the reference vectors and the threshold values of the degrees of closeness for all the categories of a set of categories are called “classification parameters”. In order to achieve accurate classification, the classification parameters have to be properly determined or optimized.
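As an illustration of how the classification parameters drive the decision, the sketch below stores one reference vector and one threshold per category and assigns a document to every category whose closeness reaches the threshold. It assumes that closeness is an inner product and that a larger value means closer; the class name, category names and numeric values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CategoryParams:
    """Classification parameters of one category (hypothetical layout)."""
    name: str
    reference: list   # reference vector of the category
    threshold: float  # minimum closeness required for membership

def closeness(feature, reference):
    """Inner product of the feature vector and the reference vector."""
    return sum(a * b for a, b in zip(feature, reference))

def classify(feature, params):
    """Return the names of all categories whose threshold is reached."""
    return [p.name for p in params
            if closeness(feature, p.reference) >= p.threshold]

params = [
    CategoryParams("patents", [1.0, 0.5, 0.0], 2.0),
    CategoryParams("news", [0.0, 0.2, 1.0], 1.0),
]
doc = [2, 1, 0]  # hypothetical feature vector
print(classify(doc, params))
```

With these values the closeness to "patents" is 2.5, so the document is assigned to that category; raising that threshold to 3.0 would leave the document unclassified, which shows why the parameters have to be tuned.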
In conventional parameter training, samples (i.e., documents selected for training) are classified by using a classifier with roughly determined initial classification parameters. After the classification result is reviewed, the classification parameters are modified. This trial-and-error process is iterated until satisfactory classification is reached. The modification of classification parameters is achieved either by an operator directly modifying the parameters himself or herself, or by an operator correcting the classification results and the classifier recalculating the parameters through machine learning based on the operator's corrections.
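The text does not say which learning rule is used to recalculate the parameters from the operator's corrections. Purely as an assumed stand-in, the sketch below recomputes a category's reference vector as the centroid of the sample documents that the operator has confirmed as members. All names and data are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def retrain_reference(samples, corrected_labels):
    """Recompute a reference vector from operator-corrected results.

    samples: document ID -> feature vector
    corrected_labels: document ID -> True if the operator says the
        document belongs to the category, False otherwise
    """
    members = [samples[doc_id]
               for doc_id, is_member in corrected_labels.items() if is_member]
    if not members:
        raise ValueError("no confirmed member documents to learn from")
    return centroid(members)

# Hypothetical sample set and operator corrections for one category.
samples = {"d1": [2, 0], "d2": [4, 2], "d3": [0, 5]}
corrected_labels = {"d1": True, "d2": True, "d3": False}
new_reference = retrain_reference(samples, corrected_labels)
print(new_reference)
```

In a trial-and-error loop, the samples would be reclassified with the new reference vector and the operator would review the result again until the classification is satisfactory.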
However, in direct modification schemes, it is difficult for the operator to know which of a large number of parameters to modify and how to modify the one or more parameters selected for modification. Also, in classification result correcting schemes, it is difficult for the operator to know which of a large number of classification results to correct. These difficulties make the classification parameter modification a time-consuming task, which does not necessarily yield desirable classification parameters.
The present invention has been made to overcome the above and other problems in the art.
What is needed is a classification parameter generating method and system for enabling the operator to train the classification parameters interactively and effectively through various data analysis and selection tools.
What is needed is a classification parameter generating method and system that can be used for the case where each of the reference vectors for the categories is considered to point to statistically distributed points instead of a fixed point.
What is needed is a classification parameter generating method and system capable of calculating hitting rates for the samples that have been reviewed. The hitting rate for a category Cr is the ratio of the number of documents whose CDOM and evaluated CDOM for the category Cr equal each other to the number of documents whose CDOM for the category Cr has been evaluated.
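The hitting rate defined above can be computed as in the following sketch. The term CDOM is not defined in this excerpt, so the code treats the calculated and evaluated values as opaque per-document values that are compared for equality; the function name, the dictionary layout and the sample data are assumptions.

```python
def hitting_rate(calculated, evaluated):
    """Fraction of evaluated documents whose calculated value matches.

    calculated: document ID -> value calculated for category Cr
    evaluated: document ID -> value assigned on review, holding only
        the documents that have actually been evaluated
    Returns None when no document has been evaluated yet.
    """
    if not evaluated:
        return None
    hits = sum(1 for doc_id, value in evaluated.items()
               if calculated[doc_id] == value)
    return hits / len(evaluated)

# Hypothetical data: four documents were scored, three were reviewed.
calculated = {"d1": 1, "d2": 0, "d3": 1, "d4": 1}
evaluated = {"d1": 1, "d2": 1, "d3": 1}
rate = hitting_rate(calculated, evaluated)
print(rate)
```

Document d4 has not been reviewed, so it does not enter the denominator; two of the three reviewed documents match.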
What is needed is a classification parameter generating method and system with sample set generating and expanding capabilities. What is needed is a text classifier that uses a plurality of sets of classification parameters.
What is needed is a text classifier for determining whether a given text belongs to a specified category.
SUMMARY OF THE INVENTION
According to the principles of the invention, a method of and system for generating a set of parameters for use in determining whether a given document belongs to a specified one of a plurality of predetermined categories are provided. The system comprises a set of documents, each document having an identifier (ID); a document data set containing a record for each document, which record contains a document ID of the document and a feature vector representing features of the document in a predefined vector space; and a category data set containing a record for each category, which record contains a category ID of the category, a category name and the set of parameters. The parameters include a reference vector representing features of the category in the predefined vector space and a threshold value determined for the category. In this system, a membership score indicative of whether the document belongs to the specified category is calculated for each document by using the feature vector of the document, the reference vector of the specified category and a threshold value of the specified category. An evaluation sample selection screen enables an operator to interactively enter various command parameters for selecting documents for which the calculated membership scores
