Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-10-22
2001-04-03
Hong, Stephen S. (Department: 2176)
C707S793000
active
06212532
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to computer text classification and, more particularly, to a framework which provides an environment where testing several options can be done in an efficient and structured manner.
2. Background Description
Businesses and institutions generate many documents in the course of their commerce and activities. These are typically written for exchange between persons without any plan for machine storage and retrieval. The documents, for purposes of differentiation, are described as “natural language” documents as distinguished from documents or files written for machine storage and retrieval.
Natural language documents have for some time been archived on various media, originally as images and more recently as converted data. More specifically, documents available only in hard copy form are scanned and the scanned images processed by optical character recognition software to generate machine language files. The generated machine language files can then be compactly stored on magnetic or optical media. Documents originally generated by a computer, such as with a word processor, spreadsheet or database software, can of course be stored directly to magnetic or optical media. In the latter case, the formatting information is part of the data stored, whereas in the case of scanned documents, such information is typically lost.
There is a significant advantage from a storage and archival standpoint to storing natural language documents in this way, but there remains a problem of retrieving information from the stored documents. In the past, this has been accomplished by separately preparing an index to access the documents. Of course, the effectiveness of this technique depends largely on the design of the index. A number of full text search software products have been developed which will respond to structured queries to search a document database. These, however, are effective only for relatively small databases and are often application dependent; that is, capable of searching only those databases created by specific software applications.
The natural language documents of a business or institution represent a substantial resource for that business or institution. However, that resource is only as valuable as the ability to access the information it contains. Considerable effort is now being made to develop software for the extraction of information from natural language documents. Such software is generally in the field of knowledge based or expert systems and uses such techniques as parsing and classifying. The general applications, in addition to information extraction, include classification and categorization of natural language documents and automated electronic data transmission processing and routing, including E-mail and facsimile.
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide an environment where testing out several options can be done in an efficient and structured manner.
The present invention describes a method and apparatus for computer text classification and, more particularly, a framework which provides an environment where testing several options can be done in an efficient and structured manner.
The process of the present invention includes mainly:
1. Feature definition: Typically this involves breaking the text up into tokens. Tokens can then be reduced to their stems or combined to multi-word terms.
2. Feature count: Typically this involves counting the frequencies of tokens in the input texts. Tokens can be counted by their absolute frequency or by several relative frequencies (relativized to the document length, the most frequent token, square root, etc.).
3. Feature selection: This step includes weighting features (e.g., depending on the part of the input text they occur in: title vs. body) and filtering features depending on how distinctive they are for texts of a certain class (filtering can be done by stop word list, based on in-class vs. out-class frequency, etc.).
The present invention provides tools for all these tasks.
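The three steps above can be illustrated with a minimal Python sketch. It is not the toolkit's actual code: the function names, the stop word list, and the title weight of 2.0 are hypothetical, stemming and multi-word terms are omitted, and "square root" is read here as the square root of the raw count, which is only one possible interpretation of the text.

```python
import math
import re
from collections import Counter

# Hypothetical stop word list; the source does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def define_features(text):
    """Feature definition: break text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_features(tokens):
    """Feature count: absolute frequency plus three relative variants."""
    absolute = Counter(tokens)
    if not absolute:
        return {}
    doc_length = len(tokens)
    max_count = max(absolute.values())
    return {
        token: {
            "absolute": n,
            "by_length": n / doc_length,  # relative to document length
            "by_max": n / max_count,      # relative to most frequent token
            "sqrt": math.sqrt(n),         # assumed reading of "square root"
        }
        for token, n in absolute.items()
    }


def select_features(title, body, title_weight=2.0):
    """Feature selection: drop stop words, weight title tokens above body tokens."""
    weights = Counter()
    for token in define_features(title):
        if token not in STOP_WORDS:
            weights[token] += title_weight
    for token in define_features(body):
        if token not in STOP_WORDS:
            weights[token] += 1.0
    return dict(weights)
```

For example, select_features("Cat news", "the cat sat") drops "the" and gives "cat" a higher weight than "sat" because it also occurs in the title.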
The apparatus of the present invention includes a raw annotated input data document collection means, which collects raw data from an application. The raw data is submitted to a data preparation module, where the data is prepared for testing and training. The data preparation module splits the data randomly according to a user specification and submits a portion of the prepared data to a test data document collection module and a portion of the data to a training data document collection module. The test data document collection module submits the data to be tested to a testing module, while the training data document collection module submits data for training to a feature extraction module. The feature extraction module is divided into a feature definition module and a feature selection module, each having its own configuration files. The feature definition module breaks the text up into tokens, which can then be reduced to their stems or combined to multi-word terms. The feature selection module weights the features. The feature selection module may also filter features depending on how distinctive they are for texts of a certain class.
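The random split performed by the data preparation module might look like the following Python sketch. The function name split_collection, the train_fraction parameter standing in for the "user specification", and the optional seed are all assumptions made for illustration; the source does not say how the specification is expressed.

```python
import random


def split_collection(documents, train_fraction=0.8, seed=None):
    """Randomly split documents into (training, test) lists.

    train_fraction is a hypothetical stand-in for the user specification.
    """
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(documents)              # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)   # seeded for repeatable experiments
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Passing a fixed seed makes a split repeatable, which helps when several feature or learning options are compared on the same training and test collections.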
The extracted data is then submitted to a feature vector module where the extracted data is provided in a vector format, such as a feature count table. This data may then be submitted back to the feature extraction module, where it may then be submitted to a reduced feature vector module. The reduced feature vector module provides the data in a simpler vector format that uses less disk space and is easier to process at a later time. The vector data is then submitted to a machine learning module where an algorithm is applied to the data. At this stage, the present invention stores the various data in a directory tree module, which may store the data in various formats. The testing module then tests the data and provides a precision, recall, accuracy or other statistical analysis of the tested data, as described in detail below. The output of the testing module may be provided in a report format.
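The statistics named for the testing module can be computed as in the following Python sketch, using the standard definitions of precision and recall for a single target class and of overall accuracy. The function name evaluate and its signature are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the source specifies.

```python
def evaluate(true_labels, predicted_labels, target_class):
    """Return precision and recall for target_class, plus overall accuracy."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    # Count outcomes relative to the target class.
    tp = sum(1 for t, p in pairs if t == target_class and p == target_class)
    fp = sum(1 for t, p in pairs if t != target_class and p == target_class)
    fn = sum(1 for t, p in pairs if t == target_class and p != target_class)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }
```

With true labels spam, spam, spam, ham and predictions spam, spam, ham, ham, the target class spam has precision 1.0 and recall 2/3, and the overall accuracy is 0.75.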
REFERENCES:
patent: 5050222 (1991-09-01), Lee
patent: 5117349 (1992-05-01), Tirfing et al.
patent: 5371807 (1994-12-01), Register et al.
patent: 5873056 (1999-02-01), Liddy et al.
patent: 6047277 (2000-04-01), Parry et al.
patent: 6105023 (2000-08-01), Callan
patent: 6137911 (2000-10-01), Zhilyaev
Salton et al., “Automatic structuring and retrieval of large text files”; Commun. ACM 37, 2 (Feb. 1994), pp. 97-108.*
Riloff, “Using cases to represent context for text classification”; Proceedings of the second international conference on Information and knowledge management, 1993, pp. 105-113.*
Hoch, “Using IR techniques for text classification in document analysis”; Proceedings of the seventeenth annual international ACM-SIGIR conference on Research and development in information retrieval, 1994, pp. 31-40.
Hampp-Bahamueller Thomas
Johnson David B.
Hong Stephen S.
International Business Machines - Corporation
Kaufman, Esq. Stephen C.
McGuireWoods LLP