Reexamination Certificate
2000-06-12
2004-02-24
Shah, Sanjiv (Department: 2176)
Data processing: presentation processing of document, operator i
Presentation processing of document
Layout
C706S045000
active
06697998
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to a method of automated labeling of unlabeled text data and, more particularly, to a method that assigns labels without manual intervention and can also be used to extract relevant features for a keyword search of the data.
2. Background Description
Very often, organizations have large quantities of machine-readable text documents to which they would like to assign labels for such purposes as developing a categorizer for new texts, enabling the retrieval of old texts, and the like. These text documents could be various electronic documents, including, among other things, Web pages (the World Wide Web (WWW) portion of the Internet, or simply “the Web”), electronic mail (i.e., e-mail), and collections of Frequently Asked Questions (FAQs). Current solutions to labeling such text documents usually involve a large amount of costly manual labor and cannot be completely automated (i.e., they require manual intervention).
SUMMARY OF THE INVENTION
It is therefore an object of the invention to provide a method of automatically labeling unlabeled text data that is independent of human intervention but does not preclude manual intervention.
It is another object of the present invention to provide a method to extract relevant features of unlabeled text data for a keyword search; that is, an automatic method of adding appropriate linguistic variants as part of an indexing mechanism.
According to the invention, there is provided a method of automated labeling of unlabeled text data. A document collection is established as a reference answer set. A label, e.g., the URL of a Web page, is attached to each document. Members of the answer set are converted to vectors representing centroids of clusters of documents. Unlabeled text data are categorized relative to the centroids by a nearest neighbor algorithm. Then, a supervised machine learning algorithm is trained on the newly labeled data, and a categorization classifier (e.g., a rule based classifier) classifies the data for each cluster. Alternatively, a feature extraction algorithm may be run on classes generated by the step of categorizing, and search features output which index the unlabeled text data.
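The categorization step described above can be illustrated with a minimal sketch. It assumes a simple bag-of-words term-frequency vector and cosine similarity as the nearest neighbor measure; neither choice is specified in the text above, and the function names, URLs, and sample texts are hypothetical. Each answer-set document serves as the centroid of its own cluster, and each unlabeled text receives the label of the most similar centroid. The subsequent supervised training step is not shown.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words term-frequency vector (an assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_by_nearest_centroid(answer_set, unlabeled):
    """answer_set maps a label (e.g., a URL) to reference text.

    Each reference document is treated as the centroid of its cluster;
    every unlabeled text gets the label of the most similar centroid.
    """
    centroids = {label: to_vector(text) for label, text in answer_set.items()}
    return [
        max(centroids, key=lambda lab: cosine(to_vector(doc), centroids[lab]))
        for doc in unlabeled
    ]


# Hypothetical answer set: page URL -> page text.
answer_set = {
    "/faq/password.html": "reset password forgot password login account",
    "/faq/shipping.html": "shipping delivery order tracking package",
}
unlabeled = ["i forgot my password", "where is my package"]
labels = label_by_nearest_centroid(answer_set, unlabeled)
```

The newly labeled pairs of text and label could then serve as training data for whatever supervised learner is chosen.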
Although the invention contemplates a fully automated process of categorizing unlabeled text data or extracting relevant features from the unlabeled text data for keyword search, human intervention may optionally be used to further refine the process. For example, the automated categorizations might be manually checked and updated by shifting documents from one cluster to another, and the data thereafter re-categorized using a nearest neighbor algorithm. These steps would then be iterated until the process stabilizes or some iteration parameter is reached. Also, the document collection established as the reference answer set might be manually augmented and/or edited with additional information useful to the categorization process, e.g., synonyms of words occurring in the documents.
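The iterative refinement described above might be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: centroids are recomputed as the mean term-frequency vector of each cluster, cosine similarity stands in for the nearest neighbor measure, and `max_iter` plays the role of the iteration parameter. All names and sample data are hypothetical; the initial labels represent the state after a reviewer has shifted documents between clusters.

```python
import math
from collections import Counter, defaultdict


def to_vector(text):
    """Bag-of-words term-frequency vector (an assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroids_of(docs, labels):
    """Mean term vector of the documents currently assigned to each label."""
    sums, counts = defaultdict(Counter), Counter(labels)
    for doc, lab in zip(docs, labels):
        sums[lab].update(to_vector(doc))
    return {
        lab: Counter({t: v / counts[lab] for t, v in vec.items()})
        for lab, vec in sums.items()
    }


def refine(docs, labels, max_iter=10):
    """Re-categorize until labels stop changing or max_iter is reached."""
    labels = list(labels)
    for _ in range(max_iter):
        cents = centroids_of(docs, labels)
        new_labels = [
            max(sorted(cents), key=lambda lab: cosine(to_vector(d), cents[lab]))
            for d in docs
        ]
        if new_labels == labels:  # the process has stabilized
            break
        labels = new_labels
    return labels


docs = [
    "reset password login",
    "forgot password login",
    "shipping delivery package",
    "package delivery tracking",
    "tracking package order",
]
# Hypothetical labels after manual review; the last one is still wrong.
reviewed = ["password", "password", "shipping", "shipping", "password"]
refined = refine(docs, reviewed)
```

In this toy data set the last document moves to the shipping cluster on the first pass, and the second pass confirms that no further labels change.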
The method of this invention may use information from several disparate and separate sources, such as a Web site, a database of Frequently Asked Questions (FAQs), and/or databases of other document collections, as the reference answer set. Sets of related Universal Resource Locators (URLs) could also be used in the categorization process.
REFERENCES:
patent: 5684940 (1997-11-01), Freeman et al.
patent: 5724072 (1998-03-01), Freeman et al.
patent: 6263334 (2001-07-01), Fayyad et al.
patent: 6598054 (2003-07-01), Schuetze et al.
patent: 6611825 (2003-08-01), Billheimer et al.
Buskirk, Jr. Martin C.
Damerau Frederick J.
Johnson David E.
Kaufman Stephen C.
Shah Sanjiv
Whitham Curtis & Christofferson, P.C.