Automatic labeling of unlabeled text data

Reexamination Certificate

C706S045000

active

06697998

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to a method of automated labeling of unlabeled text data and, more particularly, to a method that assigns labels without manual intervention and can also be used to extract relevant features for a keyword search of the data.
2. Background Description
Organizations often hold large quantities of machine-readable text documents to which they would like to assign labels, for purposes such as developing a categorizer for new texts or enabling the retrieval of old texts. These text documents could be various electronic documents, including, among other things, Web pages (the World Wide Web (WWW) portion of the Internet, or simply “the Web”), electronic mail (i.e., e-mail), and collections of Frequently Asked Questions (FAQs). Current solutions to labeling such text documents usually involve a large amount of costly manual labor and cannot be completely automated (i.e., they require manual intervention).
SUMMARY OF THE INVENTION
It is therefore an object of the invention to provide a method of automatically labeling unlabeled text data that is independent of human intervention but does not preclude manual intervention.
It is another object of the present invention to provide a method to extract relevant features of unlabeled text data for a keyword search; that is, an automatic method of adding appropriate linguistic variants as part of an indexing mechanism.
According to the invention, there is provided a method of automated labeling of unlabeled text data. A document collection is established as a reference answer set. A label, e.g., the URL of a Web page, is attached to each document. Members of the answer set are converted to vectors representing centroids of clusters of documents. Unlabeled text data are categorized relative to the centroids by a nearest neighbor algorithm. Then, a supervised machine learning algorithm is trained on the newly labeled data, and a categorization classifier (e.g., a rule based classifier) classifies the data for each cluster. Alternatively, a feature extraction algorithm may be run on classes generated by the step of categorizing, and search features output which index the unlabeled text data.
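The categorization step described above can be illustrated with a minimal sketch. It is not the patented implementation. It assumes bag-of-words term counts as the vector representation and cosine similarity as the nearest neighbor measure, neither of which is specified in the text. The URLs, the sample documents, and the names `vectorize`, `cosine`, and `label_by_nearest_centroid` are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (an assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def label_by_nearest_centroid(answer_set, unlabeled):
    """Assign each unlabeled text the label of its most similar centroid.

    answer_set maps a label (here, a URL) to the text of a reference
    document; each reference document serves as the centroid of one cluster.
    """
    centroids = {label: vectorize(text) for label, text in answer_set.items()}
    results = []
    for doc in unlabeled:
        vec = vectorize(doc)
        best = max(centroids, key=lambda label: cosine(vec, centroids[label]))
        results.append((doc, best))
    return results

# Hypothetical reference answer set: label (URL) -> document text.
answer_set = {
    "http://example.com/faq/password": "reset forgotten password account login",
    "http://example.com/faq/billing": "billing invoice payment refund questions",
}
unlabeled = [
    "I forgot my password and cannot log in",
    "please send a refund for my last invoice",
]
labels = label_by_nearest_centroid(answer_set, unlabeled)
for doc, label in labels:
    print(label, "<-", doc)
```

The resulting (text, label) pairs are the newly labeled data on which a supervised learner could then be trained, as the summary describes.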
Although the invention contemplates a fully automated process of categorizing unlabeled text data or extracting relevant features from the unlabeled text data for keyword search, human intervention may optionally be used to further refine the process. For example, the automated categorizations might be manually checked and updated by shifting documents from one cluster to another, after which the data would be re-categorized using a nearest neighbor algorithm. These steps would then be iterated until the process stabilizes or some iteration parameter is reached. Also, the document collection established as the reference answer set might be manually augmented and/or edited with additional information useful to the categorization process, e.g., synonyms of words occurring in the documents.
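The iterate-until-stable refinement can be sketched as follows. This is one possible reading, not the method as claimed. It assumes that each cluster centroid is the sum of its members' term-count vectors, that similarity is cosine, and that "some iteration parameter" is a maximum iteration count. The function `recategorize`, the cluster names, and the sample documents are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (an assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recategorize(docs, assignments, max_iters=10):
    """Re-categorize docs against cluster centroids until nothing changes.

    assignments holds the starting cluster label of each document, for
    example after a person has shifted some documents between clusters.
    max_iters plays the role of the iteration parameter (an assumption).
    """
    assignments = list(assignments)
    vectors = [vectorize(d) for d in docs]
    for _ in range(max_iters):
        # Rebuild each centroid as the sum of its members' vectors; cosine
        # similarity ignores scale, so no division by cluster size is needed.
        centroids = {}
        for vec, label in zip(vectors, assignments):
            centroids.setdefault(label, Counter()).update(vec)
        updated = [
            max(centroids, key=lambda label: cosine(vec, centroids[label]))
            for vec in vectors
        ]
        if updated == assignments:  # the process has stabilized
            break
        assignments = updated
    return assignments

docs = [
    "reset password login",
    "password login help",
    "refund invoice payment",
    "invoice payment overdue",
]
# Hypothetical starting state in which the last document sits in the wrong cluster.
start = ["account", "account", "billing", "account"]
final = recategorize(docs, start)
print(final)
```

In this toy data the misplaced fourth document moves to the billing cluster on the first pass, and the second pass changes nothing, so the loop stops.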
The method of this invention may use information from several disparate and separate sources, such as a Web site, a database of Frequently Asked Questions (FAQs), and/or databases of other document collections, as the reference answer set. Sets of related Universal Resource Locators (URLs) could also be used in the categorization process.


