Automatic text classification system

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06556987

ABSTRACT:

The present invention relates to an automatic text classification system, and more specifically to a system for automatically classifying texts in terms of each of a plurality of qualities in a manner such that the classified texts can be automatically retrieved based on a specified one or more of the plurality of qualities. The invention also relates to a retrieval system using the plurality of qualities.
BACKGROUND OF THE INVENTION
A variety of methods are known for automatically classifying and/or analyzing text, including keyword searching, collaborative filtering, and natural language parsing.
Keyword searching methods operate by looking for one or more keywords in a text and then classifying the text based on the occurrence (or non-occurrence) of the keywords. Keyword searching methods, however, suffer from the drawbacks that the main concept of a given text may be unrelated to the keywords being searched, and/or that a particularly relevant text may not contain the keywords being searched.
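The two drawbacks of keyword searching can be illustrated with a minimal sketch. The function name `keyword_classify` and both sample sentences are hypothetical and chosen only for illustration; they do not come from the patent.

```python
import re

def keyword_classify(text, keywords):
    """Return True if any keyword occurs as a whole word in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return any(k.lower() in words for k in keywords)

# Hypothetical sample texts (assumptions, not from the source).
relevant_without_keyword = "A heart-warming story of two strangers who fall for each other."
irrelevant_with_keyword = "The cinema sells romance novels in the lobby."

# A relevant text is missed because it never uses the keyword ...
missed = not keyword_classify(relevant_without_keyword, ["romance"])
# ... while an unrelated text matches because it happens to contain it.
false_hit = keyword_classify(irrelevant_with_keyword, ["romance"])
```

Both `missed` and `false_hit` evaluate to true here: the classifier reacts only to the presence of the word, not to what the text is about.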
Collaborative filtering methods work by attempting to make recommendations and/or classifications based on matching overlapping results. For example, if a collaborative filtering system were used to analyze a series of questionnaires asking people to name their favourite musicians, the system would analyze the questionnaires by looking for an overlap in one or more of the musicians named in respective questionnaires. If an overlap were found between two questionnaires, the other musicians named by the author of the first questionnaire would be recommended to the author of the second questionnaire, and vice versa. The drawback of collaborative filtering, however, is that it assumes that people's tastes that are similar in one respect are also similar in other respects. That is, collaborative filtering methods fail to take into account the underlying qualities that define people's tastes.
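The questionnaire example can be sketched as follows. This is a simplified illustration under assumed names (`recommend`, the respondent labels and the artist placeholders are all hypothetical), not a description of any particular collaborative filtering product.

```python
def recommend(questionnaires, person):
    """Recommend musicians named by anyone whose answers overlap with person's."""
    own = questionnaires[person]
    recommendations = set()
    for other, named in questionnaires.items():
        # An overlap is any musician named on both questionnaires.
        if other != person and own & named:
            recommendations |= named - own
    return recommendations

# Hypothetical questionnaire answers (placeholders, not real data).
questionnaires = {
    "first": {"Artist A", "Artist B", "Artist C"},
    "second": {"Artist A", "Artist D"},
    "third": {"Artist E"},
}

for_second = recommend(questionnaires, "second")
for_first = recommend(questionnaires, "first")
```

Because the first and second respondents share one musician, each receives the other's remaining picks. Nothing in the sketch asks why either respondent likes that shared musician, which is the drawback the paragraph above describes.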
Natural language parsing methods operate by performing semantic or lexical analysis based on rules of grammar and lexicons. To date, however, computers have been unable to fully understand natural language, and known natural language parsing methods too often misinterpret the actual meaning of text.
The above described drawbacks of keyword searching, collaborative filtering, and natural language parsing have created a need for more accurate and more meaningful text classification methods.
Recently, a company called Autonomy, Inc. has developed technology that is capable of analyzing text and identifying and ranking main ideas. As disclosed in the “Autonomy Technology Whitepaper” (available at www.autonomy.com), Autonomy's technology can analyze text and identify key concepts based on a statistical probability analysis of the frequency and relationships of terms in the text that give the text meaning. Once the key concepts have been extracted from a text, “Concept Agents” are created to seek out similar ideas in any other texts such as websites, news feeds, email archives or other documents. In addition, the “Autonomy Technology Whitepaper” discloses that the “Concept Agents” can be used to create specific user profiles based on an analysis of the texts that a particular user reads, or that the “Concept Agents” can be used to make users aware of others with similar interests. Still further, the “Autonomy Technology Whitepaper” discloses that the “Concept Agents” can be used to automatically sort documents into predefined categories.
Indeed, by identifying key concepts based on a statistical probability analysis of the frequency and relationships of terms in a text that give the text meaning, Autonomy's technology represents a significant advance over other known text searching techniques. However, by focusing on key concepts or “Concept Agents”, Autonomy's technology fails to identify the underlying qualities of the subject matter described in the text.
For example, if Autonomy's technology were used to analyze a textual film synopsis, the extracted key concept would be films, and the film might even be classified into a predefined category such as comedy, romance, action/adventure or science fiction. However, Autonomy's technology would fail to identify whether the text relates to, for example, a happy or sad film, a funny or serious film, a beautiful or repulsive film, a tame or sexy film, and/or a weird or conventional film. In this connection, it is pointed out that a romantic film, for example, can be happy or sad, funny or serious, beautiful or repulsive, tame or sexy, and weird or conventional. Accordingly, if a user were to access a database of textual film synopses classified using Autonomy's technology, the user would only be able to search for a desired film within the static, predefined categories into which the films were classified. Thus, if a user wanted to find a film that is at once happy, funny, repulsive, sexy and weird, Autonomy's technology would be of little help.
OBJECTS OF THE INVENTION
It is an object of the present invention to provide a system for automatically classifying texts in terms of each of a plurality of qualities that are determined based on a statistical probability analysis of the frequency and relationships of words in the text.
It is also an object of the present invention to provide a system for automatically classifying texts in a manner such that the classified texts can be automatically retrieved using a “fuzzy logic” retrieval system capable of identifying a best match based on a specified one or more of a plurality of qualities.
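One way to picture best-match retrieval over a specified subset of qualities is the sketch below. It is only an assumption-laden illustration: the text here does not describe the retrieval algorithm, so a plain squared-distance ranking is swapped in, scores are assumed to lie between -1 and +1 with +1 meaning the first-named end of each axis, and the names `best_match`, `classified` and the film titles are hypothetical.

```python
def best_match(classified, wanted):
    """Return the title whose scores lie closest to the wanted qualities.

    Only the qualities named in `wanted` are compared, so a query may
    specify one quality or several.
    """
    def distance(scores):
        return sum((scores[q] - target) ** 2 for q, target in wanted.items())

    return min(classified, key=lambda title: distance(classified[title]))

# Hypothetical classified texts; scores are assumed to run from -1 to +1,
# where +1 is the first-named end of the axis (e.g. "happy" in happy_sad).
classified = {
    "Film X": {"happy_sad": 0.8, "funny_serious": 0.6, "weird_conventional": -0.5},
    "Film Y": {"happy_sad": -0.7, "funny_serious": -0.4, "weird_conventional": 0.9},
    "Film Z": {"happy_sad": 0.5, "funny_serious": 0.9, "weird_conventional": 0.7},
}

# Query on three qualities at once: happy, funny and weird.
all_three = best_match(classified, {"happy_sad": 1, "funny_serious": 1, "weird_conventional": 1})
# Query on a single quality: sad.
sad_only = best_match(classified, {"happy_sad": -1})
```

The point of the sketch is that no text has to match exactly: the closest candidate is returned even when none sits at the requested end of every axis.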
SUMMARY OF THE INVENTION
An automatic text classification system is provided which extracts words and word sequences from a text or texts to be analysed. The extracted words and word sequences are compared with training data comprising words and word sequences together with a measure of probability with respect to each of a plurality of qualities. Each of the plurality of qualities may be represented by an axis whose two end points correspond to mutually exclusive characteristics. Based on the comparison, the texts to be analysed are then classified in terms of the plurality of qualities. In addition, a fuzzy logic retrieval system and a system for generating the training data are provided.
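The classification step summarized above might be sketched as follows. This is a minimal illustration, not the patented method: the training entries are invented, word sequences are limited to adjacent pairs, each probability is assumed to express how strongly a term points to the first-named end of an axis, and the averaging rule is a stand-in for whatever statistical analysis the full specification uses. The names `extract_terms`, `classify` and `training` are hypothetical.

```python
import re

def extract_terms(text):
    """Extract single words and adjacent two-word sequences from a text."""
    words = re.findall(r"[a-z']+", text.lower())
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + pairs

def classify(text, training):
    """Score a text on each quality axis, from -1 to +1.

    Each matched term contributes 2*p - 1, so p = 1.0 maps to the first
    end of the axis, p = 0.0 to the opposite end, and p = 0.5 is neutral.
    The score for an axis is the mean contribution of its matched terms.
    """
    totals, counts = {}, {}
    for term in extract_terms(text):
        for quality, p in training.get(term, {}).items():
            totals[quality] = totals.get(quality, 0.0) + (2 * p - 1)
            counts[quality] = counts.get(quality, 0) + 1
    return {quality: totals[quality] / counts[quality] for quality in totals}

# Invented training data (assumption): term -> {axis: probability of first end}.
training = {
    "hilarious": {"funny_serious": 0.9},
    "tragic": {"happy_sad": 0.1},
    "feel good": {"happy_sad": 0.9},
}

scores = classify("A hilarious feel good comedy", training)
```

For this sample, `scores` contains positive values on both the `funny_serious` and `happy_sad` axes, since "hilarious" and the word sequence "feel good" each match a training entry that leans toward the first end of its axis. A text with no matching terms yields an empty result.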


REFERENCES:
patent: 5325298 (1994-06-01), Gallant
patent: 5781879 (1998-07-01), Arnold et al.
patent: 5905980 (1999-05-01), Masuichi et al.
patent: 5933822 (1999-08-01), Braden-Harder et al.
patent: 5941944 (1999-08-01), Messerly
patent: 5974412 (1999-10-01), Hazlehurst et al.
patent: 6125362 (2000-09-01), Elworthy
patent: 6161130 (2000-12-01), Horvitz et al.
patent: 6192360 (2001-02-01), Dumais et al.
patent: 6253169 (2001-06-01), Apte et al.
patent: 6289353 (2001-09-01), Hazlehurst et al.
patent: 6389436 (2002-05-01), Chakrabarti et al.
patent: 6418432 (2002-07-01), Cohen et al.
“Autonomy Technology Whitepaper”: Autonomy—Knowledge Management and New Media Content Solutions, Apr. 6, 2000, pp. 1-11.
“Autonomy Unveils New Platform for Applications Using Unstructured Data”: Autonomy—Knowledge Management and New Media Content Solutions, Apr. 6, 2000, pp. 1-6.
