Method and system for automatic comparison of text...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06397215

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to a system and method for automatic generation of a comparison list given two different classifications.
BACKGROUND OF THE INVENTION
Document classification, or grouping of documents, provides a means for a reader to quickly locate a set of similar documents that are most relevant to the reader's needs. In the past, such classifications were generated manually by a human expert or automatically via a computer program that compares the text of different documents based on frequency of word occurrence. Examples of electronic document classifications include folders of email messages, categorizations of help desk problem tickets, and logical groupings of research abstracts by subject.
A problem arises when comparing two different classifications in a domain of similar or identical documents. Different classifications may arise either because of a change in the method for generating a classification (e.g. human expert vs. automatic) or because the underlying set of documents being classified has changed (e.g. additional documents being authored over time). A comparison consists of a list in which each of the classes contained in one classification is matched with the single most similar class in a second classification. Past approaches to this problem have focussed primarily on comparing classifications on the same document set where the primary goal has been to find out which classification was better or more complete. A need arises for a technique which will provide automatic generation of such a list given two different classifications, and automatic sorting of the list in order of similarity.
SUMMARY OF THE INVENTION
The present invention is a system and method for automatic generation of a comparison list given two different classifications, and automatic sorting of the list in order of similarity. The two classifications may be over the same set of documents or two different (but somewhat similar) sets of documents. The approach of this invention is more flexible than past approaches, since it can apply to classifications on different document sets. The present invention does not discover which classification is “better”, but rather discovers the key similarities and differences between classifications.
In order to perform the method of the present invention, a first dictionary is generated including a subset of words contained in a first document set, the first document set including at least one document and having an associated first classification including at least one class, each class having a class name. A second dictionary is generated including a subset of words contained in a second document set, the second document set including at least one document and having an associated second classification including at least one class, each class having a class name. A common dictionary including words that are common to both the first dictionary and the second dictionary is generated. A count of occurrences of each word in the common dictionary within each document in each document set is generated. A centroid of each class in the space of the common dictionary is generated. A nearest centroid in the second classification for each centroid in the first classification is determined. A list is generated including class names of each class in the first classification and a class name of a corresponding nearest class in the second classification and the class names in the first classification are sorted based on a distance from a nearest centroid in the second classification.
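The first steps of this summary, building one dictionary per document set and intersecting them, might be sketched as follows. This is only an illustration under stated assumptions: the function names `build_dictionary` and `common_dictionary`, the lowercase alphabetic tokenizer, and the rule of keeping the `max_words` most frequent words are all hypothetical choices, since the summary says only that each dictionary is "a subset of words" in its document set.

```python
import re
from collections import Counter


def build_dictionary(documents, max_words=1000):
    """Return a set holding the most frequent words of a document set.

    Assumption: the "subset of words" is chosen by frequency; the
    source text does not state the selection rule.
    """
    counts = Counter()
    for text in documents:
        # Assumed tokenizer: runs of lowercase ASCII letters.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return {word for word, _ in counts.most_common(max_words)}


def common_dictionary(first_dictionary, second_dictionary):
    """Return the words present in both dictionaries, sorted for stable order."""
    return sorted(first_dictionary & second_dictionary)


# Hypothetical example document sets.
first_docs = ["printer will not print", "printer out of paper"]
second_docs = ["cannot print to network printer", "password reset request"]

common = common_dictionary(build_dictionary(first_docs),
                           build_dictionary(second_docs))
print(common)  # ['print', 'printer']
```

Sorting the common dictionary gives every word a fixed position, which is convenient when the words later serve as matrix columns.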
According to one aspect of the present invention, the count of occurrences is generated by generating a matrix having rows and columns, each column corresponding to a word in the common dictionary, each row corresponding to a document, and each entry representing a number of occurrences of the corresponding word in the corresponding document.
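A minimal sketch of such a count matrix is given below, assuming one row per document and one column per common-dictionary word. The name `count_matrix` and the tokenizer are illustrative assumptions, not part of the source.

```python
import re


def count_matrix(documents, dictionary_words):
    """Build a document-by-word matrix of occurrence counts.

    Row i corresponds to documents[i]; column j to dictionary_words[j].
    """
    column = {word: j for j, word in enumerate(dictionary_words)}
    matrix = []
    for text in documents:
        row = [0] * len(dictionary_words)
        # Assumed tokenizer: runs of lowercase ASCII letters.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in column:
                row[column[token]] += 1
        matrix.append(row)
    return matrix


# Hypothetical common dictionary and documents.
words = ["paper", "print", "printer"]
docs = ["Printer will not print", "printer out of paper, add paper"]

counts = count_matrix(docs, words)
print(counts)  # [[0, 1, 1], [2, 0, 1]]
```

Words outside the common dictionary are simply ignored, so both document sets end up described in the same word space.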
According to another aspect of the present invention, the centroid of each class is generated by generating a vector having a plurality of entries, each entry corresponding to a word in the common dictionary and having a value equal to an average, taken over the documents in the class, of the values of the entries in the matrix corresponding to that word.
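The centroid computation can be illustrated with a short sketch. It assumes the average is taken over the documents belonging to each class, with one class label per matrix row; the name `class_centroids` and the sample labels are hypothetical.

```python
from collections import defaultdict


def class_centroids(matrix, labels):
    """Average the count rows of each class, column by column.

    matrix: list of per-document count rows (all the same length).
    labels: class name of each document, aligned with the rows.
    """
    grouped = defaultdict(list)
    for row, label in zip(matrix, labels):
        grouped[label].append(row)
    centroids = {}
    for label, rows in grouped.items():
        # zip(*rows) walks the columns, i.e. one dictionary word at a time.
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


# Hypothetical counts for three documents over three dictionary words.
matrix = [[0, 1, 1], [2, 0, 1], [0, 0, 4]]
labels = ["printing", "printing", "network"]

centroids = class_centroids(matrix, labels)
print(centroids["printing"])  # [1.0, 0.5, 1.0]
print(centroids["network"])   # [0.0, 0.0, 4.0]
```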
According to another aspect of the present invention, the nearest centroid in the second classification for each centroid in the first classification is determined by, for each centroid in the first classification, determining a distance between the centroid in the first classification and each centroid in the second classification; and selecting a centroid in the second classification having a least distance from the centroid in the first classification.
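The nearest-centroid selection, together with the sorted comparison list described in the summary, might look like the sketch below. The distance is passed in as a function; the `squared_euclidean` helper is only a stand-in for the example and is not the distance the source describes. Sorting in ascending order of distance (most similar pairs first) is also an assumption, as the text says only that the list is sorted based on distance.

```python
def squared_euclidean(x, y):
    """Stand-in distance for this example only (an assumption)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def comparison_list(first_centroids, second_centroids, distance):
    """Match each first-classification class to its nearest second-classification class.

    Returns (first class name, nearest second class name, distance)
    tuples, sorted by ascending distance.
    """
    rows = []
    for name, centroid in first_centroids.items():
        nearest = min(
            second_centroids,
            key=lambda other: distance(centroid, second_centroids[other]),
        )
        rows.append((name, nearest, distance(centroid, second_centroids[nearest])))
    rows.sort(key=lambda row: row[2])
    return rows


# Hypothetical centroids in a two-word common dictionary.
first = {"login": [0.0, 2.0], "printing": [1.0, 0.5]}
second = {"print issues": [1.0, 1.0], "accounts": [0.0, 3.0]}

result = comparison_list(first, second, squared_euclidean)
print(result)
# [('printing', 'print issues', 0.25), ('login', 'accounts', 1.0)]
```

Several classes in the first classification may map to the same class in the second, since each is matched independently.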
According to another aspect of the present invention, the distance between centroids is determined using a distance function of:
$$d(X, Y) = -\frac{X \cdot Y}{\|X\| \cdot \|Y\|},$$
wherein X is the centroid in the first classification, Y is the centroid in the second classification, and d(X,Y) is the distance between the centroid in the first classification and centroid in the second classification.
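Reading the distance function as the negative of the cosine similarity, d(X, Y) = -(X · Y) / (‖X‖ · ‖Y‖), a small sketch can make its behaviour concrete. The name `centroid_distance` is hypothetical, and returning 0.0 for an all-zero centroid is an assumption, since the formula is undefined in that case.

```python
import math


def centroid_distance(x, y):
    """Negative cosine similarity between two centroid vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: treat an empty centroid as unrelated to everything.
        return 0.0
    return -dot / (norm_x * norm_y)


same_direction = centroid_distance([3.0, 4.0], [6.0, 8.0])
no_shared_words = centroid_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction)   # close to -1.0
print(no_shared_words)  # zero
```

Because word counts are non-negative, this distance ranges from -1 (centroids pointing in the same direction) to 0 (no words in common), so smaller values indicate more similar classes. It depends only on direction, not vector length, so a class of long documents and a class of short documents with the same word proportions are treated as identical.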

