Reexamination Certificate
1998-11-25
2001-07-03
Black, Thomas (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C704S002000
active
06256629
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to word sense disambiguation techniques, and more particularly, to a method and apparatus for identifying ambiguous words and for measuring their degree of ambiguity.
BACKGROUND OF THE INVENTION
Many words have multiple senses. Such words are often referred to as “polysemous words.” For example, the word “bass” has two main senses, namely, a type of fish and a musical range. Word sense disambiguation techniques assign sense labels to each instance of an ambiguous word. Information retrieval systems, for example, are plagued by the ambiguity of language. Searches on the word “crane” will retrieve documents about birds, as well as documents about construction equipment. The user, however, is typically interested in only one sense of the word. Generally, the user must review the various documents returned by the information retrieval system to determine which returned documents are likely to be of interest. Of course, the user could narrow the search using Boolean expressions or fuller phrases, such as a search in the form: “crane NEAR (whooping OR bird OR lakes . . . ).” The user, however, risks missing some examples that do not happen to match the required elements in the Boolean expression.
Word sense disambiguation is important in other applications as well, such as in a text-to-speech converter where a different sense may involve a pronunciation difference. Generally, word sense disambiguation techniques presume that a set of polysemous terms is known beforehand. Of course, published lists of polysemous terms invariably provide only partial coverage. For example, the English word “tan” has several obvious senses. A published list of polysemous terms, however, may not include the abbreviation for “tangent.” One such published list of polysemous terms is WordNet, an on-line lexical reference system developed by the Cognitive Science Laboratory at Princeton University in Princeton, N.J. See, for example, http://www.cogsci.princeton.edu/~wn/.
Hinrich Schütze has proposed a word sense disambiguation technique that identifies multiple senses of a target word by computing similarities among words that co-occur with the target word in a given corpus. Generally, the Schütze word sense disambiguation technique uses a corpus to compute vectors of word counts for each extant sense of a known ambiguous word. The result is a set of vectors, one for each sense, that can be used to classify a new instance. For a more detailed discussion of the Schütze word sense disambiguation technique, see H. Schütze, “Automatic Word Sense Discrimination,” Computational Linguistics, V. 24, No. 1, 97-123 (1998).
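The idea of classifying a new instance against one count vector per sense can be illustrated with a minimal sketch. This is not the Schütze technique itself, which the cited paper describes in full; it only shows nearest-vector classification by cosine similarity. The function names and the toy sense vectors for “bass” are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(context_words, sense_vectors):
    """Return the sense whose count vector is most similar to the context."""
    context = Counter(context_words)
    return max(sense_vectors, key=lambda s: cosine(context, sense_vectors[s]))

# Hypothetical per-sense count vectors for the target word "bass".
sense_vectors = {
    "fish": Counter({"lake": 3, "caught": 2, "fishing": 4}),
    "music": Counter({"guitar": 4, "played": 2, "band": 3}),
}

label = classify("he played bass guitar in a band".split(), sense_vectors)
```

In this toy setting, a context sharing no words with a sense vector scores zero against it, so the label is driven entirely by the overlapping context words.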
While the Schütze technique provides an effective tool for identifying ambiguous words, it does not quantify how ambiguous a given word is in a given corpus. The utility of such word sense disambiguation techniques could be further extended if they could rank words by their degree of ambiguity. Thus, a need exists for a word ambiguity detection tool that identifies ambiguous words and quantifies their degree of ambiguity.
SUMMARY OF THE INVENTION
Generally, a method and apparatus are disclosed for identifying polysemous terms and for measuring their degree of polysemy, given an unlabeled corpus. A polysemy index provides a quantitative measure of how polysemous a word is, in a given corpus. Thus, the present invention can rank a list of words by their polysemy indices, with the most polysemous words appearing at the top of the list.
According to one aspect of the invention, a polysemy evaluation process initially collects a set of terms within a certain window of a target term. Thereafter, the inter-term distances of the set of terms occurring near the target term are computed. The multi-dimensional distance space is reduced to two dimensions using well-known dimension reduction techniques. The two-dimensional representation is converted into radial coordinates. Isotonic/antitonic regression techniques are used to compute the degree to which the distribution deviates from unimodality. The amount of deviation is the polysemy index.
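The final steps of this process can be sketched as follows, assuming the window collection, distance computation, and dimension reduction have already produced a list of two-dimensional points. The sketch bins the points by angle, fits the best unimodal curve by combining an isotonic (non-decreasing) fit on a prefix with an antitonic (non-increasing) fit on the remaining suffix, and reports the residual sum of squares. The function names, the choice of 12 bins, the use of squared error as the deviation measure, and the treatment of the angular bins as a linear rather than circular sequence are all assumptions made for illustration; the source does not specify them.

```python
import math

def pava(values):
    """Least-squares non-decreasing fit (pool-adjacent-violators)."""
    blocks = []  # each block holds [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge while the previous block's mean exceeds the newest block's mean.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

def antitonic(values):
    """Least-squares non-increasing fit, obtained by reversing the input."""
    return pava(values[::-1])[::-1]

def unimodal_deviation(hist):
    """Smallest squared error of any rise-then-fall fit to the histogram."""
    best = math.inf
    for k in range(len(hist) + 1):
        fit = pava(hist[:k]) + antitonic(hist[k:])
        best = min(best, sum((h - f) ** 2 for h, f in zip(hist, fit)))
    return best

def angle_histogram(points, n_bins=12):
    """Count 2-D points per angular bin (assumed bin count: 12)."""
    hist = [0] * n_bins
    for x, y in points:
        theta = math.atan2(y, x) % (2 * math.pi)
        hist[min(int(theta / (2 * math.pi) * n_bins), n_bins - 1)] += 1
    return hist

def polysemy_index(points, n_bins=12):
    """Assumed index: deviation of the angular distribution from unimodality."""
    return unimodal_deviation(angle_histogram(points, n_bins))
```

Under these assumptions, points concentrated around a single direction give an index of zero, while points concentrated around two well-separated directions give a positive index. Because the bins are treated linearly, a single cluster straddling the zero-angle boundary would look bimodal; a fuller treatment would need to handle the wrap-around.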
According to another aspect of the invention, once the ambiguity of a word has been quantitatively measured, a corpus can be preprocessed to identify words having clearly separated senses. Thereafter, if a user of an information retrieval system enters a query on a word having clearly separated senses, the information retrieval system can return a separate list of documents for each sense of the word. The user can then select the desired sense and focus on one of the returned lists of documents.
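A minimal sketch of this retrieval behavior is given below. It assumes that polysemy indices have already been computed for the corpus vocabulary and that some sense labeler is available; the threshold value, the `sense_of` callback, and all function names are hypothetical and are not taken from the source.

```python
from collections import defaultdict

def clearly_separated(polysemy_indices, threshold):
    """Words whose index meets an assumed threshold for separated senses."""
    return {w for w, idx in polysemy_indices.items() if idx >= threshold}

def search(query, documents, separated_words, sense_of):
    """Return one document list per sense when the query word is ambiguous."""
    hits = [d for d in documents if query in d.lower().split()]
    if query not in separated_words:
        return {"all": hits}
    by_sense = defaultdict(list)
    for doc in hits:
        # sense_of is a hypothetical labeler: (query, document) -> sense name.
        by_sense[sense_of(query, doc)].append(doc)
    return dict(by_sense)
```

A caller would supply its own `sense_of` function, for example a classifier trained on the seed contexts for each sense, and present each returned list under its sense label.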
According to yet another aspect of the invention, a self-organizing sense disambiguation technique selects canonical contexts for the various senses identified for a given word. Contexts are selected containing terms falling in radial bins near each peak. Such contexts can then be used for subsequent training of a classifier. Thus, the present invention identifies words associated with the points that fall in bins near each sense-related peak in the distribution. Sentences containing the target word and one or more highly related words are good seed material for the self-organizing method.
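The selection of seed contexts can be sketched as follows, assuming each co-occurring word has already been assigned to an angular bin and a histogram of bin counts is available. The peak test, the one-bin neighborhood radius, the function names, and the toy data in the usage lines are assumptions made for illustration only.

```python
def peak_bins(hist):
    """Indices of local maxima; a plateau is reported once, at its left edge."""
    peaks = []
    for i, count in enumerate(hist):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i + 1 < len(hist) else 0
        if count > left and count >= right:
            peaks.append(i)
    return peaks

def words_near_peaks(word_bins, hist, radius=1):
    """Map each peak bin to the words whose bin lies within `radius` of it."""
    return {p: sorted(w for w, b in word_bins.items() if abs(b - p) <= radius)
            for p in peak_bins(hist)}

def seed_sentences(sentences, target, related):
    """Sentences containing the target word and at least one related word."""
    seeds = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if target in tokens and any(w in tokens for w in related):
            seeds.append(sentence)
    return seeds

# Hypothetical bin assignments for words co-occurring with "bass".
word_bins = {"fish": 2, "lake": 2, "river": 3, "guitar": 8, "band": 8, "drum": 9}
hist = [0, 0, 2, 1, 0, 0, 0, 0, 2, 1, 0, 0]
groups = words_near_peaks(word_bins, hist)
```

Each group of words can then be passed to `seed_sentences` together with the target word to collect candidate training sentences for one sense.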
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.
REFERENCES:
patent: 4839853 (1989-06-01), Deerwester et al.
patent: 5301109 (1994-04-01), Landauer et al.
patent: 5541836 (1996-07-01), Church et al.
patent: 5659766 (1997-08-01), Saund et al.
patent: 5778362 (1998-07-01), Deerwester
patent: 5828999 (1998-10-01), Bellegarda et al.
patent: 5950189 (1999-09-01), Cohen et al.
patent: 5987446 (1999-11-01), Corey et al.
patent: 6070134 (2000-05-01), Richardson et al.
patent: 6078878 (2000-06-01), Dolan
patent: 6098033 (2000-08-01), Richardson et al.
Bellegarda et al., A Novel Word Clustering Algorithm Based on Latent Semantic Analysis; Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings 1996 IEEE International Conference on, vol. 1, 1996 pp. 172-175 vol. 1.*
La Cascia et al., Combining Textual and Visual Cues for Content-based Image Retrieval on the World Wide Web; Content-Based Access of Image and Video Libraries, 1998. Proceedings. IEEE on, 1998. pp. 24-28.*
Maletic et al., Automatic Software Clustering via Latent Semantic Analysis; Automated Software Engineering, 1999. 14th IEEE International Conference pp. 251-254.*
Kurimo, Fast Latent Semantic Indexing of Spoken Documents by Using Self-organizing Maps; Acoustics, Speech, and Signal Processing, 2000. ICASSP '00. Proceedings. 2000 International Conference on, vol. 6, 2000 pp. 2425-2428.*
Hinrich Schütze, Dimensions of Meaning, Proc. of Supercomputing '92 (1992).
Hinrich Schütze, Automatic Word Sense Discrimination, Computational Linguistics, vol. 24 No. 1, 97-123 (1998).
Sproat Richard William
VanSanten Jan Pieter
Black Thomas
Lucent Technologies - Inc.
Ryan & Mason & Lewis, LLP
Wang Mary