Identifying a group of words using modified query words...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C704S009000, C704S010000

active

06430557

ABSTRACT:

FIELD OF THE INVENTION
The invention relates to identifying one of a number of groups of words.
BACKGROUND AND SUMMARY OF THE INVENTION
U.S. Pat. No. 5,551,049 discloses a technique in which information about one of a number of groups of words is determined by matching an identifier of a received word. The identifier is compared with word identifiers grouped in sequence to represent synonym groups. When a match is found, the group of identifiers that includes the matching identifier is used to obtain the synonym group that includes the received word.
Frakes, W. B., “Stemming Algorithms”, in Frakes, W. B., and Baeza-Yates, R., Eds., Information Retrieval, Prentice Hall, 1992, pp. 131-160, discloses a taxonomy for stemming algorithms that includes several automatic approaches. Affix removal algorithms remove suffixes and/or prefixes to obtain a stem, while table lookup methods perform lookup in a table in which terms and their corresponding stems are stored. Affix removal algorithms such as the Porter stemmer can, after removing characters according to a set of replacement rules, perform recoding, a context sensitive transformation, to change characters of the stem.
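The two approaches named above can be contrasted in a minimal sketch. The table entries, the suffix rules, and the single recoding step below are hypothetical toy data chosen for illustration; this is not the Porter stemmer and is not taken from the cited chapter.

```python
# Hypothetical toy data; neither the table nor the rules come from the cited chapter.
STEM_TABLE = {"running": "run", "ran": "run", "studies": "study"}

# (suffix, replacement) rules tried in order; the first applicable rule wins.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def table_lookup_stem(word):
    """Table lookup: return the stored stem, or None for an unknown word."""
    return STEM_TABLE.get(word)


def affix_removal_stem(word):
    """Affix removal: strip the first matching suffix, then recode the stem."""
    for suffix, replacement in SUFFIX_RULES:
        # Require a few characters to remain so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: len(word) - len(suffix)] + replacement
            # Simplified recoding: collapse a doubled final consonant
            # ("hopp" -> "hop"), except for l and s.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word
```

Table lookup fails on any word missing from the table, whereas the rule-based function returns some stem for every input, which is the trade-off the background discussion turns on.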
Hull, D. A., “Stemming Algorithms: A Case Study for Detailed Evaluation”, Journal of the American Society for Information Science, Vol. 47, No. 1, 1996, pp. 70-84, discloses a lexical database that can analyze and generate inflectional and derivational morphology. The inflectional database reduces each surface word to a dictionary form, while the derivational database reduces surface forms to stems that relate to the original in both form and semantics. The databases are constructed using finite state transducers (FSTs), which allows the conflation process to act in reverse, generating all conceivable surface forms from a single base form.
Salton, G., and McGill, M. J., Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983, pp. 75-84, disclose information retrieval techniques that use a thesaurus. A thesaurus can be used, for example, in an automatic indexing environment, and when a document contains a term such as “superconductivity” (or stem “superconduct”), that term may be replaced by a class identifier for a class of words with related meanings. The same operation can be used for a user query containing a word in the class. Should the document contain “superconductivity” while the query term is “cryogenic”, a term match would result through thesaurus transformation. Rather than replacing an initial term with the corresponding class identifier, the thesaurus class identifier can be added to the original term.
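The thesaurus transformation described above can be sketched as follows. The class identifiers ("C347", "C102") and the dictionary contents are hypothetical; only the “superconductivity”/“cryogenic” pairing comes from the text.

```python
# Hypothetical thesaurus mapping each term to a class identifier.
THESAURUS = {
    "superconductivity": "C347",
    "cryogenic": "C347",
    "retrieval": "C102",
}


def replace_with_class(terms):
    """Replace each term that has a thesaurus class with its class identifier."""
    return [THESAURUS.get(term, term) for term in terms]


def add_class(terms):
    """Keep each original term and add its class identifier after it."""
    expanded = []
    for term in terms:
        expanded.append(term)
        if term in THESAURUS:
            expanded.append(THESAURUS[term])
    return expanded


def terms_match(document_terms, query_terms):
    """True when document and query share a term after class replacement."""
    doc = set(replace_with_class(document_terms))
    query = set(replace_with_class(query_terms))
    return bool(doc & query)
```

With this data, a document containing “superconductivity” matches a query for “cryogenic” because both map to the same assumed class identifier.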
The invention addresses basic problems that arise in using a word, referred to herein as a “query word”, to identify one of a number of groups of words. With conventional table lookup techniques, FST techniques, and other techniques that rely on matching the query word, an unknown query word will result in a failure. Affix algorithms like the Porter stemmer can stem any word, including an unknown query word, but conventional information retrieval techniques based on affix algorithms do not provide a fallback strategy if the first obtained stem does not relate the unknown word to any other word.
The invention is based on the discovery of a new technique for using a query word to identify one of a number of word groups. The technique first determines whether the query word is in any of the groups. If not, the technique attempts to modify the query word in accordance with successive suffix relationships in a sequence until it obtains a modified query word that is in one of the groups. The new technique in effect provides two stages of word group lookup—the first stage can be implemented by rapidly comparing a query word with the words in the groups, while the second stage is slower, because it includes attempting to obtain modified query words in accordance with suffix relationships.
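A minimal sketch of the two-stage lookup follows. The word groups, the suffix relationships, and the function names are hypothetical; the summary does not fix a data representation, so plain sets and (query suffix, replacement suffix) pairs are assumed here.

```python
# Hypothetical word groups, represented as plain sets.
WORD_GROUPS = [
    {"compute", "computer", "computation"},
    {"retrieve", "retrieval"},
]

# Hypothetical sequence of pairwise suffix relationships:
# (suffix found on the query word, suffix that replaces it).
SUFFIX_RELATIONSHIPS = [("s", ""), ("ing", ""), ("ing", "e"), ("ed", "e")]


def find_group(word):
    """Return the group containing the word, or None."""
    for group in WORD_GROUPS:
        if word in group:
            return group
    return None


def identify_group(query):
    """Return (matched word, group), or (None, None) when nothing matches."""
    # Stage 1: fast check of the unmodified query word.
    group = find_group(query)
    if group is not None:
        return query, group
    # Stage 2: try each suffix relationship in sequence until one produces
    # a modified query word that is in a group.
    for old_suffix, new_suffix in SUFFIX_RELATIONSHIPS:
        if not query.endswith(old_suffix):
            continue  # relationship does not apply; no modified word obtained
        modified = query[: len(query) - len(old_suffix)] + new_suffix
        group = find_group(modified)
        if group is not None:
            return modified, group
    return None, None
```

For the query word “computing”, the first applicable relationship yields “comput”, which is in no group, so the loop continues and the next relationship yields “compute”, which is. This is the fallback behavior that a single-pass affix removal stemmer lacks.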
The new technique can be implemented with an ordered list defining the sequence of suffix relationships, such as pairwise relationships. An attempt can be made to modify the query word in accordance with each relationship in the list until a modified query word is obtained that is in one of the groups. The relationships in the list can be ordered in accordance with their frequency of occurrence in a natural language, and modifications can be attempted beginning with the highest frequency suffix relationship. The ordered list can be automatically obtained as part of an automatic technique for producing the word groups.
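One way such an ordered list might be derived is sketched below, under the assumption that relationship frequency is estimated by counting pairwise suffix differences between words that already share a group. The counting scheme and the function names are illustrative assumptions, not the procedure the patent specifies.

```python
from collections import Counter
from itertools import permutations


def suffix_pair(first, second):
    """Strip the longest common prefix and return the two remaining suffixes."""
    i = 0
    while i < min(len(first), len(second)) and first[i] == second[i]:
        i += 1
    return first[i:], second[i:]


def ordered_suffix_relationships(groups):
    """List pairwise suffix relationships, most frequent first."""
    counts = Counter()
    for group in groups:
        # Every ordered pair of distinct words in a group contributes one count.
        for first, second in permutations(sorted(group), 2):
            counts[suffix_pair(first, second)] += 1
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in ranked]
```

Given the groups {walk, walks}, {talk, talks}, and {hope, hoping}, the relationship between an empty suffix and “s” occurs twice in each direction and is therefore ranked ahead of the “e”/“ing” relationship, which occurs once.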
The new technique can be implemented to attempt modifications of the query word iteratively, with each iteration attempting to modify the query word in accordance with a respective suffix relationship. If an iteration obtains a modified query word, it also determines whether the modified query word is in any of the word groups.
When a modified query word is obtained that is in one of the word groups, information identifying the word group can be provided, such as a representative of the group or a list of words in the group.
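The two forms of identifying information mentioned above can be sketched as follows. Choosing the shortest word as the representative is an assumption made for this illustration; the text does not say how a representative is selected.

```python
def describe_group(group, as_list=False):
    """Return a representative of the group, or a sorted list of its words."""
    if as_list:
        return sorted(group)
    # Assumed convention: the shortest word, ties broken alphabetically.
    return min(group, key=lambda word: (len(word), word))
```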
The new technique can further be implemented in a system that includes a query word and a processor that determines whether the query word is in any of the word groups. If not, the processor attempts to modify the query word in accordance with successive suffix relationships in a sequence until a modified query word is obtained that is in one of the word groups. The system can also include stored word group data indicating the word groups, and the word group data can be an FST data structure. The system can also include stored suffix relationship sequence data indicating the sequence of suffix relationships.
The new technique can also be implemented in an article of manufacture for use in a system that includes a storage medium access device. The article can include a storage medium and instruction data stored by the storage medium. The system's processor, in executing the instructions indicated by the instruction data, determines whether the query word is in any of the word groups. If not, the processor attempts to modify the query word in accordance with successive suffix relationships in a sequence until a modified query word is obtained that is in one of the word groups.
The new technique can also be implemented in a method of operating a first machine to transfer data to a second machine over a network, with the transferred data including instruction data as described above.
The new technique is advantageous because it allows more robust word group identification than conventional techniques. In comparison with conventional word matching techniques, the new technique is advantageous because it allows the use of an unknown word or a word that does not exactly match, which can arise, for example, when a user provides a query word that is not in any of the word groups. In comparison with conventional affix removal techniques, the new technique can continue to obtain modified query words until it finds one that is in one of the word groups, so that it does not fail if the first modified query word is not in any of the word groups.
The new technique is also advantageous because it allows for fully automatic implementation. One fully automatic implementation automatically obtains an ordered list of suffix relationships in automatically producing a word group data structure.
The following description, the drawings, and the claims further set forth these and other aspects, objects, features, and advantages of the invention.


REFERENCES:
patent: 4799188 (1989-01-01), Yoshimura
patent: 4864501 (1989-09-01), Kucera et al.
patent: 5488725 (1996-01-01), Turtle et al.
patent: 5551049 (1996-08-01), Kaplan et al.
patent: 5594641 (1997-01-01), Kaplan et al.
patent: 5625554 (1997-04-01), Cutting et al.
patent: 5696962 (1997-12-01), Kupiec
patent: 5940624 (1999-08-01), Kadashevich et al.
patent: 5963940 (1999-10-01), Liddy et al.
patent: 6012053 (2000-01-01), Pant et al.
patent: 6081774 (2000-06-01), de Hita et al.
patent: 6092065 (2000-07-01), Floratos et al.
patent: 6101492 (2000-08-01), Ja
