Method and system for attaching information to words of a trie

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000, C715S252000, C715S252000, C704S009000, C704S010000

Reexamination Certificate

active

06675169

ABSTRACT:

FIELD OF THE INVENTION
The invention relates generally to computer systems, and more particularly to an improved method and system for storing lexical data and attaching information thereto.
BACKGROUND OF THE INVENTION
A trie is a data structure that is useful for compressing lexical data such as a list of dictionary words. Tries are composed of states, with a top-level state representing, for example, each of the first letters (e.g., a-z) of all valid words in a given dictionary. Each state is comprised of nodes, wherein each node represents a valid letter in that state, along with some information about that letter, such as a pointer to a lower state (if any). Each state represents a transition from one character in a word to the next. For example, the letter “q” in one state usually transitions to the letter “u” in a next lower state.
To use the trie, such as to find if a user-input word is a valid word in the dictionary, a search through the states is performed. For example, to find the word “the,” the top-level state in the trie is searched until the “t” node is found, and then a next lower level state pointed to by the “t” node is searched to determine if there is an “h” node therein. If not, the word “the” would not be a valid word in that dictionary. However, if there is an “h” node in the state pointed to by the “t” node, the “h” node is examined to find a next state, if any. The state pointed to by the “h” node is then searched to find out whether there is an “e” node therein. If there is an “e” node, to be a valid word, the “e” node needs to be followed by some indication (e.g., a flag) indicating that a valid word exists at this time, regardless of whether the “e” node points to a further state. In a trie-structured dictionary that properly represents a list of words in the English language, “the” would be a valid word, and thus the top-level state would have a “t” node, the next state pointed to by the “t” node would have an “h” node therein, and the state pointed to by that “h” node would have an “e” node therein with a valid flag set. If characters such as “thj” were searched, however, the “t” node would transition to the next state which would have an “h” node therein, but the next state pointed to by “h” node would not include a “j” node, and thus this word would not be a valid word.
While storing words in a trie structure is efficient in terms of both storage and access time, it is difficult to attach information to individual words in the trie. One known way to attach information to certain individual words stored in a trie is to tag selected words by setting a single “tag” bit in the last node of each selected word. Tagging is useful for identifying a small or regular subset of words for special processing upon decompression. For example, some words are slang words, which although acceptable (e.g., to a spell checker), are not recommended (e.g., by a thesaurus). If a trie is used to store words, the slang words can be tagged, whereby upon decompression, those words stand out from the rest. Then, the spell checker may ignore the tag, while the thesaurus may recognize the tag and thereby delete or change the appearance of the word in a list of synonyms presented to a user.
Another technique for associating information with words is known as global enumeration. Global enumeration is a technique that maps each word in the word list to a number and maps that number back to the same word, i.e., the number may be used to determine its associated word, and vice-versa. The numbers are dense, e.g. if there are N words in the list, then the words map to the range zero to N minus one. The number may serve as an index to information associated with specific words, which is useful if the same type of information is attached to every (or most) words in the list with little or no pattern. For example, the words in a thesaurus may be stored in a trie and enumerated, whereby the number associated with each word may serve as an index to a table of synonyms, a table of antonyms and so on. The tables themselves may be lists of numbers representing associated words that map back to the trie. By way of example, the user may want a synonym for a word that is enumerated in the trie as 957, whereby 957 is used as an index to a table of synonyms, resulting in the numbers 2040, 902 and 457 being retrieved. Those retrieved values are then used to find their corresponding words in the trie for display to a user.
While tagging and enumeration are thus helpful techniques, they are essentially limited to solving only their specific types of problems, i.e., marking certain words, or associating each of the words in a trie with a unique indexing number. Thus, these solutions work in certain circumstances, however there are many word lists that would benefit from having additional information stored with the word, and the existing techniques are neither flexible enough nor extensible to solve the problem in an efficient manner. For example, certain languages have gender associated with certain words, but not all words. Thus, a single bit is not sufficient to represent male, female or gender neutral. Separately tagging more than one subset of words can be done by setting aside an additional bit in each node for each additional subset, (e.g., one bit for gender or not, one bit for male or female), however reserving such tagging bits in each node reduces compression. While enumeration could be used to store the related gender information in an indexed table, enumeration requires the storing of numbers with the nodes, which in some instances is very inefficient, such as if enumeration is not otherwise needed and only a few words need such associated information.
SUMMARY OF THE INVENTION
Briefly, the present invention provides a method and system and accompanying data structure for the improved attaching of additional information onto words in a trie. The present invention is generally accomplished by providing a framework within the trie data structure capable of storing multiple tags with individual words, wherein some or all of the tags may further have associated values, and/or by separately enumerating some or all the subsets of tagged words (partial enumeration) in the trie, independent of whether global enumeration of all words is in use. To accomplish multiple tagging, the single tag bit on the last node of a word may be interpreted in a new way, as specified by information placed in a header of the trie. If set, it indicates that a further block of bits (e.g., a byte) is included in the node, which comprises a bitmask specifying which of a plurality of tags are set on that particular node. Header information may also specify which (if any) of the tags have associated values, which are then stored in association with each node having such a tag.
Partial enumeration of tagged items is provided by storing a count of the tagged words under a node. Multiple tags may be selectively and separately enumerated. Header information indicates how the enumeration is arranged, e.g., which of the plurality of tags are enumerated. Partial enumeration may be combined with global enumeration, with multiple tags, and/or with tags that have values, providing a flexible, extensible and efficient way to attach information to words in a trie.


REFERENCES:
patent: 4730348 (1988-03-01), MacCrisken
patent: 4814746 (1989-03-01), Miller et al.
patent: 4853696 (1989-08-01), Mukherjee
patent: 4959785 (1990-09-01), Yamamoto et al.
patent: 5283894 (1994-02-01), Deran
patent: 5475588 (1995-12-01), Schabes et al.
patent: 5488717 (1996-01-01), Gibson et al.
patent: 5488719 (1996-01-01), Kaplan et al.
patent: 5532694 (1996-07-01), Mayers et al.
patent: 5592667 (1997-01-01), Bugajski
patent: 5625554 (1997-04-01), Cutting et al.
patent: 5651099 (1997-07-01), Konsella
patent: 5721899 (1998-02-01), Namba
patent: 5768423 (1998-06-01), Aref et al.
patent: 5778371 (1998-07-01), Fujihara
patent: 5781772 (1998-07-01), Wilkinson, III et al.
patent: 5799299 (1998-08-01), Fujiwara
patent: 5832428 (1998-11-01), Chow et al.
patent: 595

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method and system for attaching information to words of a trie does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method and system for attaching information to words of a trie, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and system for attaching information to words of a trie will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3254557

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.