Trie compression using substates and utilizing pointers to...

Data processing: speech signal processing – linguistics – language – Linguistics – Dictionary building – modification – or prioritization

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C701S001000, C707S793000, C707S793000

Reexamination Certificate

active

06298321

ABSTRACT:

FIELD OF THE INVENTION
The invention relates generally to computer systems, and more particularly to an improved method and system for compressing lexical data.
BACKGROUND OF THE INVENTION
A trie is a data structure that is useful for compressing lexical data such as dictionary words. Tries are composed of states, with a top-level state representing, for example, each of the first letters (e.g., a-z) of all valid words in a given dictionary. Each state is comprised of nodes, wherein each node represents a valid letter in that state, along with some information about that letter, such as a pointer to a lower state (if any). Each state represents a transition from one character in a word to the next. For example, the letter “q” in one state usually transitions to the letter “u” in a next lower state.
To use the trie, such as to find if a user-input word is a valid word in the dictionary, a search through the states is performed. For example, to find the word “the,” the top-level state in the trie is searched until the “t” node is found, and then a next lower level state pointed to by the “t” node is searched to determine if there is an “h” node therein. If not, the word “the” would not be a valid word in that dictionary. However, if there is an “h” node in the state pointed to by the “t” node, the “h” node is examined to find a next state, if any. The state pointed to by the “h” node is then searched to find out whether there is an “e” node therein. If there is an “e” node, to be a valid word, the “e” node needs to be followed by some indication (e.g., a flag) indicating that a valid word exists at this time, regardless of whether the “e” node points to a further state. In a trie-structured dictionary that properly represents the English language, “the” would be a valid word, and thus the top-level state would have a “t” node, the next state pointed to by the “t” node would have an “h” node therein, and the state pointed to by that “h” node would have an “e” node therein with a valid flag set. If characters such as “thj” were searched, however, the “t” node would transition to the next state, which would have an “h” node therein, but the next state pointed to by “h” node would not include a “j” node, and thus this word would not be a valid word.
Tries are used in many types of devices, including those wherein storage space is at a premium. To save space, tries are compressed by using known compression techniques, including those that attempt to efficiently store the information in the trie. Previous compression technologies exploited similarities in both the prefixes and suffixes of words, known as head merging and tail merging, respectively. In head merging, for example, all words in a trie that begin with “ja” share the “j” of the top level state, which points to a next level state with a single “a” node therein. In tail merging, for example, all words that end with an “s” essentially end with the same information, i.e., an “s” node that is marked as terminal, and thus may share a single “s” terminal state.
While tail merging saves a significant amount of space, tail merging is limited in that only completely identical subtrees in the trie may be merged. In other words, tail merging cannot be used where subtrees are only partially the same. This limits its usefulness as a compression technique, particularly in languages such as English wherein there are many exceptions to the way words are spelled. For example, in a (limited) dictionary the words “be't'” and “we't'” may share the same endings (suffixes) of “s',” “ter'” and “ting',” where the apostrophe (') represents a valid word flag. However if “be't'” has a further suffix of “tor'” that is not shared by “we't,” only the “r'” and the “ng'” endings may be merged via tail compression. In sum, even though the subtrees are nearly identical, only the parts thereof that are actually identical may be shared in tail compression.
SUMMARY OF THE INVENTION
Briefly, the present invention provides an improved trie compression method using substate compression such that partially identical subtrees may be merged to some extent. To this end, states of the trie are selected, and the nodes of those states examined find nodes that are identical to one another. The most frequently occurring identical node is selected as a substate, and the states are separated into a first group of states that have the substate node therein and a second group of states that do not. The nodes in the first group of states are reordered such that the substate is at the end thereof. Then, the substate of each state is merged into a single node, and the node preceding the substate node provided with a right pointer thereto. The substate compression is performed recursively on the remaining nodes of the first group, and on the second group, i.e., a new most frequent node is selected as the substate, which is then merged, until no further identical nodes are available for merging. Essentially, compression is achieved by replacing identical nodes with pointers thereto.
Other advantages will become apparent from the following detailed description when taken in conjunction with the drawings, in which:


REFERENCES:
patent: 4730348 (1988-03-01), MacCrisken
patent: 4814746 (1989-03-01), Miller et al.
patent: 4853696 (1989-08-01), Mukherjee
patent: 4959785 (1990-09-01), Yamamoto et al.
patent: 5283894 (1994-02-01), Deran
patent: 5475588 (1995-12-01), Schabes et al.
patent: 5488717 (1996-01-01), Gibson et al.
patent: 5488719 (1996-01-01), Kaplan et al.
patent: 5532694 (1996-07-01), Mayers et al.
patent: 5592667 (1997-01-01), Bugajski
patent: 5625554 (1997-04-01), Cutting et al.
patent: 5651099 (1997-07-01), Konsella
patent: 5721899 (1998-02-01), Namba
patent: 5768423 (1998-06-01), Aref et al.
patent: 5778371 (1998-07-01), Fujihara
patent: 5781772 (1998-07-01), Wilkinson, III et al.
patent: 5799299 (1998-08-01), Fujiwara
patent: 5832428 (1998-11-01), Chow et al.
patent: 5951623 (1999-09-01), Reynar et al.
patent: 5966709 (1999-10-01), Zhang et al.
patent: 6009392 (1999-12-01), Kanevsky et al.
patent: 6131102 (2000-10-01), Potter
patent: 6182039 (2001-01-01), Rigazio et al.
patent: 6236959 (2001-05-01), Weise
patent: 41-0301596A (1998-11-01), None
Knuth, Donald Ervin, “Sorting and Searching,”The Art of Computer Programming2ndEdition, vol. 3, Addison-Wesley, pp. 492-496, 500-502, 507-509, 512, 576, 722 (1998).

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Trie compression using substates and utilizing pointers to... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Trie compression using substates and utilizing pointers to..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Trie compression using substates and utilizing pointers to... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2612428

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.