Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-11-23
2001-10-16
Black, Thomas (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C704S010000
Reexamination Certificate
active
06304878
ABSTRACT:
FIELD OF THE INVENTION
The invention relates generally to computer systems, and more particularly to an improved method and system for arranging lexical data for searching.
BACKGROUND OF THE INVENTION
A trie is a data structure that is useful for compressing lexical data such as dictionary words. Tries are composed of states, with a top-level state representing, for example, each of the first letters (e.g., a-z) of all valid words in a given dictionary. Each state is comprised of nodes, wherein each node represents a valid letter in that state, along with some information about that letter, such as a pointer to a lower state (if any). Each state represents a transition from one character in a word to the next. For example, the letter “q ” in one state usually transitions to the letter “u” in a next lower state.
To use the trie, such as to find if a user-input word is a valid word in the dictionary, a search through the states is performed. For example, to find the word “the,” the top level state in the trie is searched until the “t” node is found, and then a next lower level state pointed to by the “t” node is searched to determine if there is an “h” node therein. If not, the word “the” would not be a valid word in that dictionary. However, if there is an “h” node in the state pointed to by the “t” node, the “h” node is examined to find a next state, if any. The state pointed to by the “h” node is then searched to find out whether there is an “e” node therein. If there is an “e” node, to be a valid word, the “e” node needs to be followed by some indication (e.g., a flag) indicating that a valid word exists at this time, regardless of whether the “e” node points to a further state. In a trie-structured dictionary that properly represents the English language, “the” would be a valid word, and thus the top-level state would have a “t” node, the next state pointed to by the “t” node would have an “h” node therein, and the state pointed to by that “h” node would have an “e” node therein with a valid flag set. If characters such as “thj” were searched, however, the “t” node would transition to the next state which would have an “h” node therein, but the next state pointed to by “h” node would not include a “j” node, and thus this word would not be a valid word.
In Western scripts such as English, the top level state is generally from fifty to eighty nodes in length, while states other than the top level state are usually two to ten nodes in length. To search the top-level state, the nodes initially have been arranged alphabetically, and linearly searched from left to right. To speed the search, it is known to reorder the nodes in a states based on lexical frequency, i.e., in the top-level state, the “s” node comes before a “k” node since more words begin with the “s” character than the “k” character. This speeds the search because on the average, less nodes need to be visited before a match is found.
However, in Eastern scripts such as Chinese or Japanese, the top level state can be over ten-thousand nodes in length. Regardless of how ordered, the average linear search through so many nodes takes too long to be used in ordinary applications on ordinary computers. Nevertheless, many existing algorithms that operate on tries expect the states to be linear. If the nodes in a state are arranged in some other manner, the existing linear-based algorithms fail. In short, there has heretofore not been an adequate way in which to enumerate a trie for faster searching while preserving the characteristics of linear states for existing linear search algorithms.
SUMMARY OF THE INVENTION
Briefly, the present invention provides an improved trie enumeration method, system and data structure, along with a method of searching same that operates with binary-like search speeds on states of nodes arranged linearly. The enumeration method for a given state selects a node within that state as a selected node, and converts the selected node to a skip node by moving it forward in the state relative to its original position therein. A pointer is set in the skip node to point to a further node in the state (i.e., a jump node), and a skip flag is set to identify the selected node as a skip node. Ordinarily, the initial skip node is selected from the middle of an alphabetically ordered state, placed at the beginning of the state, and provided with a pointer to its previous position, i.e., the jump node. The node directly before the jump node is set to be a soft terminal node.
As a result of the alphabetic ordering, the segment of nodes before the jump node are alphabetically before the skip node, while the segment of nodes after the jump node are alphabetically after the skip node. The segments of the state may be recursively split into further segments via further skip nodes, jump nodes and soft terminal nodes in each segment. A threshold number of nodes per segment may be used to determine whether to further split a segment. Once the segments are split as desired, (e.g., such that no segment has a number of nodes that exceeds the threshold), the nodes within the segment may be sorted, such as by lexical frequency.
To search the skip encoded state, the search starts at the beginning of the state. If a skip node is encountered that does not equal the search character, the search skips to the jump node if the search character is greater than the skip node, or to the next node (linearly to the right) if the search character is less than the skip node. The search of the nodes continues, essentially as a binary search when skip nodes are encountered, and a linear search otherwise, until a node is found that matches the search character or a soft or regular terminal node is reached.
Other advantages will become apparent from the following detailed description when taken in conjunction with the drawings, in which:
REFERENCES:
patent: 4730348 (1988-03-01), MacCrisken
patent: 4814746 (1989-03-01), Miller et al.
patent: 4853696 (1989-08-01), Mukherjee
patent: 4959785 (1990-09-01), Yamamoto et al.
patent: 5283894 (1994-02-01), Deran
patent: 5475588 (1995-12-01), Schabes et al.
patent: 5488717 (1996-01-01), Gibson et al.
patent: 5488719 (1996-01-01), Kaplan et al.
patent: 5532694 (1996-07-01), Mayers et al.
patent: 5592667 (1997-01-01), Bugajski
patent: 5625554 (1997-04-01), Cutting et al.
patent: 5651099 (1997-07-01), Konsella
patent: 5721899 (1998-02-01), Namba
patent: 5768423 (1998-06-01), Aref et al.
patent: 5778371 (1998-07-01), Fujihara
patent: 5781772 (1998-07-01), Wilkinson, III et al.
patent: 5799299 (1998-08-01), Fujiwara
patent: 5832428 (1998-11-01), Chow et al.
patent: 5951623 (1999-09-01), Reynar et al.
patent: 5966709 (1999-10-01), Zhang et al.
patent: 6009392 (1999-12-01), Kanevsky et al.
patent: 6131102 (2000-10-01), Potter
patent: 6182039 (2001-01-01), Rigazio et al.
patent: 6236959 (2001-05-01), Weise
patent: 41-0301596 A (1998-11-01), None
Knuth, Donald Ervin, “Sorting and Searching,”The Art of Computer Programming 2ndEdition, vol. 3, Addison-Wesley, pp. 492-496, 500-502, 507-509, 512, 576, 722 (1998).
Bennett John R.
Hullender Gregory N.
Karlov Donald D.
Black Thomas
Michalik & Wylie PLLC
Microsoft Corporation
Wang Mary
LandOfFree
Method and system for improved enumeration of tries does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and system for improved enumeration of tries, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and system for improved enumeration of tries will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2588219