Method, system, program, and data structure for a dense...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000, C704S010000

Reexamination Certificate

active

06470347

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The field of the present invention relates to a method, system, program, and data structure for a dense array storing character strings that provides improved compression over current array structures for storing character strings.
2. Description of the Related Art
Many operations on text require or benefit from some type of dictionary. A “dictionary” generally comprises an abstract list of character strings to represent and the data structure to store that list of character strings. A “word” is one of the individual character strings stored in the dictionary. An “alphabet” refers to the repertoire of characters needed to represent all of the words in the dictionary. For instance, a dictionary of English words has an alphabet consisting of the letters of the Latin alphabet, plus some punctuation marks, such as the hyphen and apostrophe, that can appear in the middle of a word. In French, the alphabet would also include certain accented letters used in French; in Arabic, the alphabet would consist of the Arabic alphabet, plus various Arabic punctuation marks, but not the letters from the Latin alphabet.
Dictionaries are used for spell-checking, hyphenation, and other types of morphological analysis. However, storing a long list of words (the English language, for example, has some 30,000 words in common use, out of a total vocabulary of well over 100,000 words) can consume a substantial amount of storage space and be quite cumbersome to search for a particular word.
Two general schemes to store lists of objects in computer memory include arrays and linked data structures. An array is a list where all of the items are the same size and stored contiguously in a single range of memory. Accessing a single item is direct and fast, because the address of the item can be calculated easily from the address of the array and the item's line number. A linked list, on the other hand, is a structure where each item in the list is stored in an independent range of memory along with the address of the next item in the list. The memory block containing the item data and the pointer to the next item is called a “node.” Accessing items in the list can require searching numerous nodes. Typically, a search starts at the first node in the list and follows the links until the desired item is reached. This can be especially time consuming if the linked list has a substantial number of nodes, such as the number of nodes needed to represent the words in a dictionary.
One family of linked data structures are called “trees.” In a tree, each node in the tree may point to two or more further nodes. The nodes a node points to are called its “children.” Every node in the tree is pointed to by only one other node (its “parent”), except for the “root node,” through which the entire tree is accessed. A node with no children is called a “leaf.” One of the most common abstract representations of a tree data structure is called a “trie.” The nodes in a trie represent only a single character in the word of which they are a part. Each node can have an arbitrary number of children (in the dictionary application, the maximum number of children is the number of characters in the alphabet plus one). This approach allows redundancy to be squeezed out of a list of words because all words starting with the same letters share the nodes representing those letters.
FIG. 1
illustrates a trie implementing a tree data structure such that any node can point to any number of additional nodes. The trie in
FIG. 1
stores the characters in the phrase “Now is the time for all good men to come the aid of their country.” Each child of the root node has the first character in each word of the phrase, each child of the child of the root has the second character in each word beginning with the character at the child node to the root node. Thus, the i children of each jth node includes the (j+1)th character in the i strings having the previous j characters at the nodes on branches contiguously connecting the root node to the jth node. An empty node indicates an end of a string or word. Examples of strings that share the same node include “come” and “country” that share the nodes for “co” and all of the words that start with “t” are children of the “t” node.
FIG. 2
illustrates a trie represented as a binary tree also storing the characters of the phrase “Now is the time for all good men to come to the aid of their country.” Each node in the trie is represented by one or more nodes in the binary tree. The root node of the trie, for example, is represented by the chain of nodes going down the right-hand side of the diagram (the nodes here have been rearranged into alphabetical order, a technique that can speed up the search). The “a” node in
FIG. 1
is represented by the “i” and “l” nodes in the upper left-hand corner of FIG.
2
. To search the trie represented as a binary tree as shown in
FIG. 2
, the algorithm would follow the below steps:
1. Begin at the root node (“a” in FIG.
2
). Compare the first letter of the word to the letter in the root node. If there is a match, proceed to step
3
.
2. If there isn't a match, follow the node's right link and match the letter against that node's letter. If there isn't a match, repeat this step. Proceed until either a match is found or the current node has no right link. If there is no right link, the algorithm terminates and the word isn't in the dictionary. Otherwise, continue to step
3
.
3. If the matching node's letter is the “not in the alphabet” token, the algorithm terminates and the word is present in the dictionary.
4. Otherwise, follow the node's left link and advance to the next letter in the word. (If there are no more letters in the word, use the “not in the alphabet” token as the next letter.) Compare this new letter to the current node's letter and go back to step
2
.
Another type of data structure that may be used to store strings of characters, such as the words in a dictionary, is an array or matrix. In fact, the tree data structures discussed above may be converted to an array data structure.
FIG. 3
illustrates an array in which the phrase “Now is the time for all good men to come to the aid of their country” is stored. Each row is a node in the trie (the root node is row
0
). Each column contains the link fields of all nodes corresponding to the letter at its head (e.g., the “o” column contains all of the link fields corresponding to the letter “o”). A period represents an empty link (e.g., the period in the “b” column of row
0
means that there are no words in the dictionary starting with “b”, and the period in the “d” column of row
15
means there are no words in the dictionary beginning with “ad”). In this way, the periods represent the absence of branches in FIG.
1
. Internally, the periods are represented by the value
0
, since nothing can loop back to the root node. The # sign at the top of the last column is the column for the “not in the alphabet” character. A negative one (−1) in this column is a link to the “end of word” node. In this way, nodes and characters of strings may be ascertained from the array shown in FIG.
3
.
Arrays provide faster searching for strings than trees. With trees, the nodes of the trees must be traversed to find a node matching the first character of the subject string. However, with arrays, the first character is instantly located in row zero in the column corresponding to the first character. Subsequent characters or dependent nodes can readily be determined from the array without having to traverse, and access, non-matching nodes. Although arrays are faster to search, they are not as efficient at storing data as trees are, as there may be numerous empty cells in the array. As can be seen, the tries in
FIGS. 1 and 2
have no non-empty nodes or cells whereas the array in
FIG. 3
has more empty cells than non-empty. In fact, the array in
FIG. 3
is referred to as a sparse array because many array cells are empty.
Bo

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method, system, program, and data structure for a dense... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method, system, program, and data structure for a dense..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method, system, program, and data structure for a dense... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2974683

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.