Data processing: speech signal processing – linguistics – language – Linguistics – Dictionary building – modification – or prioritization
Utility Patent
1999-02-10
2001-01-02
Isen, Forester W. (Department: 2747)
Data processing: speech signal processing, linguistics, language
Linguistics
Dictionary building, modification, or prioritization
C704S001000, C707S793000
Utility Patent
active
06169969
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to the fields of text processing and selective dissemination of information (SDI), and specifically to the field of rapid and massive string matching.
BACKGROUND OF THE INVENTION
The flood of newsgroup postings, broadcast e-mail, and other electronic documents has necessitated a revived interest in SDI (selective dissemination of information) systems, in which constantly-changing document collections are run against a relatively static body of queries, with the purpose of selecting documents of interest and routing them to subscribers. This body of queries, representing subscriber “profiles” of interest, specifies combinations of patterns or “keywords” that must be found in the documents for selection to take place.
The work of selecting and routing documents naturally separates into two processes: keyword scanning and query resolution. The latter process takes the keyword scanning output and looks for satisfaction of subscriber profiles, labeling successful documents with the appropriate information for subsequent routing. The former concerns itself with locating all keyword occurrences in the documents, and is the primary object of my invention.
In advance of processing any text, words of interest are compiled in a “dictionary.” Each text sample is then processed by consulting the dictionary, looking for matches between dictionary words and substrings of the text. More generally, one can process arbitrary byte streams not confined to text, and the dictionary may contain arbitrary entries as well. For simplicity, these arbitrary dictionary entries are termed “keywords,” though they may not be words in the usual sense. For practical SDI systems, dictionaries need to accommodate tens of thousands to hundreds of thousands of keywords. The scanning is to be performed at current digital data rates—on the order of tens of megabytes per second or more.
One may consider two common problems: the delimited match problem and the undelimited match problem. In the delimited match case, one insists that a match is only recognized in the sample when the keyword is flanked by delimiters. This is the situation, for example, when only whole words in English are recognized, taking spaces and punctuation as delimiters. A more realistic and difficult problem is the undelimited case, when such cues are not assumed, so that matches can be found anywhere. Undelimited searching permits detection of substrings, operation in languages that lack word delimiters (such as Japanese and Chinese), and general processing on byte streams that are not strictly text. Undelimited matching, required of my invention, is a generalization of delimited matching, since one can certainly test for delimiters after a putative match is found.
Early applicable work was done by Bloom: “Space/Time Trade-offs in Hash Coding with Allowable Errors,”
Communications of the ACM,
Vol. 13, No. 7, 1970, pp. 422-426. Given a dictionary of keywords {K
i
},i=1, d and a set of discrete samples {S
i
},i=1, N, Bloom disclosed a software method of determining whether a given sample was present in the dictionary. Bloom suggested a two-test approach in which the first test screens out most of the candidate matches, leaving a second test to remove the false hits. In one of his proposals, a hash table contains only binary entries, with an entry holding a one if any keyword hashes to that location, zero otherwise. Sample S
j
is tested indirectly through a hash function by calculating H(S
j
); if the table bin at this address holds a one, the sample passes the first test and goes on to the second test, in which a full comparison is made. Given a sample not in the dictionary, the probability of a false match in the first test is roughly the fraction of the table occupied by ones. Bloom recognized that the false alarm rate could be improved, up to a point, by applying multiple hash functions to the keywords and samples, addressing into the same binary table: In this scheme, m hash functions H
1
, H
2
, . . . , H
m
are identified. For each keyword K
i
, the table locations at positions H
1
(K
i
), H
2
(K
i
), . . . , H
m
(K
i
) are set. To test a sample S
j
, the bits in table locations H
1
(S
j
), H
2
(S
j
), . . . , H
m
(S
j
) are tested. If all of them are set, the sample passes the first test. Bloom does not address the processing of continuous byte streams.
Pryor et al., in “Text Searching On Splash 2,”
Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines,
1993, pp. 172-177, disclosed an apparatus that embodied and extended the teachings of Bloom, capable of locating delimited strings in continuous bytes streams. In their apparatus, the sample byte stream to be processed feeds parallel hashers, each of which generates addresses for an associated memory. These memories serve as hash tables, and contain entries of one bit each. The outputs from the memories feed a single AND gate, whose output serves as the alarm for a primary test. In advance of operation, the jth memory has been loaded with ones at the locations {H
j
(K
i
) |i=1, . . . , d }; the remaining entries are zero. In operation, when a delimiter is encountered in the stream, the hashers are reset. The bytes after the delimiter enter the hashers, which accumulate the hash function calculations. When the terminating delimiter is located, the hasher outputs are complete and the memory outputs are enabled, so that the memories are polled at the locations specified by the accumulated hash values. If the delimited word just read is a keyword, each of the hash tables will have a one in the specified location, and the AND gate will register an alarm. If the sample word is not a keyword, it is unlikely that all m of the specified memory locations will be set, in which case an alarm is averted. Once an alarm is registered, a secondary test verifies the match. The entire primary test apparatus is implemented in a single board employing field-programmable gate arrays to implement the hashers, paired with memories to act as the hash tables. A hosting computer performs secondary testing. The apparatus of Pryor et al. is not capable of detecting undelimited matches.
The software method of Karp and Rabin, “Efficient Randomized Pattern-Matching Algorithms,”
IBM Journal of Research and Development,
Vol. 31, No. 2, 1987, pp. 249-260, teaches that single keywords may be located in continuous byte streams without the use of delimiters using recursive n-gram hashing. An n-gram is an n-long sequence of consecutive bytes. Given the n-long keyword K, their method compared K to each n-gram S
j
of the sample string through their hashed values: H
1
,(K), H
2
(K), . . . , H
m
(K) were tested against H
1
(S
j
), H
2
(S
j
), . . . , H
m
(S
j
), respectively. Agreement caused a secondary test to verify the match. An extension of Karp-Rabin for whole dictionaries of keywords was proposed by Amir, et al, “Efficient Randomized Dictionary Matching Algorithms,”
Lecture Notes in Computer Science,
Vol. 644, pp. 262-275. Like Bloom, they proposed using a hash table to store the dictionary, though only one hash function was applied. Their method deals with only one length at a time; dictionaries composed of keywords with multiple lengths required multiple passes—one pass for each length. Their method could be implemented in hardware, but requires significant duplication of hardware for each keyword length.
Others have proposed methods and apparatus for matching multiple keywords in continuous byte streams based on finite state automata and on cellular automata. While these approaches do not require delimiters for detection, they present scaling challenges that make them impractical for continuous processing with large dictionaries.
SUMMARY OF THE INVENTION
In consideration of the problems detailed above and the limitations enumerated in the partial solutions thereto, an object of my invention is to provide dictionary string matching that overcomes limitations of the prior art by locating keyw
Bloor Stephen M.
Edouard Patrick N.
Isen Forester W.
The United States of America as represented by the Director of t
LandOfFree
Device and method for full-text large-dictionary string... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Device and method for full-text large-dictionary string..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Device and method for full-text large-dictionary string... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2535199