Method for finding large numbers of keywords in continuous...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C704S009000, C704S010000

Reexamination Certificate

active

06311183

ABSTRACT:

FIELD OF THE INVENTION
This invention relates to the fields of text processing and selective dissemination of information (SDI), and specifically to the field of rapid and massive string matching, also referred to as pattern matching.
BACKGROUND OF THE INVENTION
The current flood of e-mail, newsgroup postings, and other network broadcasts has renewed interest in SDI (selective dissemination of information) systems, in which text samples are run against relatively static subscriber profiles. A text sample that satisfies a profile is selected and subsequently routed to the appropriate subscriber.
Selection is often done by matching keywords specified in the profiles. In advance of processing any sample text, words of interest are compiled in a dictionary. Each text sample is then processed by a keyword scanner, which consults the dictionary, looking for matches between dictionary words and substrings of the text. More generally, the text sample may be a sample stream consisting of arbitrary bytes, and the dictionary may consist of keywords or patterns of an arbitrary nature. In practice, such a keyword scanner need accommodate a large number of arbitrary keywords (say, ten thousand to one hundred thousand) and rapidly produce a vector of alarms, identifying keywords found and each occurrence location in the sample stream. Other processes may then use the alarm vector for selection and routing purposes.
The prior art contains several methods for searching a sample stream for all occurrences of dictionary keywords. One of the most successful approaches is the Aho-Corasick algorithm (A. V. Aho and M. J. Corasick, ‘Efficient String Matching: An Aid to Bibliographic Search,’ Communications of the ACM, Vol.18, 1975, reprinted in Computer Algorithms: string pattern matching strategies, Jun-ichi Aoe, ed., IEEE Computer Press, Los Alamitos, Calif., pp. 78-85, 1994). The Aho-Corasick algorithm implements a finite state machine, which undergoes a transition with each byte read from the sample stream. When states representing the ends of keywords are reached, the keyword matches are reported. The Aho-Corasick algorithm must read every byte in the sample stream, limiting the speed unacceptably. It also requires a great deal of memory for fast operation.
An alternative method was disclosed by Commentz-Walter (B. Commentz-Walter, ‘A String Matching Algorithm Fast on the Average,’ Proceedings of the International Colloquium on Automata, Languages and Programming, Lecture Notes in Computer Science, Vol. 71, H. A. Mauer, ed., Springer-Verlag, Berlin, 1979) and improved by Baeza-Yates and Régnier (R. Baeza-Yates and M. Régnier, ‘Fast Algorithms for Two Dimensional and Multiple Pattern Matching,’ SWAT 90, 2nd Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, Vol. 447, J. R. Gilbert and R. Karlsson, eds., Springer-Verlag, Berlin, pp. 332-347, 1990). This method, too, employs a finite state machine, driven by bytes taken backwards from a pivot point in the sample stream. Unlike the Aho-Corasick algorithm, it may “skip” (without examinination) some of the sample stream bytes by advancing the pivot by more than one byte, but it is slowed by the need to revisit bytes already examined. For large dictionaries, revisiting dominates skipping, and it is slower than the algorithm of Aho-Corasick. It has memory requirement commensurate with the Aho-Corasick algorithm.
Other methods require the use of special-purpose hardware to be competitive.
SUMMARY OF THE INVENTION
In consideration of the problems detailed above and the limitations enumerated in the partial solutions thereto, an object of my invention is to provide a method for locating all matches of a keyword dictionary in a sample stream.
Another object of my invention is to accommodate large dictionaries of keywords.
A further object of my invention is to accommodate arbitrary byte strings as keywords.
A still further object of my invention is to locate keywords rapidly by not examining every sample stream byte.
In order to attain the objectives described above, according to an aspect of my invention, there is provided a method for locating all matches of a keyword dictionary in a sample stream using n-grams of length n, the steps of which are summarized below.
My method moves forward through the sample stream, considering sample positions as locations for the termination of one or more keywords. As it examines a sample position, it applies a sequence of tests. The early tests hope to reject possible locations before more computationally expensive tests need be applied; the final tests explicitly attempt to match a few remaining candidate keywords to the sample.
The first test employs an extension to digrams and to dictionaries of the method of Boyer-Moore-Horspool skipping (R. N. Horspool, ‘Practical Fast Searching in Strings,’
Software—Practice and Experience
, Vol. 10, 501-506, 1980). Consider the problem of locating keyword k=k
1
k
2
. . . k
m
in the sample stream s=s
1
s
2
s
3
. . . , Horspool describes a method for locating a single keyword in the sample stream by testing successively advancing alignments in the sample stream. For each test alignment, the sample stream byte aligned with the last byte k
m
in the keyword is examined. Denote this sample stream byte by c. If c is not present in the keyword, then no match can occur at the current alignment and the next feasible alignment is m bytes to the right, since any intermediate alignment would attempt to match c with one of the keyword bytes known not to be c. More generally, if the last occurrence of c in the keyword is k
j
, then the next feasible alignment is m−j bytes to the right. To facilitate scanning for the single keyword using this observation, one can make a skip table T
0
, indexed by byte values, with entries defined by
T
0

[
c
]
=
{
min

{
m
-
j
|
k
j
=
c
}
,
c

k
m
,
c

k
.
All occurrences of the keyword may be located in the following manner: begin alignment at position m. Let c always denote the sample stream byte at the current alignment position. If T
0
[c] is nonzero, move the alignment To[c] bytes to the right and repeat until T
0
[c] is zero. If T
0
[c] is zero, examine the remaining sample stream bytes aligned with the keyword, reporting any matches that have occurred. Advance alignment 1 byte to the right and repeat. This process is rapid, because it permits examination of all feasible alignments, while permitting some bytes in the sample stream to be “skipped” without examination.
This process may be extended to a dictionary of multiple keywords D={k
1
, k
2
, . . . , k
|D|
}, with k
i
=k
1
i
k
2
i
. . . k
m
i
i
, aided by a new skip table
T
1

[
c
]
=
{
min
i
,
0

j

m
i

{
m
i
-
j
|
k
j
i
=
c
}
,
c

D
min
i

{
m
i
}
,
c

D
,
where
min
i

{
m
i
}
is the minimum keyword length. The procedure for finding feasible alignments is similar to the method for a single keyword, the differences being the use of skip table T
1
rather than T
0
, and the fact that more keywords must be tested at each feasible alignment. Again, speed is gained by skipping whenever T
1
[c] is greater than zero.
In practice, this method will not serve well for large dictionaries. First, the entries in T
1
usually get smaller as the dictionary increases, with most becoming zero, making skipping too rare to be useful. Second, one is still faced with testing the many keywords at each feasible alignement. The first limitation may be overcome by extending Boyer-Moore-Horspool skipping to digrams. The second limitation may be addressed by terminal n-gram sorting of keywords. Both of these aspects of my invention will be described below.
Boyer-Moore-Horspool skipping may be extended to digrams as follows: Given the sample position p, one may ask whether the digram d=s
p−1
s
p
terminates a keyword and, if not, how far the sample position may be advanced before a feasible align

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method for finding large numbers of keywords in continuous... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method for finding large numbers of keywords in continuous..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method for finding large numbers of keywords in continuous... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2593648

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.