Searching for sequences of character data

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000

Reexamination Certificate

active

06768991

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to the field of data processing systems. More particularly, this invention relates to searching within a sequence of character data for occurrence of one or more words each formed of a predetermined sequence of characters.
2. Description of the Prior Art
It is known to provide content scanning systems that search within files of character data to identify predetermined strings. A typical such system would be an E-mail filtering system in which the strings being search for were members of a predefined collection of banned words, such as swear words.
A problem with such systems is that the volumes of data needing to be checked, such as the volume of E-mail traffic, is tending to increase at the same time as the number of strings that need to be checked for is also increasing. These factors combine to disadvantageously increase the processing load associated with content scanning of this type. Measures that can address this problem are strongly advantageous.
SUMMARY OF THE INVENTION
Viewed from one aspect the present invention provides a computer program product comprising a computer program operable to control a computer to search within a sequence of character data for occurrence of any of a plurality strings each formed of a predetermined sequence of characters, said computer program comprising:
(i) character identifying logic operable to identify a target character at a target search position within said sequence of character data;
(ii) string identifying logic operable to compare character data preceding said target character with data indicative of any of said plurality of strings that terminate with said target character to identify an occurrence of any of said plurality of strings that terminate with said target character; and
(iii) target search position advancing logic operable to advance said target search position within said sequence of character data by a predetermined number of character positions equal to a smallest number of character positions separating said target character from a last character within any of said plurality of strings.
The invention provides a modified form of the Boyer-Moore Fast String Searching Algorithm as is described in the publication “A Fast String Searching Algorithm”, R. S. Boyer with J. S. Moore, Communications of the Association for Computing Machinery, Volume 20, No. 10, page 762-772, 1977. In particular, the invention recognises that this approach may be adapted to search for a plurality of strings each formed of a predetermined sequence of characters in a single pass through the document in a manner that increases the overall efficiency of the search and allows it to cope with increasing data volumes and increasing numbers of strings for which a search is to be made.
The advancing of the search position is preferably controlled by a single jump table established for all of the plurality of strings for which a search is being made. The jump table holds for each character the closest proximity of that character to the end of any of the plurality of strings for which a search is being made. If a character does not appear in any of the strings for which a search is being made, then a jump equal to the shortest of the strings for which a search is being made.
A tree of search portions is kept. There are two distinct portion types in the structure.
An implementor may wish to describe explicitly within the portion its type.
1. A leaf portion. This is represented by:
The maximum length of all strings it contains.
The Depth of this portion in the tree.
A table of the strings matched by this portion.
(The type of the portion).
2. A Branch portion. This is represented by:
The depth of this portion.
The strings which are impicitly matched by this portion.
A table indexed by character describing which portion continues the search.
The portion it describes could be a leaf or branch portion.
(The type of the portion).
The preferred implementation has a different structure for the first portion in the tree. This is because it has no implicit matches (0 length strings) and an opportunity to optimize the first character→portion lookup.
Other aspects of the invention also provide a method for searching within a sequence of character data for predetermined strings and an apparatus for searching within a sequence of character data for predetermined strings.
The above, and other objects, features and advantages of this invention will be apparent from the following detailed description of illustrative embodiments which is to be read in connection with the accompanying drawings.


REFERENCES:
patent: 5226148 (1993-07-01), Littlewood
patent: 5418951 (1995-05-01), Damashek
patent: 5640551 (1997-06-01), Chu et al.
patent: 5708804 (1998-01-01), Goodwin et al.
patent: 5737732 (1998-04-01), Gibson et al.
patent: 6012057 (2000-01-01), Mayer et al.
patent: 6041323 (2000-03-01), Kubota
patent: 2003/0009447 (2003-01-01), Maurray et al.
Boyer et. al., A Fast String Searching Algorithm, Communication of the ACM, Oct. 1997, vol. 20, pp. 762-772.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Searching for sequences of character data does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Searching for sequences of character data, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Searching for sequences of character data will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3255418

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.