System and method for regular expression matching using index

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000

Reexamination Certificate

active

06754650

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to matching regular expressions (regex) in data repositories such as the World Wide Web.
2. Description of the Related Art
Enormous amounts of information are available via computer networks, notably the World Wide Web. The information explosion has resulted in over a billion Web pages being available, with the number growing every day.
Perhaps the biggest challenge posed by this information deluge is how to sift through the mountain of data to find specific information. Not surprisingly, much effort has been devoted to efficiently finding sought-after information on the Web. Typically, such efforts focus on key word searching, i.e., enabling a user to type in one or more key words and subsequently returning Web pages that contain some or all of the words, in some ordered fashion.
As recognized herein, however, key word searching often requires a considerable amount of query result analysis by the user. That is, it remains the responsibility of the user to sift through the pages and extract the information being sought. To illustrate, suppose a user wants to find the middle name of Thomas Edison. Simply inputting the query “Thomas Edison” to a key word search engine will result in many hundreds or perhaps thousands of pages being returned, with the user being forced to wade through the result set until he or she happens upon the middle name.
In addition, key word search cannot match patterns which occur within words in text. Consider, for instance, a user who wants to find all references to .mp3 files stored on Stanford University computer systems. Searches of this kind are not possible using key word search methods.
As further recognized herein, examples such as the two postulated above can be more efficiently addressed using what is known in the art as “regular expressions”, or “regexes”. A regex is distinct from a key word search. Essentially, a regex is a string which defines a set of strings satisfying a pattern. A regex can be specified using a number of syntactic methods, including, e.g., that used by the tools flex, awk, and grep, or as set forth in the textbook
Automata Theory, Languages, and Computation.
In the context of the above example, a user could use the following regex to efficiently find the middle name of Thomas Edison: “Thomas \a+ Edison”, with “\a” meaning “any letter in the alphabet” and “+” meaning “one or more repetitions of the previous character”. Thus, the expression “\a\a+” means “find strings matching the pattern Thomas [at least two letters of the alphabet] Edison”, and would return only strings that have at least two letters between the words “Thomas” and “Edison”. In this way, strings that have only the middle initial of Edison or no middle name at all are not returned. Only strings having at least two letters (likely to be the name “Alva”) between “Thomas” and “Edison” are returned.
The present invention still further recognizes, however, that although regex matching can be used in the context of Web searching, for very large repositories like the Web, matching a regular expression using conventional regex principles can take days. Having recognized the above-noted considerations and problems, the present invention provides solutions to one or more of them as disclosed below.
SUMMARY OF THE INVENTION
To address the issues noted above, a regex query system is disclosed in which regex-based indices are first built using multigrams. When a regex query is received, the indices are used to return a set of potentially matching pages, which are then quickly and efficiently searched for matches to the regex query.
A general purpose computer is programmed according to the inventive steps herein. The invention can also be embodied as an article of manufacture—a machine component—that is used by a digital processing apparatus and which tangibly embodies a program of instructions that are executable by the digital processing apparatus to execute the present logic. This invention is realized in a critical machine component that causes a digital processing apparatus to perform the inventive method steps herein.
Accordingly, a general purpose computer includes a regular expression (regex) index generator that generates at least one multigram index, and a run time engine that receives a regex query and that accesses the index to respond to the query. The run time engine returns data from a repository of, e.g., Web pages, in response to the query. In a preferred non-limiting embodiment, a Web crawler is associated with the data repository, and the data repository includes Web pages.
As set forth in detail below, the run time engine parses the regex query and generates an execution plan based on the parsed query. Using the execution plan, potentially matching documents in the repository are found using the index. Then, the query is executed against the matching pages.
Preferably, the index includes plural keys, with each key having an associated posting list representing documents in the repository containing the key. In a preferred implementation, the keys are multigrams. The index preferably contains only minimal useful multigrams relative to a user-defined usefulness factor “c”. Further, the index can be further limited to include only suffix-free minimal useful multigrams. If desired, the index can be a predicate index including multigrams representing binary numbers.
In the presently preferred embodiment the run time engine generates an execution plan based on a parsed query by rewriting a query to have only “OR” or “STAR” logical connectives. The run time engine constructs a parse tree and replaces STAR nodes in the tree with NULL, and then removes NULL nodes. In one optimization, the run time engine executes the query by performing a fall regex match only in a predetermined vicinity of an anchor in a document.
In another aspect, a computer program device includes a computer program storage device that can be read by a digital processing apparatus. A program is on the program storage device. The program includes instructions that are executable by the digital processing apparatus for executing a regex query The program can include computer readable code means for generating at least one index, and computer readable code means for accessing the index to execute a regex query.
In still another aspect, a computer-implemented method for executing a regular expression (regex) query against a data repository includes generating at least one multigram index, and receiving a regex query. The method includes accessing the index based on the query. A set of Web pages is returned based on the accessing act. The Web pages are then searched for matches to the regex query.
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which:


REFERENCES:
patent: 5649186 (1997-07-01), Ferguson
patent: 6355423 (2002-03-01), Rothberg et al.
patent: 6493713 (2002-12-01), Kanno
patent: 886226 (1998-12-01), None
patent: 2002157252 (2002-05-01), None
Gilbert, E et al. “Multigram Codes”, IEEE Transactions on Information Theory, Mar. 1982, vol. 28, No. 2, pp 346-348.*
Baeza-Yates R. A. et al. Fast Text Searching for Regular Expressions or Automation Searching on Tries, Journal of the ACM, Nov. 1996, vol. 43, No. 6, pp. 915-936.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

System and method for regular expression matching using index does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with System and method for regular expression matching using index, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for regular expression matching using index will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3338224

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.