Method and apparatus for substring selectivity estimation

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

Reexamination Certificate

active

06401088

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to methods of querying a database. More particularly, the present invention relates to systems and methods for estimating substring selectivity in a database.
BACKGROUND OF THE INVENTION
With the growing importance of the Web, LDAP directory servers, and other text-based information stores, there is an ever greater need to evaluate queries involving string matching. One often wishes to obtain a quick estimate of the number of times a particular substring occurs in a database. A traditional application is for optimizing structured query language (SQL) queries with the “like” predicate (e.g. search for all names like “jones”); such predicates are pervasive in data warehouse queries, because of the presence of “unclean” data. Another example of a substring query is a wildcard query, which allows searches for partial matches on strings submitted as a query.
A commonly used data structure for indexing substrings in a database is the suffix tree, which is a trie that satisfies the following property: whenever a string is stored in the trie, then all suffixes of the string are stored in the trie as well. Given a substring query, one can locate all the desired matches using the suffix tree. A count-suffix tree is a variant of the suffix tree, which does not store pointers to occurrences of the substrings, but just keeps a count at the node corresponding to the substring in the tree. The storage requirements, nevertheless, of a full count-suffix tree can be prohibitive. Thus, methods have been proposed for estimating substring selectivity using another variation of the suffix tree: a pruned count-suffix tree (“PST”) which retains only those substrings, and their counts, for which the count exceeds some prune threshold.
FIG. 1
sets forth an example PST, with a prune threshold of 5. Labels are presented for (among others) substrings related to the database string “jones”, with counts shown in parentheses for some of the nodes in the PST. Variations of basic selectivity estimation algorithms have been proposed that, given a PST and a substring query, estimate the fraction of the count associated with the root of the tree that satisfies the substring query. See Krishnan, et al., “Estimating Alphanumeric Selectivity in the Presence of Wildcards,” Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 282-93, 1996, hereby incorporated by reference for background purposes (the methods collectively referred to herein as the “KVI algorithm”).
Histograms have also long been used for selectivity estimation in databases. They have been designed to work well for numeric attribute value domains, and one can obtain good solutions to the histogram construction problem using known techniques. For string domains, however, a histogram bucket that includes a range of consecutive lexicographic values is not likely to produce a good approximation, since the number of times a string occurs as a substring is not likely to be approximately the same for lexicographically successive substrings. Alternative methods have been proposed creating an end-biased histogram over a frequency sort of the attributes. Although the method has a close parallel to a PST, pruned count-suffix trees do better than such methods by taking domain knowledge into account.
Nevertheless, known estimation methods using PSTs can be inefficient and inaccurate. Thus, a need exists for systems and methods for performing substring selectivity estimation that are both more accurate and more efficient than currently-known systems and methods.
SUMMARY OF THE INVENTION
The present invention discloses systems and methods for estimating how many records in a database match a certain query condition. In particular, the present invention discloses systems and methods for substring selectivity that take advantage of database structure.
In one embodiment of the present invention, a probability of occurrence is received for each of a plurality of maximal substrings where each substring in the plurality of maximal substrings belongs to the string. All of these received probabilities of occurrence are multiplied together to obtain a single number, that number being called the first estimate. Next, a probability of occurrence for the maximal overlap of each string between adjacent strings in the plurality of substrings is received. The probabilities are all multiplied together to obtain a single number called a normalization factor. Finally, the first estimate is divided by the normalization factor to obtain a resulting estimate of substring selectivity.


REFERENCES:
patent: 4833712 (1989-05-01), Bahl et al.
patent: 5150430 (1992-09-01), Chu
patent: 5392363 (1995-02-01), Fujisaki et al.
patent: 5525982 (1996-06-01), Cheng et al.
patent: 5819265 (1998-10-01), Ravin et al.
patent: 5832480 (1998-11-01), Byrd et al.
patent: 6308149 (2001-10-01), Gaussier et al.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method and apparatus for substring selectivity estimation does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method and apparatus for substring selectivity estimation, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus for substring selectivity estimation will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2900916

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.