Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2001-11-26
2004-05-18
Rones, Charles (Department: 2175)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000
Reexamination Certificate
active
06738762
ABSTRACT:
TECHNICAL FIELD
The present invention relates to query processing of textual data stored in miultidimensional data sets and, more particularly, to a method of estimating co-occurrence of query substrings across the dimensions of the data set.
BACKGROUND OF THE INVENTION
In recent years, a new suite of services (such as LDAP directory services), standards (such as XML), and applications (such as E-commerce) have emerged, due at least in part to the proliferation of the Internet. Handling large amounts of text (as opposed to numeric data) is central to such Internet-related technology. Thus, there has been a resurgence of interest in the storage, management and query processing of textual data.
In many applications involving databases that process textual data, users pose (sub)string queries, which may either search for exact matches, or contain wildcards. For example, in an E-commerce catalog search, a user might inquire about all teal colored polo shirts that were sold during the month of June, 2001—this would be an example of an exact match on two separate attributes, namely, color and sale status (i.e., a two-dimensional query). In an LDAP directory search, a user might inquire about all people whose first name begins with the letters “Jo” (i.e., a “prefix” match), and whose phone number contains the sequence of digits “360” (a “substring” match). These examples can be expressed in structured query language (SQL) using the LIKE clause: WHERE color LIKE ‘teal’ AND sales LIKE ‘062001’ and, respectively, WHERE name LIKE ‘Jo%’ AND phone LIKE ‘%360%’. It is to be noted that the queries can specify any combination of exact, prefix, suffix, or proper substring matches.
In order to optimize such queries, particularly multidimensional queries, it is often useful to obtain fast and accurate estimates for their result sizes. One problem in the field of multidimensional substring selectivity estimation is related to the estimation of the fraction of tuples in the database that contain the query string as a substring in each dimension (also referred to as “attribute”), where prefix and suffix constraints can easily be reduced to substring constraints. Such an estimate may also suffice as an approximate answer to a COUNT query that returns the number of tuples in the database which contain the query as a substring in each attribute. Further, fast and accurate estimates for multidimensional string selectivity estimation may also help in refining queries in an online data analysis environment. As in any selectivity estimation, multidimensional string estimation methods must provide acceptable accuracy (for example, no more than 10-20% error) using a data structure that is allowed only a limited amount of space (i.e., small enough to reside in a main memory, but typically, significantly smaller). In addition, the time needed to build the data structure should not be prohibitive so that periodic rebuilding is feasible. Finally, online estimation must be fast (on the order of a fraction of a second), even when working with a massive data structure.
Multidimensional selectivity estimation has been an active area of research for many years. Most work in this area, however, has focused on numerical attributes and has assumed the existence of a mapping from multidimensional categorical data to fully ordered domains. In the case of multidimensional string queries, however, such mapping is of no use. For example, if the strings are sorted lexicographically, the substrings are not necessarily ordered. In the extreme of presuming that all substrings are explicitly represented and the frequency of individual points is approximated in the multidimensional space using standard techniques, the domain would be so large and the frequencies so small as to render the techniques impractical. An end-biased histogram approach may be used as an alternative, retaining only the substrings with the highest counts, subject to constraints on space, and approximating the other substrings assuming uniformity. Since the total number of substrings is very large, this approach would be very close in accuracy to one that makes the uniformity assumption, which is known to be highly inaccurate. Moreover, the need to retain all substrings with high counts becomes aggravated as the dimensionality increases.
New approaches for selectivity estimation tailored to the string domain have been developed in recent years. These approaches share a common framework of first performing a precomputation to store the number of tuples that contain the most frequently co-occurring substrings, defined as the “cross-counts”. Online estimation then involves parsing the query into subqueries such that the cross-count for each subquery is available from the precomputation process. The effectiveness of any particular approach within this framework relies on the prudent utilization of cross-counts.
One exemplary prior art technique for estimating cross-counts of string data is based on a variant of suffix trees and is referred to as the “count-suffix trees” method, where each node in the tree is augmented with the count of occurrences of its associated substring. The method maintains k suffix trees (one for each dimension), and a matrix for storing the cross-counts, that is, the number of occurrences of all substring combinations from each dimension. In order to limit the space used, a “pruning threshold” may be defined, and each suffix tree pruned to form a Pruned Suffix Tree (PST), each PST having at most m nodes (where, for simplicity, the value of m is uniform across the dimensions).
FIG. 1
illustrates the pair of PSTs for the data set: (ab,
12
), (abc,
123
), (bc,
123
), (ab,
23
). FIG.
1
(
a
) includes the PST for the alpha data dimension, and FIG.
1
(
b
) includes the PST for the numeric data dimension. Each node is defined by its substring, and illustrated within each node is the number of occurrences of that substring in the data set. In this case, the alpha data dimension is parsed into ab, b, and c, and the numeric data dimension is parsed into
12
,
2
and
3
. This parsing is a form of “greedy parsing” (i.e., includes overlap), and is illustrated diagrammatically in FIG.
2
(
a
).
FIG. 3
contains a matrix of each of the cross-counts between the parsed substrings. For example, the cross-count between ab and
2
is “
3
”, meaning that there are three different substrings that include the combination of ab with
2
. As with the trees illustrated in
FIG. 1
, the matrix of
FIG. 3
is also “pruned”, as shown by the line through the row of cross-counts associated with the substring
123
and the column of cross-counts associated with the substring abc.
Using the above example of a multidimensional (2-dimensional) query q—(abc,
123
), abc is parsed into pieces ab and c;
123
is parsed into pieces 12 and 3, where the query is used to determine the “cross-count” between abc and
123
. The subqueries resulting from the greedy parsing come from the cross-product of the pieces: (ab,
12
), (c,
12
), (ab,
3
) and (c,
3
). The associated subquery selectivities are then multiplied based on the independence assumption as follows:
Pr
⁢
{
(
abc
,
123
)
}
=
⁢
Pr
⁢
{
(
ab
,
12
)
}
×
Pr
⁢
{
(
c
,
12
)
}
×
Pr
⁢
{
(
ab
,
3
)
}
×
Pr
⁢
{
(
c
,
3
)
}
=
⁢
2
/
4
×
2
/
4
×
2
/
4
×
2
/
4
=
1
/
16.
Of course, this solution of “{fraction (1/16)}” is an exact solution, since this data set contains only four elements and it is straightforward to calculate each quantity.
An alternative method known in the prior art is denoted “k-d count-suffix trees” (or “KD”), in which each node corresponds to substring combinations from each dimension, and each node is augmented with the substring counts.
FIG. 4
illustrates an exemplary pruned k-d count-suffix tree for the data set example defined above (that is, for the four element data set (ab,
12
), (abc,
123
), (bc,
123
), (ab,
23
)). Since the capture of all combinations would require space exponential in k, a pruned data structure is g
Chen Zhiyuan
Korn Philip Russell
Koudas Nikolaos
Muthukrishnan Shanmugavelayutham
AT&T Corp.
Rones Charles
LandOfFree
Multidimensional substring selectivity estimation using set... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Multidimensional substring selectivity estimation using set..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Multidimensional substring selectivity estimation using set... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3186701