Method and apparatus determining and using hash functions...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000

Reexamination Certificate

active

06226629

ABSTRACT:

BACKGROUND OF THE INVENTION
The concept of “relational databases” is well-known. Relational databases are described generally in Date,
Database Systems
, Addison-Wesley publishing Co. 1986, which is herein incorporated by reference. Relational databases usually include a plurality of tables that are searched (“queried”) using a well-known query language, such as the SQL or SQL/MP query languages.
Very large databases can include many large tables. Each of these tables includes a plurality of rows and columns. Because it can become quite time-consuming for the system to perform queries on large database having large tables, the process of “query optimization” becomes quite important in processing queries for large databases. Query optimization is usually performed by the software that will execute the query prior to executing the query.
Query optimization in relational databases relies heavily not only on table size, i.e., rowcount, but also column cardinalities, i.e., total number of distinct values in a column. Some conventional systems employ histograms in the optimization process, determining cardinalities for percentile or other partitions. Other conventional systems rely on cost functions derived primarily from such estimates.
Sorting once per column is a rather costly approach, and on the other hand, mathematical statistics has not succeeded in applying sampling theory to this problem with the usual efficacy.
Probabilistic counting methods can achieve arbitrary accuracy for multi-column histogram/UEC estimates in a single parallel pass with minimal memory requirements. A pre-requisite is a hash function producing uniformly distributed output on all input data sets, where uniform is defined as “yielding a uniform statistical distribution.” Conventional systems have not employed a truly uniform hashing function. Thus, probabilistic counting has not yielded sufficiently accurate results based on rigorous tests.
For example, a technique described in by Flajolet involves some non-trivial analytic number theory, is based on the intuition that a uniform hash onto s-bit integers should produce values . . . 10
k
, with probability approximately 2
−k
, for k<s (where 0
k
represents a string of k: consecutive 0's.) The largest such k can be used to estimate log2(n). (See Flajolet et al, “Probabilistic Counting Algorithms for Database Applications,” Journal of Computer and System Sciences, Vol. 31, No. 2, October 1985, pp. 182-209; Flajolet et al., “Probabilistic Counting,” Proc. 24
th
Annual Symposium on the Foundations of Computer Science,” (IEEE) November 1983, pp. 76-82; and Flajolet, “On Adaptive Sampling,” Computing 34, pp. 391-408, each of which is herein incorporated by reference).
Each successive hash value is accordingly examined to determine the rightmost 1-bit, whose position is recorded by OR'ing a bitmap of size m=2
s
. At the conclusion of the run, the bitmap will have the form . . . 01
R
, and the expected value of R is proven to be
E
(
R
)=log2(&phgr;
n
),
where &phgr;=0.77351 is analytically determined, with a standard deviation &sgr;
R
=1.12 (bits).
To reduce the expected error (which would exceed a factor of 2), an appeal is made to the central limit theorem, which, as the basis of sampling theory, asserts that for estimators based on summation sampling variance decreases by a factor of N as sample sizes increase by the same amount, and that sampling distributions tend towards normality. (Without such guarantees, the association of a probability of error with a standard deviation would be practically meaningless).
SUMMARY OF THE INVENTION
The described embodiment of the present invention introduces the creation and use of two nearly uniform independent hash functions. The hash functions are created using only linear arithmetic and 4-byte machine register operations and, thus, can be created very quickly. They can also be created quickly for each of a large group of items. The first uniform hash function h
1
and the second uniform hashing function h
2
are pairwise independent. Such a fast, nearly uniform hashing method is extremely useful in, for example, database optimizers, which need to be able to determine the cardinality of a large group of items, such as items in a database table.
As will also be described, such uniform hash functions can be used for a number of other applications where a uniform hash proves valuable. One such application lies in the field of encryption and data security, where a fast, nearly uniform hash function can be used to generate message digests for data to be transmitted electronically.
In accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method of determining a pair of hash functions, by a data processing system having a memory, comprising the steps of: choosing four 32-bit random values; determining two pairwise independent hash values, in accordance with the four 32-bit random values and a 32-bit value x, using only linear arithmetic and 4-byte machine register operations.
In further accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method of determining a number of unique values in a large number of data items stored in a memory of a data processing system, the method comprising the steps of: providing in the memory a linear bitmap of a first predetermined size and a second plurality of bitmaps of a second predetermined size; creating, using only linear arithmetic and 4-byte machine register operations, a first uniform hashing function h
1
and a second uniform hashing function h
2
, which are pairwise independent; using h
1
to update the first bitmap; using h
2
to update the second plurality of bitmaps; determining a first cardinality from the linear bitmap; if the cardinality is less than or equal to a predetermined number, deeming the first cardinality to be the number of unique items; and if the cardinality is greater than the predetermined number, determining a second cardinality in accordance with the plurality of bitmaps and deeming the second cardinality to be the number of unique items.


REFERENCES:
patent: 5032987 (1991-07-01), Broder et al.
patent: 5204966 (1993-04-01), Wittenberg et al.
patent: 5633858 (1997-05-01), Chang et al.
patent: 5778395 (1998-07-01), Whitting et al.
patent: 5802521 (1998-09-01), Ziauddin et al.
patent: 5949884 (1999-09-01), Adams et al.
patent: 5950185 (1999-09-01), Alon et al.
“Security in Computing,” Pfleeger, Charles Second Edition, Prentice Hall, 1996, Chapter 3, pp. 101-110, Chapter 4, pp. 169-170, p. 552, Apr. 1996.*
Abstract, Wigderson, A., “The Amazing Power of Pairwise”, Institute for Computer Science Hebrew, pp. 1-2, Jul. 1, 1977, Jerusalem, Israel.
Blellock, G. E. et al.; A Comparison of Sorting Algorithms for the Connection Machine CM-2; pp. 1-30, date unknown.
Abstract, Wigderson, A., “The Amazing Power of Pairwise”, Institute for Computer Science Hebrew, pp. 1-2, Jul. 1, 1977, Jerusalem, Israel.
Morton M. Astrahan, Mario Schkolnick and Kyu-Young Whang, “Approximating the Number of Unique Values of an Attribute Without Sorting,” Information Systems, vol. 12, No. 1, 1987, pp. 11-15.
Philippe Flajolet and G. Nigel Martin, “Probabilistic Counting Algorithms for Data Base Applications,” Journal of Computer and System Sciences, vol. 31, 1985, pp. 182-209.
Philippe Flajolet and G. Nigel Martin, “Probabilistic Counting,” 24th Annual Symposium on Foundations of Computer-Science, Nov. 7-9, 1983, pp. 76-82.
Kyu-Young Whang, Brad T. Vander-Zanden and Howard M. Taylor, “A Linear-Time Probabilistic Counting Algorithm for Database Applications,” ACM Transactions on Database Systems, vol. 15, No. 2, Jun. 1990, pp. 208-229.
Samuel S. Wilks, “Mathematical Statistics,” John Wiley & Sons, Inc., Sec. 9.3., pp. 259 and 260. Date Unknown.
S. Seshadri and Jeffrey F. Naughton, “Sampling Issues in Parallel Database Systems,” 3rd International Conference on Extending Database Technology, Springer-Verlag, Mar. 23-27, 1992,

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method and apparatus determining and using hash functions... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method and apparatus determining and using hash functions..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus determining and using hash functions... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2525529

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.