Method for determining approximate hamming distance and...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

Reexamination Certificate

active

06226640

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates generally to information retrieval from electronic storage devices, and more particularly, to a method and system for determining approximate hamming distance of two strings and approximate nearest neighbors of a query.
Comparing files or documents that reside remotely in different inquiring processors in a network is a task, which requires significant communication between the inquiring processors. For example, when a first inquiring processor wishes to compare a first file that resides in the first inquiring processor with a second file that resides in a remote second inquiring processor, the first and second inquiring processors must communicate the files or information about the files over the network.
The least sophisticated method for determining whether the two files match each other is to transmit one of the files over the network and to compare the files at one of the inquiring processors. Communicating an entire file, of course, is not efficient since the size of the file may be large.
A more efficient method for comparing the two files is to communicate, for example, the hash value of one of the files over the network and to compare the respective hash values of the files at one of the inquiring processors. This method, however, only checks for an exact match between the two files.
Hence, it is desirable to estimate at an inquiring processor how closely two files match each other. A hamming distance is one measure of how closely two files or strings match each other. For example, given two strings that are of equal length and include a sequence of bits, the hamming distance of the two strings represents the number of non-matching bits in the two strings.
Similarly, in electronic storage applications, an entry in an electronic storage device is a nearest neighbor of a query when the content of that entry is the closest match from among other entries in the storage device. For example, if the query and the entries in the storage device each include a sequence of d bits, a nearest neighbor entry in the storage device is an entry that has the least number of non-matching bits when compared with the query.
Searching for entries that are the nearest neighbors of a query is a task, which is performed in a variety of applications, including information retrieval, data mining, web search engines and other web related applications, pattern recognition, machine learning, computer vision, data compression, and statistical analysis. Many of these applications represent the entries in an electronic storage device as vectors in a high dimensional space. For example, one known method for textual information retrieval uses a latent semantic indexing, where the semantic contents of the entries and queries are represented as vectors in a high dimensional space.
The least sophisticated method for searching an electronic storage device for the nearest neighbors of a query is to compare, on-line or off-line, each entry in the storage device with the query. Comparing each and every entry with the query, of course, is not practical since the size of an average electronic device is large and continues to increase with the advancements in information storage technology.
Other known methods attempt to reduce the high dimensional representation of entries in electronic storage devices. For example, J. Kleinberg, “Two Algorithms For Nearest-Neighbor Search In High Dimensions,” in the proceedings of 29
th
Symposium Of Theory Of Computing, pp. 599-608 (1997), discloses two algorithms for reducing the search space when determining the nearest neighbors in an electronic storage device. The Kleinberg algorithms search for the nearest neighbors by drawing random projections from vectors, which represent the entries in the storage device, to a set of random lines in Euclidean space.
P. Indyk and R. Motwani, “Approximate Nearest Neighbors: Towards Removing The Curse Of Dimensionality,” in the proceedings of 30
th
Symposium Of Theory Of Computing (1998), discloses another algorithm for reducing the search space. The Indyk and Motwani algorithm searches for the nearest neighbors in an electronic storage device by partitioning the search space into spheres and by categorizing the entries in the storage device into buckets.
The above methods, however, requirement significant processing and storage resources. Therefore, it is desirable to have a method and system for overcoming the above and other disadvantages of the prior art.
DESCRIPTION OF THE INVENTION
Methods and systems consistent with the present invention determine whether a first string in an electronic storage device resides within a first hamming distance of a second string in the storage device. As used herein, “electronic storage device” refers to any processing system that stores information that a user at an inquiring processor may wish to retrieve. Moreover, the terms “electronic storage device” and “database” will be used interchangeably and should be understood in their broadest sense.
In one embodiment, a database determines one or more nearest neighbors of a query that are within a first hamming distance of the query. The database prebuilds a first set of strings by probabilistically selecting values of respective bits in each of the first set of strings based on a probability that depends on the first hamming distance. Based on the first set of strings, the database predetermines the trace values of data entries in the database, respectively, and stores the predetermined trace values as entries in a trace table.
For each trace value entry, the database identifies the data entries whose trace values are within a second hamming distance of the trace value entry, and stores the addresses of the identified data entries in the trace value entry. When the database receives a query, by identifying the trace value entry in the trace table that matches the trace value of the query, the database determines whether any data entries are within the first hamming distance of the query.
In another embodiment, a first processor communicates with a second processor to determine whether a first string that resides in the first processor is within a first hamming distance of a second string that resides in the second processor. The first processor and the second processor each have access to a shared third string that includes a plurality of bits, where the value of each bit is probabilistically pre-selected based on a probability that depends on a first hamming distance. The first processor computes a first inner product of the first string with the third string, and sends the first inner product to the second processor. When the second processor receives the first inner product, the second processor computes a second inner product of the second string with the third string.
The second processor compares the first inner product with the second inner product to determine whether the first string is within the first hamming distance of the second string as follows: The second processor determines that the distance between the first string and the second string is less than the first hamming distance when the first inner product equals the second inner product. The second processor determines that the distance between the first string and the second string is greater than the first hamming distance when the first inner product is different from the second inner product.
In yet another embodiment, the first processor and the second processor each have access to a shared set of strings that include a plurality of bits, where the value of each bit is probabilistically pre-selected based on a probability that depends on a first hamming distance. The first processor computes a first set of inner products of the first string with each of the set of strings, and sends the first set of inner products to the second processor. When the second processor receives the first set of inner products, the second processor computes a second set of inner products of the second string with the each of the set of stri

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method for determining approximate hamming distance and... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method for determining approximate hamming distance and..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method for determining approximate hamming distance and... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2506804

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.