System and method of finding near neighbors in large metric...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details

C707S793000, C382S106000, C382S159000, C382S209000, C382S225000


active

06446068

ABSTRACT:

BACKGROUND
The problem of information retrieval from databases is becoming more and more important as our world moves further into the information age. Not only are databases becoming larger and more ubiquitous, but they are also being used to store more complex data types than in the past, such as textual documents, images, audio clips, and multimedia files. Often such items can be characterized by vectors of relatively large dimensionality. For example, a set of 200 by 200 pixel images could be viewed as vectors in a 40,000-dimensional space. In other situations, complex data types may not be easily mapped into a vector space, yet can still be characterized as points in a “large metric space”, in which distances between pairs of data items can nevertheless be computed. In either case, it is desirable to have methods that can quickly search for stored data points which in some sense are good matches for a search query. For example, the query might be an image, and the goal of search might then be to find similar images in the database. Such a search is often called a “near neighbor” search, in that we are seeking stored data items that are near the query point in terms of some distance metric.
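Before any tree structure is introduced, the baseline such methods aim to beat is direct comparison of the query against every stored item. A minimal sketch in Python (the function names and sample points are mine, not the patent's):

```python
import math

def euclidean(a, b):
    # Distance metric; any metric satisfying the usual axioms could be substituted.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exhaustive_nearest(query, points, dist=euclidean):
    # Baseline "exhaustive search": compare the query to every database item.
    return min(points, key=lambda p: dist(query, p))

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(exhaustive_nearest((0.9, 1.2), points))  # → (1.0, 1.0)
```

This always finds the true nearest neighbor but costs one distance computation per stored item, which is what motivates the tree-structured approaches below.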
There are well-known methods for performing near neighbor searches efficiently when the data dimensionality is small, such as 3-dimensional for example. Typically such methods store the data in some type of tree structure, such that a search can proceed by following a path from the tree's root to a “leaf node” which (hopefully) best matches the query.
FIG. 1 illustrates such a structure with a two-dimensional set of data points, lettered a to r. In the figure, the data has been hierarchically clustered into three levels, as indicated by the thickness of the circle representing each point. At the highest level are the data points g, k, and m, which we will suppose were randomly chosen out of the entire data set. These three points are used to partition the rest of the data into three regions, such that any given point is associated with the high-level point it is closest to. The thick lines indicate this separation.
At the next level of the hierarchy, two points have been randomly chosen within each high-level region, and are indicated by medium-thickness circles. For example, e and c were chosen in the “g region”. The rest of the data points are then again split up, this time according to which second-level point they are closest to. The thin lines indicate these sub-regions. Similarly, we could form more levels, as would be done in a typical application, which would have many more data points.
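The recursive construction just described can be sketched as follows. This is an illustrative sketch under the assumption of random pivot selection; the `build_tree` name and the dictionary node layout are my own, not the patent's:

```python
import random

def dist(a, b):
    # Euclidean distance; any metric would serve equally well.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def build_tree(points, branching=3, rng=None):
    # Pick `branching` pivots at random, assign every remaining point to
    # its closest pivot, and recurse within each region -- mirroring the
    # g/k/m partition of FIG. 1.
    rng = rng or random.Random(0)
    if len(points) <= branching:
        return [{"pivot": p, "children": []} for p in points]
    pivots = rng.sample(points, branching)
    regions = [[] for _ in pivots]
    for p in points:
        if p in pivots:
            continue
        nearest = min(range(len(pivots)), key=lambda i: dist(p, pivots[i]))
        regions[nearest].append(p)
    return [{"pivot": pv, "children": build_tree(reg, branching, rng)}
            for pv, reg in zip(pivots, regions)]
```

Each top-level entry corresponds to one region of FIG. 1; deeper levels arise automatically from the recursion as more data points are added.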
FIG. 2 shows how the data set of FIG. 1 can be viewed as a tree, containing a set of “nodes” connected by directed “links”. In this case the tree's root node does not correspond to any actual data item, but will be used conceptually as the starting point for any search. More generally, prior art methods sometimes use trees in which no nodes except the leaf nodes correspond to actual retrievable data points. Such differences are not relevant for present purposes, however. Also, some of the tree's nodes contain more than two child nodes; this is done here in order to limit the number of tree levels for explanatory purposes. Prior art methods may in general have any number of children for a given node, but commonly have two or fewer.
FIGS. 2 and 3 illustrate with dotted lines how a typical search might be conducted within the data set and search tree of FIGS. 1 and 2. In FIG. 3, an additional point, labeled Q, is shown to indicate a hypothetical query point. The goal of search is then to find the data point which is the nearest neighbor of Q. More generally, we might be interested in multiple nearest neighbors, or simply “near” neighbors, that is, approximately nearest neighbors. Starting at the root node, the search first compares Q to each of the first-level nodes. Out of these three, node m is the closest to Q, so it is chosen as the current search node. The child nodes of m are then checked, of which node p is closest to Q, so node p becomes the current node. The children of p are then checked, but neither of them is closer to Q than is p, so the search terminates, with node p as the result.
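The greedy descent just described can be sketched in Python. The coordinates below are hypothetical stand-ins loosely imitating FIG. 1; none of these numbers come from the patent:

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def greedy_search(query, children):
    # Start below the conceptual root: take the closest top-level pivot,
    # then repeatedly move to the closest child, stopping once no child
    # is closer to the query than the current node.
    current = min(children, key=lambda n: dist(query, n["pivot"]))
    while current["children"]:
        child = min(current["children"], key=lambda n: dist(query, n["pivot"]))
        if dist(query, child["pivot"]) >= dist(query, current["pivot"]):
            break
        current = child
    return current["pivot"]

# Hypothetical coordinates standing in for FIG. 1's regions.
tree = [
    {"pivot": (2.0, 8.0), "children": [                 # the "g" region
        {"pivot": (1.0, 9.0), "children": []},
        {"pivot": (3.0, 7.0), "children": []},
    ]},
    {"pivot": (8.0, 8.0), "children": [                 # the "k" region
        {"pivot": (9.0, 9.0), "children": []},
        {"pivot": (7.0, 6.5), "children": []},          # plays the role of "o"
    ]},
    {"pivot": (6.0, 2.0), "children": [                 # the "m" region
        {"pivot": (5.0, 4.0), "children": []},          # plays the role of "p"
        {"pivot": (7.0, 1.0), "children": []},
    ]},
]
print(greedy_search((6.5, 5.0), tree))  # → (5.0, 4.0), though (7.0, 6.5) is closer
```

Just as in the text, the greedy path terminates at the “p” stand-in even though the “o” stand-in is the true nearest neighbor, which is exactly the failure mode discussed next.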
Note that the result of this particular search, node p, is not actually the nearest neighbor to the query Q. Such a possibility must be considered with virtually any non-exhaustive search method, and can be a problem whenever the query is not identical to any data point. If only an approximately nearest neighbor is required, this may not be a problem. Otherwise, the search must be modified to check additional paths. In the example of FIGS. 1 through 3, it is apparent that a search beginning at node k is necessary in order to find the true nearest neighbor of Q, which is node o. Prior art methods which are tree-based typically use some form of “backtracking” in order to deal with this problem. In particular, they typically use distance information at each node to decide which branches need to be searched, and which don't, depending on how likely each branch is to contain the true nearest neighbor(s). This often takes the form of backtracking, because a reasonable approach is to first do a fast search for an exact match, which only needs to follow one root-leaf path, and then subsequently back up and search other branches if an exact match is not found.
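One simple way to realize the backtracking idea is a best-first traversal that expands branches in order of pivot distance while remembering the best point seen. This is a hedged sketch of the general principle only; real prior-art methods additionally prune branches using stored distance bounds (e.g. covering radii), which is omitted here:

```python
import heapq
import itertools

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def best_first_search(query, children):
    # Expand nodes in order of pivot distance, remembering the best pivot
    # seen; the counter breaks ties so dict nodes are never compared.
    counter = itertools.count()
    best, best_d = None, float("inf")
    heap = [(dist(query, n["pivot"]), next(counter), n) for n in children]
    heapq.heapify(heap)
    while heap:
        d, _, node = heapq.heappop(heap)
        if d < best_d:
            best, best_d = node["pivot"], d
        for child in node["children"]:
            heapq.heappush(heap, (dist(query, child["pivot"]), next(counter), child))
    return best

# A hypothetical tree loosely imitating FIG. 1 (coordinates invented).
tree = [
    {"pivot": (2.0, 8.0), "children": [
        {"pivot": (1.0, 9.0), "children": []},
        {"pivot": (3.0, 7.0), "children": []},
    ]},
    {"pivot": (8.0, 8.0), "children": [
        {"pivot": (9.0, 9.0), "children": []},
        {"pivot": (7.0, 6.5), "children": []},
    ]},
    {"pivot": (6.0, 2.0), "children": [
        {"pivot": (5.0, 4.0), "children": []},
        {"pivot": (7.0, 1.0), "children": []},
    ]},
]
print(best_first_search((6.5, 5.0), tree))  # → (7.0, 6.5)
```

Because no branch is ever pruned, this sketch visits every node; the pruning bounds are what make real backtracking searches faster than exhaustive comparison — when they work.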
When data dimensionality is large, for example tens of dimensions or more, prior art search methods become less efficient. The same is true for non-vector data items having comparable levels of complexity. The most common approach by far in such situations is still a tree-structured database. A problem with this approach is that the number of tree nodes which must be searched increases severely, until in the limit of very large dimensionality, the search is no faster than simple exhaustive search by direct comparison of the query to every database item. There appear to be at least two reasons for this. The first reason is that for any reasonable data set, the search space becomes mostly empty as the dimensionality grows very large. Put differently, the average distance between a data item and its nearest neighbor grows large and, importantly, becomes nearly the same as the average distance between two randomly chosen points. Another way of saying this is that the distribution of interpoint distances becomes very narrow, with increasing dimensionality. Because of this, measuring the distance between a given point and the query provides very little information as to which other points will be closest to the query. This is a well-known problem, and is often called the “dimensionality curse”.
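The narrowing of the interpoint-distance distribution can be checked empirically with a quick Monte Carlo sketch (the `concentration` helper and its parameters are my own, for illustration only):

```python
import random
import statistics

def concentration(dim, n_pairs=2000, seed=0):
    # Ratio of standard deviation to mean of distances between random
    # point pairs in the unit hypercube: it shrinks roughly like
    # 1/sqrt(dim), so distances "bunch up" as dimensionality grows.
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        a = [rng.random() for _ in range(dim)]
        b = [rng.random() for _ in range(dim)]
        dists.append(sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5)
    return statistics.stdev(dists) / statistics.mean(dists)

for dim in (2, 20, 200):
    print(dim, round(concentration(dim), 3))  # the ratio falls as dim grows
```

When nearly all pairs are at nearly the same distance, knowing the distance from a pivot to the query says almost nothing about which branch holds the nearest neighbor — which is the "dimensionality curse" in operational terms.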
However, while the dimensionality curse seems to be an insurmountable problem, there is an additional problem with tree structures, which has not been addressed by the prior art. In particular, conventional tree structures are too rigid to allow a fast search, even when the dimensionality curse does not apply. A tree only divides up the search space in one way, and any given data item is associated with only one branch of the tree. This rigid division implies that there will be fixed boundaries between sub-regions (as shown in FIG. 1). Moreover, whenever a query point falls near such a boundary, the search procedure will typically need to check the sub-regions on both sides of the boundary. This, of course, makes the search less efficient for such query points. Furthermore, as the effective dimensionality of the space increases, the probability that a query will be near a boundary—and indeed, near many boundaries—increases dramatically. Because of this, a typical tree search in a large-dimensional space is required to check many branches. This problem is a direct result of the use of a strict tree structure, and occurs even when the space is not “mostly empty” as described above. I will call this the “rigid hierarchy problem”, to distinguish it from the more general dimensionality curse problem which results from empty space.

Profile ID: LFUS-PAI-O-2914069
