Apparatus and method for retrieving vector format data from...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000, C707S793000

Reexamination Certificate

active

06578031

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to processing vector format data, and more particularly to a data processing apparatus and method for retrieving a predetermined number of data pieces from a database in accordance with a similarity with input vector.
2. Related Background Art
A distance is widely used as a similarity between data expressed by vector. For example, in a character recognition system and a speech recognition system, sampled data is mapped in a feature quantity space spanned by a proper base to store vector-expressed data as a prototype. A distance between prototypes and newly input data is calculated to identify the input data as belonging to a class corresponding to the nearest prototype.
A calculation method having a worst efficiency is an exhaustive search. A calculation amount by this method is in the order of a product of the vector dimension and the number of prototypes.
The calculation amount of a distance or an inner product is recognized as a critical obstacle against database search. Because of recent rapid progress of a computer processing ability, a database can store not only text data but also non-text data such as images and sounds. In order to search such non-text data by using a keyword as in a conventional method, the non-text data is required to be added with a keyword in advance. If it is desired to avoid a work of adding a keyword, it is necessary to perform a similarity search by using feature quantity vector.
Even in searching text data, a similarity search algorithm is used which searches text data by using vector in order to realize flexible search. In this case, the calculation amount becomes a substantial issue in realizing a search system. The number of data pieces stored in a general database is over several hundred thousands. Therefore, as the order of a vector dimension is raised by one, the calculation amount increases desperately by several hundred thousands times.
In order to avoid such a case, it is essential to either lower the order of a vector dimension or reduce the number of data pieces to be calculated. The former corresponds to lowering the order of a vector dimension of a space which expresses data. Therefore, there is a possibility that information necessary for data search is not sufficiently expressed in vector components. The latter becomes meaningful methodology when the number of data pieces requested as search results is sufficiently small as compared with the total number of data pieces. Those cases to be processed by k-NN search belong to this category, and several effective methods have been proposed.
According to the k-NN search, k prototypes nearest to a test vector are searched from a set of prototypes stored in a system, and in accordance with classes of the searched prototypes, the class of the test vector is identified. In this case, one of important issues is how k prototypes nearest to the text vector are searched at high speed. This requirement is also applied to database search.
A search user desires only data pieces nearest to the search key designated by the user, among a large amount of data stored in a database, and does not desire other data pieces at all, much less values of distances and inner products. Techniques satisfying such requirements of a search user are coincident with objectives of a high speed algorithm of k-NN search.
In order to reduce the calculation amount required for searching k prototypes nearest to a test vector from a set of prototypes, each prototype is generally structurized in advance. The more the quality of data is reflected upon when structurization is performed, the more the search calculation amount is expected to be reduced.
For example, if a prototype is structurized hierarchically, an operation of dividing an N-dimensional space expressing a prototype is recursively repeated. A method of dividing the space by using a boundary which is a hyperplane is called a K-D-B Tree [Document 1], a method of dividing the space by a rectangular plane is called an R-Tree [Document 2], a method of dividing the space by a hyper-sphere is called an SS-Tree [Document 3], and a method of dividing the space by a combination of a rectangular plane and a hyper-sphere is called an SR-Tree [Document 4]. If an N-dimensional vector space is mapped to a space spanned by an eigenvector of a covariance matrix representing a prototype distribution, a structurization more effective for reducing a search calculation amount can be expected [Documents 5, 6].
With these methods, however, the calculation amount and storage capacity necessary for data structurization exponentially increases as the order of a vector dimension is raised. Therefore, application to those data expressed by high-dimensional vector may be practically restricted.
[Document 1] J T. Robinson: “The K-D-B Tree: A search Structure for Large Multidimensional Dynamic Indexes”, Proc. on ACM SIGMOD, pp. 10-18, 1981.
[Document 2] A. Guttman: “R-trees: A dynamic index structure for spatial searching”, Proc. ACM SIGMOD, Boston, USA, pp. 47-57, June 1984.
[Document 3] D A. White and R. Jain: “Similarity indexing with the SS-tree”, Proc. of the 12th Int. Conf. on Data Engineering, New Orleans, USA, pp. 323—331, February 1996.
[Document 4] Katayama and Satoh: “SR-Tree: A proposal of index structure for nearest neighbor searching of high dimensional point data”, IEICE Papers (D-I), vol. 18-D-I, no. 8, pp. 703-717, August 1997.
[Document 5] R F. Sproull: “Refinements to Nearest Neighbor Searching in K-dimensional Trees”, Algorithmica, 6, pp. 579-589, 1991.
[Document 6] D A. White and R. Jain: “Similarity Indexing: Algorithms and Performance”, Proc. on SPIE, pp. 62-73, 1996).
There are algorithms which use “gentle” structurization not incorporating statistical means and a little “smart” search algorithm, in order to reduce the calculation amount. Of these, one of the most fundamental algorithms is an algorithm by Friedman et al., called a mapping algorithm [Document 7].
[Document 7] J H. Friedman, F. Baskett, and L J. Shustek: “An Algorithm for Finding Nearest Neighbors”, IEEE Trans. on Computers, pp. 1000-1006, October 1975.
A data structurization requested as a pre-process of the mapping algorithm is a sorting process of sorting vector at each component, which process corresponds to structurization based upon a phase. Namely, if a prototype is d-dimensional vector, d sorting lists are generated.
With this process, two lists including a list V
j
storing j-component values arranged in the ascending order and a list I
j
storing corresponding prototype ID numbers, are formed as many as the order of a vector dimension. Namely, the value V
j
(n+1) at the (n+1)-th component value from the start of V
j
is equal to or larger than V
j
(n) at the n-th component value. The j component value Y
Ij(n)
(j) of the prototype Y
Ij(n)
having the ID number of I
j
(n) is coincident with V
j
(n).
A principle of the mapping algorithm for selecting a pair of prototypes nearest to a test prototype from a prototype set will be described with reference to
FIG. 10. A
search is performed by using a pair of sorting lists V
m
and I
m
selected by a proper criterion. In the example shown in
FIG. 10
, an m-axis is selected. Im stores the ID number of data sorted based upon the component values, so that the order on the list correctly reflects the phase along the m-axis. First, a value nearest to the m component X(m) of a test vector X is searched from V
m
. This value is assumed to be V
m
(j). The prototype corresponding to V
m
(j) is Y
Im(j)
. In the example shown in
FIG. 10
, Y
Im(j)
corresponds to Y
1
. Although Y
1
is nearest to X with respect to the m component, it is not necessarily nearest to X in the whole space.
Next, a distance &rgr;(X, Y
1
) between X and Y
1
is calculated. It can be understood that there is a possibility that only a prototype having the m compone

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Apparatus and method for retrieving vector format data from... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Apparatus and method for retrieving vector format data from..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Apparatus and method for retrieving vector format data from... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3124587

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.