Data processing apparatus and method

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000, C707S793000

Reexamination Certificate

active

06334129

ABSTRACT:

BACKGROUND OF THE INVENTION
This invention relates to a data processing apparatus and method for calculating the degree of similarity between vectors and, more particularly, to a data processing apparatus and method for outputting a prescribed number of items of data based upon degree of similarity with a query vector.
In general, distance is used as a means of calculating the degree of similarity between vectors. When a certain key vector has been given, a problem often dealt with is the extraction of L-number of vectors close to the key vector from a set of predetermined vectors. If a comparison operation with respect to all vectors in the set is performed, the amount of computation is on the order of O(MN), where M represents the number of vector dimensions and N the number of elements in the set of vectors.
Accordingly, several high-speed algorithms for distance computation have been proposed. The strategy common to these high-speed algorithms is to structuralize the data in advance and reduce the complexity of computation when distance is calculated.
For example, a method of creating sorting lists based upon the values of components along the respective axes of vectors involves sorting axes by an appropriate priority when a key vector has been given and specifying where the component values of the key vector are located in the sorting list of the axis situated at the highest ranking position. Then, on the basis of the vector IDs, the distances between the vectors and the key vector are calculated in order starting from a vector that has been stored at a nearby position. In order to obtain accurate results, the distance between all vectors and the key vector must be calculated. However, if the order relating to the component values of the selected axes approximately reflects the actual distances between the vectors, then satisfactory results can be attained with a small amount of calculation.
According to this method, an amount of calculation on the order of O(Nlog
2
N) suffices for preprocessing (structuralization of the data) and on the order of O(Llog
2
N) for the comparison with L-number of vectors. Though there are a K-d tree and its derivative versions in addition to this method, which is referred to as a projection method, the orders of the amounts of calculation for the preprocessing and the comparison operation are the same.
The prior art described above concerns the calculation of distances between vectors. However, there are cases where a norm is effective as a gauge for expressing the degree of similarity between vectors. For example, in “A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning”, by Y. Kiyoki, T. Kitagawa and T. Hayama, SIGMOD RECORD, Vol. 23, No. 4 (December 1994) (referred to as “Reference 1” below), the degree of similarity with a context vector is calculated as follows: A projection operator with respect to representation space is generated based upon a context vector, and the norm of a vector in subspace, which has been extracted by this projection operator, is calculated, thereby defining degree of similarity with the context vector.
Further, in “A High-speed Algorithm for a Semantic Image Search by a Mathematical Model of Meaning”, by Miyahara, Shimizu and Kitagawa, Research Report of the Information Processing Society of Japan, Database System 113-41 (Jul. 15, 1997) (referred to as “Reference 2” below), a high-speed calculation algorithm for such calculation of degree of similarity is proposed. This is a direct application of the projection method in the calculation of distance. More specifically, sorting lists concerning each of the axes are created in advance. When a context vector is given, the priorities of the axes are decided based upon the component values of the context vector. The degree of similarity with the context vector is decided based upon the priorities of the sorting lists and the ranking in each list. With this method, the amount of calculation in preprocessing is on the order of O(Nlog
2
N), and the comparison calculation is on the order of the number L of items of data output.
The conventional method of Reference 2 involves certain problems. This will be described in conjunction with
FIG. 2
, which is a diagram illustrating an example of a sorting list created according to the conventional method of Reference 2.
In
FIG. 2
, each member represents vector ID. Each row forms a sorting list and rows are sorted from the top to the bottom in accordance with priority of axes. Also, in each sorting list, vector IDs are listed from the left side to the right side in accordance with its rank.
According to the method of Reference 2, first it is judged that a vector situated at the top of a sorting list of an axis for which the priority is highest has the highest degree of similarity. In
FIG. 2
, this vector has the ID number 10. Next, the vector situated at the second rank of the same axis, namely the vector whose ID number is 6 in
FIG. 2
, is judged to be the vector having the second highest degree of similarity. The vector having the third highest degree of similarity is decided upon as being the vector situated at the top of the sorting list of the axis having the second highest priority. This vector has the ID number 3 in FIG.
2
. Thus, with the method of Reference 2, degrees of similarity are decided in the order of position in the lists. As a result, there is a possibility that the order of ID number 6 and ID number 3 will be the reverse of that of the actual degrees of similarity.
SUMMARY OF THE INVENTION
Accordingly, an object of the present invention is to make it possible to select optimum data from data candidates acquired from a plurality of sorting lists that have been obtained, thereby improving the consistency between the precedence of data candidates and actual degree of similarity.
Another object of the present invention is to make it possible to generate some sorting lists at high speed by selecting sorting lists based upon the strengths of components of a query vector.
A further object of the present invention is to make it possible to select an appropriate sorting list by performing the sorting list selection in accordance with the priority of output data.
In accordance with an embodiment of the present invention for attaining the foregoing objects, there is provided a data processing apparatus for extracting, from a set of data having a vector format, a first prescribed number of items of data having a high degree of similarity with a query vector, comprising: a database storing a set of data having a vector format; list creation means for creating lists of data in each of which data of the database is sorted in order of decreasing strength of respective one component of a vector; list-priority decision means for deciding a priority of each list; input means for inputting the query vector; selection means for successively selecting, from the lists, a second prescribed number of items of data not yet output based upon the list priority and ranking in each list; and output means for outputting one item of data, from the second prescribed number of items of data successively selected by the selection means, based upon degree of similarity between each of the second prescribed number of items of data and the query data.
Other features and advantages of the present invention will be apparent from the following description taken in conjunction with the accompanying drawings, in which like reference characters designate the same or similar parts throughout the figures thereof.


REFERENCES:
patent: 5864846 (1999-01-01), Voorhees et al.
patent: 5920859 (1999-07-01), Li
patent: 5943669 (1999-08-01), Numata
patent: 6131082 (2000-10-01), Hargrave, III et al.
“A Metadabase System for Semantic Image Search By a Mathematical Model of Meaning”, Yasushi Kiyoki, et al., SIGMOD Record, vol, 23, No. 4, Dec. 1994, pp. 34-40.
“A Fast Semantic Image Search Algorithm by a Mathematical Model of Meaning”, T. Miyahara, et al., Research Report of the Information Processing Society of Japan, Database S

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Data processing apparatus and method does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Data processing apparatus and method, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Data processing apparatus and method will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2577160

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.