System and method for similarity searching in...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

[ 0.00 ] – not rated yet Voters 0 Comments 0

Details System and method for similarity searching in... System and method for similarity searching in...

: 1998-10-07
: 2001-09-11
: Ho, Ruay Lian (Department: 2172)
: Data processing: database and file management or data structures
: Database design
: Data structure types

: C707S793000, C707S793000
: Reexamination Certificate
: active
: 06289354
: ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to searching high-dimensional databases. In particular, related information in a database is identified by performing a similarity search. Furthermore, the present invention relates to performing a nearest-neighbor similarity search to find instances in the database that are similar to a target instance.
BACKGROUND OF THE INVENTION
It is often desirable to detect related instances in a database that correspond to related information. Instances of information are represented and stored in a database in the form of a set of data values (feature vectors) in a multidimensional space. A dimension in the multidimensional space is a feature that characterizes objects represented by the data values. For example, consider the database of a credit card company containing customer information. Each customer is an object corresponding to an instance in the database that is a customer profile, or data value, in the multidimensional feature space of the database. Each data value is an n-tuple corresponding to an instance features, such as, age, sex, and salary. The dimensions of the multidimensional feature space are the features that characterize the customer namely the age, sex, and salary, as well as other information.
The nearest-neighbor problem is the problem of performing a similarity search to find related (“similar”) instances of information in the database. Data values in the database are deemed related if they lie a short distance from each other in the multidimensional feature space of the database. Specifically, the nearest-neighbor problem is that of finding the data value or set of k data values in the database that is closest (“similar”), in the sense of a distance metric, to a given target value.
The nearest-neighbor problem is an important problem with applications in data mining, collaborative filtering, and multimedia image database searching. See, for example, Roussopoulos N., Kelley S., and Vincent F., “Nearest neighbor Queries”, Proceedings of the ACM-SIGMOD International Conference on Management of Data, pp. 71-79, 1995. Data mining involves extracting structured information from large amounts of raw data. For example, a database belonging to a credit card company may contain demographic information including age, sex, and salary. By finding entries in the database with similar demographics it may be possible to deduce the risk associated with a particular debt. Collaborative filtering involves deducing the preferences of a given user by examining information about the preferences of other similar users. For example, suppose that customers of a music Compact Disc (CD) distributor asks customers to rank the CDs they like. If the distributor sells many different CDs, then it is likely that customers may not spend the time to rank all of the CDs available. For the purposes of marketing and sales promotions it may be desirable to try to predict the rank a particular customer may give to a particular CD, when the particular CD had not been previously ranked by that customer. Such a prediction may be produced, for example, by the average ranking of that particular CD, by other customers with similar characteristics to the particular customer.
Another example of an application of the nearest-neighbor problem is a text similarity search. For a given target web page, it may be desirable to find all other pages which are similar to it. Suppose that web pages are described in terms of keywords. In this case, a feature corresponds to a keyword. Finding the nearest-neighbor can be achieved by identifying keywords in the target and then finding neighbor WebPages based on the identified keywords.
Similarity searching may also be an important tool for image database applications. It may be useful to allow a user to search an image database for similar images. Features of an image may represent information related to color histograms, texture, luminescence, chrominescence or other characteristics of an image. Images may be deemed similar based on the likeness of image features.
Various methods for finding nearest-neighbors have been proposed, see for example, Roussopoulos N., Kelley S., and Vincent F., “Nearest neighbor Queries”, Proceedings of the ACM-SIGMOD International Conference on Management of Data, pp. 71-79, 1995; Berchtold S., Keim D., and Kriegel H. P., “The X-Tree: An Index Structure for High Dimensional Data”, Proceedings of the 22
nd
International Conference in Very Large Databases, pp. 28-39, 1996; and Berchtold S., Ertl B., Keim D. A., Kriegel H. P., and Seidel T., “Fast Nearest Neighbor Search in High-dimensional space”, Proceedings of the International Conference on Data Engineering, pp. 209-218, February, 1998. These previously proposed methods, however, do not allow the flexibility of performing similarity searches based on a subset, specified by a user, of the set of attributes characterizing objects represented by the data values in the database. Further, these previously proposed methods do not allow for the flexibility of identifying an entry in a database that is most similar to a multiplicity of target entries. Moreover, some of the previously proposed methods do not allow a user to search for more than one entry in the database that is similar to a given target entry.
SUMMARY OF THE INVENTION
Information is analyzed in the form of a plurality of data values that represent a plurality of objects. A set of features that characterize each object of the plurality of objects is identified. The plurality of data values are stored in a database. Each data value corresponds to at least one of the plurality of objects based on the set of features. Ones of the plurality of data values stored in the database are partitioned into a plurality of clusters. Each cluster of the plurality of clusters is assigned to one respective node of a plurality of nodes arranged in a tree hierarchy. Ones of the plurality of nodes of the tree hierarchy are traversed.

REFERENCES:
patent: 5577241 (1996-11-01), Spencer
patent: 5977890 (1999-11-01), Rigoutsos et al.
patent: 6092065 (2000-07-01), Floratos et al.
patent: 6108666 (2000-08-01), Floratos et al.
R. Sibson, “SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method”,The Computer Journal, The British Computer Society, vol. 16, No. 1, Feb. 1973.
Tian Zhang et al., “BIRCH: An Efficient Data Clustering Method for Very Large Databases”, Proceedings of the ACM SIGMOND International Conference on Management of Data, Montreal, Canada, Jun. 1996, pp. 103-114.
Raymond, T. Ng et al., “Efficient and Effective Clustering Methods for Spatial Data Mining”, Proceedings of the 20thInternational Conference on Very Large Data Bases, Santiago, Chile, 1994, pp. 144-155.
Nick Roussopoulos et al., “Nearest Neighbor Queries”, Proceedings of the ACM-SIGMOD International Conference on Management of Data, 1995, pp. 71-79.
Stefan Berchtold et al., “Fast Nearest Neighbor Search in High-dimensional Space”, Proceedings of the International Conference on Data Engineering, Feb. 1998, pp. 209-218.
David A. White et al., “Similarity Indexing with the SS-Tree”, Proceedings of the 12thInternational Conference on Data Engineering, New Orleans, U.S.A., Feb. 1996, pp. 516-523.
Stefan Berchtold et al., “The X-Tree: An Index Structure for High-Dimensional Data”, Proceedings of the 22ndInternational Conference in Very Large Databases, 1996, pp. 28-39.
Douglas H. Fisher, “Knowledge Acquisition Via Incremental Conceptual Clustering”, Machine Learning 2(2), 1987, pp. 139-172.
Mohamed Zait et al., “A Comparative Study of Clustering Methods”, FGCS Journal, Special Issue on Data Mining, 1997, pp. 149-159.
Samet H., Design and Analysis of Spatial Datastructures, Addison Wesley, 1993, pp. 135-141.

Affiliated with

Aggarwal Charu Chandra

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Hailpern Brent Tzion

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Wolf Joel Leonard

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Yu Philip Shi-Lung

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Also associated with

Ho Ruay Lian

Examiner

[ 0.00 ] – not rated yet Voters 0 Comments 0

International Business Machines - Corporation

Corporate Assignee

[ 0.00 ] – not rated yet Voters 0 Comments 0

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

System and method for similarity searching in... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with System and method for similarity searching in..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for similarity searching in... will most certainly appreciate the feedback.

Rate now

Comments { 0 }

Profile ID: LFUS-PAI-O-2498191

All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.

Canada

Charities
Companies
MP Candidates
Patents
Employee Salary Disclosure

World

Places of the World
Scientific Papers

United States

Banks
Companies
Counties
Patents
Employee Salary Disclosure