Primitive operator for similarity joins in data cleaning

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

07406479

ABSTRACT:
A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, Soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.
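
The set-overlap idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented SSJoin implementation: the function names (`qgrams`, `jaccard`, `set_similarity_join`), the choice of 3-grams as the constructed sets, the 0.5 threshold, and the naive nested-loop comparison are all assumptions made here for illustration.

```python
from itertools import product


def qgrams(value, q=3):
    """Build the set of q-grams (length-q substrings) for a string.

    Hypothetical helper: this is one way to construct the "set"
    associated with a value; other tokenizations are possible.
    """
    s = value.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the overlap divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def set_similarity_join(left, right, threshold=0.5, q=3):
    """Return (l, r) pairs whose q-gram sets have Jaccard >= threshold.

    Naive all-pairs comparison, for illustration only; a real operator
    would avoid comparing every pair.
    """
    left_sets = [(v, qgrams(v, q)) for v in left]
    right_sets = [(v, qgrams(v, q)) for v in right]
    return [
        (l, r)
        for (l, ls), (r, rs) in product(left_sets, right_sets)
        if jaccard(ls, rs) >= threshold
    ]


if __name__ == "__main__":
    pairs = set_similarity_join(
        ["Microsoft Corp", "Oracle"],
        ["Microsoft Corporation", "Orcale", "IBM"],
    )
    print(pairs)
```

In this sketch the similarity function is a plug-in: swapping `jaccard` for another overlap-based measure changes the notion of “closeness” without changing the join itself, which is the building-block role the abstract attributes to the operator.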

REFERENCES:
patent: 2004/0260694 (2004-12-01), Chaudhuri et al.
patent: 2005/0055321 (2005-03-01), Fratkina et al.
patent: 2005/0262044 (2005-11-01), Chaudhuri et al.
patent: 2006/0179052 (2006-08-01), Pauws et al.
Ananthakrishna, et al. “Eliminating Fuzzy Duplicates in Data Warehouses” Proceedings of the 28th VLDB Conference, Hong Kong, China (2002) 12 pages.
Chatziantoniou, et al. “Querying Multiple Features of Groups in Relational Databases” Proceedings of the 22nd VLDB Conference Mumbai (Bombay), India (1996) pp. 295-306.
Chatziantoniou, et al. “Groupwise Processing of Relational Queries” Proceedings of the 23rd VLDB Conference Athens, Greece (1997) pp. 476-485.
Chaudhuri, et al. “Robust and Efficient Fuzzy Match for Online Data Cleaning” SIGMOD San Diego, California (Jun. 9-12, 2003) 12 pages.
Cohen, William W. “Data Integration Using Similarity Joins and a Word-Based Information Representation Language” ACM Transactions on Information Systems, vol. 18 No. 3 (Jul. 2000) 34 pages.
Gravano, et al. “Text Joins in an RDBMS for Web Data Integration” WWW2003 Budapest, Hungary (May 20-24, 2003) 12 pages.
Gravano, et al. “Approximate String Joins in a Database (Almost) for Free” Proceedings of the 27th VLDB Conference, Rome, Italy (2001) 10 pages.
Guha, et al. “Merging the Results of Approximate Match Operations” Proceedings of the 30th VLDB Conference, Toronto, Canada (2004) pp. 636-647.
Hernandez, et al. “The Merge/Purge Problem for Large Databases” SIGMOD San Jose, California (1995) pp. 127-138.
Ramasamy, et al. “Set Containment Joins: The Good, The Bad and The Ugly” Proceedings of the 26th VLDB Conference, Cairo, Egypt (2000) pp. 351-362.
Sarawagi, et al. “Efficient Set Joins on Similarity Predicates” SIGMOD Paris, France (Jun. 13-18, 2004) 12 pages.
Chaudhuri, et al. “Robust Identification of Fuzzy Duplicates” (2004) Proceedings of the 1st ACM Workshop on Hardcopy Document Processing 12 pages.
Fellegi, et al. “A Theory for Record Linkage” (1969) American Statistical Association vol. 64, 29 pages.

Profile ID: LFUS-PAI-O-2766289
