Detecting duplicate records in database

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06961721

ABSTRACT:
The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicate tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior-art approaches produce large numbers of false positives when they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a duplicate detection process is implemented by interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.
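The idea of using the table hierarchy as extra evidence can be illustrated with a minimal Python sketch. It is not the patented process: the abstract does not specify how the hierarchy is used, so the sketch assumes that two parent records (e.g., countries) are candidate duplicates when either their text is similar or the sets of child records that reference them (e.g., states) overlap strongly. The function names, the Jaccard overlap measure, the sample data, and both thresholds are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in for edit distance)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def child_overlap(children_a: set, children_b: set) -> float:
    """Jaccard overlap of the child records that reference each parent."""
    union = children_a | children_b
    if not union:
        return 0.0
    return len(children_a & children_b) / len(union)


def likely_duplicates(a: str, b: str, children: dict,
                      text_threshold: float = 0.8,
                      overlap_threshold: float = 0.5) -> bool:
    """Flag a pair when the text matches OR the hierarchy evidence is strong.

    Both thresholds are illustrative assumptions, not values from the patent.
    """
    if text_similarity(a, b) >= text_threshold:
        return True
    overlap = child_overlap(children.get(a, set()), children.get(b, set()))
    return overlap >= overlap_threshold


# Hypothetical dimension data: parent (country) -> child values (states).
children_of = {
    "USA": {"WA", "CA", "NY"},
    "United States": {"WA", "CA", "NY", "TX"},
    "Canada": {"BC", "ON"},
}

usa_pair = likely_duplicates("USA", "United States", children_of)
canada_pair = likely_duplicates("USA", "Canada", children_of)
```

In this sketch the textual score for "USA" and "United States" falls well below the threshold, so a purely string-based comparison would miss the abbreviation; the shared child records are what link the pair, while "USA" and "Canada" share no children and stay separate.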

REFERENCES:
patent: 6542896 (2003-04-01), Gruenwald
Dina Bitton and David DeWitt, Duplicate Record Elimination in Large Data Files, ACM Transactions on Database Systems (TODS) 8(2), 1983.
Vinayak Borkar, Kaustubh Deshmukh and Sunita Sarawagi, Automatic Segmentation of Text into Structured Records, In Proceedings of ACM Sigmod Conference, Santa Barbara, CA May 2001.
A. Broder, S. Glassman, M. Manasse and G. Zweig, Syntactic Clustering of the Web, In Proc. Sixth Int'l World Wide Web Conference, World Wide Web Consortium, Cambridge, pp. 391-404, 1997.
K. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft, When is “Nearest Neighbor” Meaningful? International Conference on Database Theory, pp. 217-235, Jan. 1999.
Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval. Addison Wesley Longman, 1999.
W. Cohen, Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. In Proceedings of ACM SIGMOD, pp. 201-212, Seattle, WA Jun. 1998.
Ronald Forino, Data e.quality: A Behind the Scenes Perspective on Data Cleansing. http://www.dmreview.com/, Mar. 2001.
I.P. Fellegi and A. B. Sunter. A Theory for Record Linkage. Journal of the American Statistical Association, 64:1183-1210, 1969.
Helena Galhardas. Data Cleaning Commercial Tools. http://caravel.inria.fr/~cleaning.html.
Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon and Cristian Saita. Declarative Data Cleaning: Language, Model and Algorithms. In Proceedings of the 27th International Conference on Very Large Databases, pp. 371-380, Roma, Italy, Sep. 11-14, 2001.
Helena Galhardas, Daniela Florescu, Dennis Shasha and Eric Simon. An Extensible Framework for Data Cleaning. In ACM SIGMOD, May 1999.
L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava. Approximate String Joins in a Database (Almost) for Free. In Proceedings of the 27th International Conference on Very Large Databases, pp. 491-500, Roma, Italy, Sep. 11-14, 2001.
Venkatesh Ganti, Johannes Gehrke and Raghu Ramakrishnan. Cactus-Clustering Categorical Data Using Summaries. In Proceedings of the ACM SIGKDD Fifth International Conference on Knowledge Discovery in Databases, pp. 73-83, Aug. 15-18, 1999.
David Gibson, Jon Kleinberg and Prabhakar Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. VLDB 1998, New York City, New York, Aug. 24-27.
Sudipto Guha, Rajeev Rastogi and Kyuseok Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proceedings of the IEEE International Conference on Data Engineering, Sydney, Mar. 1999.
M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD, pp. 127-138, San Jose, California, May 1995.
J. Kivinen and H. Mannila. Approximate Dependency Inference from Relations. Theoretical Computer Science 149(1):129-149, Sep. 1995.
J. Madhavan, P. Bernstein, E. Rahm. Generic Schema Matching with Cupid. VLDB 2001, pp. 49-58, Roma, Italy.
Alvaro Monge and Charles Elkan. The Field Matching Problem: Algorithms and Applications. In Proceedings of the Second International Conference on Knowledge Discovery and Databases (KDD), 1996.
A. Monge and C. Elkan. An Efficient Domain Independent Algorithm for Detecting Approximately Duplicate Database Records. In Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, Tucson, Arizona, May 1997.
H. Mannila and K.-J. Raiha. Algorithms for Inferring Functional Dependencies. Data and Knowledge Engineering, 12(1):83-99, Feb. 1994.
Felix Naumann and Claudia Rolker. Do Metadata Models Meet IQ Requirements? In Proceedings of the International Conference on Data Quality (IQ), MIT, Cambridge, 1999.
MIT Total Data Quality Management Program. Information quality. http://web.mit.edu/tdqm/www/iqc.
Erhard Rahm and H. Hai Do. Data Cleaning: Problems and Current Approaches. IEEE Data Engineering Bulletin, 23(4):3-13, Dec. 2000.
Vijayshankar Raman and Joe Hellerstein. Potter's Wheel: An Interactive Data Cleaning System. VLDB 2001, pp. 381-390, Roma, Italy.
