Reexamination Certificate
2007-11-13
2007-11-13
Gaffin, Jeffrey (Department: 2165)
Data processing: database and file management or data structures
Database design
Data structure types
active
10600083
ABSTRACT:
To help ensure high data quality, data warehouses validate and, if needed, clean incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, the product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate, or fuzzy, match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings, referred to as q-grams, overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
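The idea of matching a dirty input tuple against a reference relation through q-grams can be illustrated with a minimal sketch. This is not the similarity function disclosed in the patent: it assumes a plain Jaccard overlap of q-gram sets averaged across fields, a linear scan of the reference table, and a cutoff of 0.5. The names `qgrams`, `similarity`, `fuzzy_match`, and the sample product records are all hypothetical.

```python
def qgrams(text, q=3):
    """Return the set of length-q substrings of a lowercased, padded string."""
    # Padding with '#' lets characters at the string edges contribute q-grams.
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def similarity(a, b, q=3):
    """Jaccard similarity of the q-gram sets of two strings (0.0 to 1.0)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def fuzzy_match(input_tuple, reference, q=3, threshold=0.5):
    """Return (best_reference_tuple, score), or (None, score) below threshold."""
    best, best_score = None, 0.0
    for ref in reference:
        # Assumption: tuples have the same fields in the same order,
        # and every field carries equal weight.
        score = sum(similarity(x, y, q) for x, y in zip(input_tuple, ref)) / len(ref)
        if score > best_score:
            best, best_score = ref, score
    return (best, best_score) if best_score >= threshold else (None, best_score)


# Hypothetical product reference relation: (name, description).
reference = [
    ("Widget Pro", "blue steel widget"),
    ("Gadget Max", "large red gadget"),
]

# An incoming record with typos in both fields.
match, score = fuzzy_match(("Widgt Pro", "blue steel wdget"), reference)
```

Here `match` is the closest reference tuple, which a cleaning step could substitute for the incoming record. A linear scan is used only to keep the sketch short; it does not reflect the efficiency techniques the abstract refers to.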
REFERENCES:
patent: 5577249 (1996-11-01), Califano
patent: 6556987 (2003-04-01), Brown et al.
patent: 6636850 (2003-10-01), Lepien
patent: 6961721 (2005-11-01), Chaudhuri et al.
patent: 7051277 (2006-05-01), Kephart et al.
patent: 2002/0124015 (2002-09-01), Cardno et al.
patent: 2004/0249789 (2004-12-01), Kapoor et al.
Rahm et al., “Data Cleaning: Problems and Current Approaches”, IEEE Bulletin of the Technical Committee on Data Engineering, vol. 23, No. 4, Dec. 2000.
Hernandez et al., “Real-world Data is Dirty: Data Cleaning and The Merge/Purge Problem”, Journal of Data Mining and Knowledge Discovery 2(1):9-37; 1998.
Ananthakrishna et al., “Eliminating Duplicates in Data Warehouses”, Proceedings of the 28th International Conference on Very Large Databases (VLDB) 2002, Hong Kong.
R. Ananthakrishna, S. Chaudhuri and V. Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. In Proceedings of VLDB, Hong Kong, 2002.
A.N. Arslan, O. Egecioglu and P.A. Pevzner. A New Approach to Sequence Comparison: Normalized Local Alignment. Bioinformatics, 17(4):327-337, 2001.
J.A. Aslam, K. Pelekhov and D. Rus. Static and Dynamic Information Organization with Star Clusters. CIKM 1998, pp. 208-217.
J.A. Aslam, K. Pelekhov and D. Rus. A Practical Clustering Algorithm for Static and Dynamic Information Organization. ACM-SIAM Symposium on Discrete Algorithms, 1999.
V. Borkar, K. Deshmukh and S. Sarawagi. Automatic Segmentation of Text Into Structured Records. In Proceedings of ACM SIGMOD Conference, Santa Barbara, CA, May 2001.
A. Broder, S. Glassman, M. Manasse and G. Zweig. Syntactic Clustering of the Web. In Proc. Sixth Int'l. World Wide Web Conference, World Wide Web Consortium, Cambridge, pp. 391-404, 1997.
K. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft. When is “Nearest Neighbor” Meaningful? International Conference on Database Theory, pp. 217-235, Jan. 1999.
S. Chaudhuri, K. Ganjam, V. Ganti and R. Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. In Proceedings of ACM SIGMOD, San Diego, CA, Jun. 2003.
W. Cohen. Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. In Proceedings of ACM SIGMOD, pp. 201-212, Seattle, WA, Jun. 1998.
R. Forino. Data e.quality: A behind the Scenes Perspective on Data Cleansing. http://www.dmreview.com/, Mar. 2001.
H. Galhardas, D. Florescu, D. Shasha, E. Simon and C. Saita. Declarative Data Cleaning: Language, Model, and Algorithms. In Proceedings of the 27th International Conference on Very Large Databases, pp. 371-380, Roma, Italy, Sep. 11-14, 2001.
H. Galhardas, D. Florescu, D. Shasha and E. Simon. An Extensible Framework for Data Cleaning. In ACM SIGMOD, May 1999.
L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava. Approximate String Joins in a Database (Almost) For Free. In Proceedings of the VLDB 2001.
V. Ganti, J. Gehrke and R. Ramakrishnan. Cactus-Clustering Categorical Data Using Summaries. In Proceedings of the ACM SIGKDD Fifth International Conference on Knowledge Discovery in Databases, pp. 78-83, Aug. 15-18, 1999.
D. Gibson, J. Kleinberg and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. VLDB 1998, New York City, New York, Aug. 24-27.
S. Guha, R. Rastogi and K. Shim. Rock: A Robust Clustering Algorithm for Categorical Attributes. In Proceedings of the IEEE International Conference on Data Engineering, Sydney, Mar. 1999.
Y. Huhtala, J. Karkkainen, P. Porkka and H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. In Proceedings of the 14th International Conference on Data Engineering (ICDE), pp. 392-401, Orlando, Florida, Feb. 1998.
M. Hernandez and S. Stolfo. The Merge/Purge Problem for Large Databases. In Proceedings of the ACM SIGMOD, pp. 127-138, San Jose, CA, May 1995.
J. Madhavan, P. Bernstein, E. Rahm. Generic Schema Matching with Cupid. VLDB 2001, pp. 49-58, Roma, Italy.
A. Monge and C. Elkan. The Field Matching Problem: Algorithms and Applications. In Proceedings of the Second International Conference on Knowledge Discovery and Databases (KDD), 1996.
A. Monge and C. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. In Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, Tucson, Arizona, May 1997.
F. Naumann and C. Rolker. Do Metadata Models Meet IQ Requirements? In Proceedings of the International Conference on Data Quality (IQ), MIT, Cambridge, 1999.
E. Rahm and H. Hai Do. Data Cleaning: Problems and Current Approaches. IEEE Data Engineering Bulletin, 23(4):3-13, Dec. 2000.
V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Data Cleaning System. VLDB 2001, pp. 381-390, Roma, Italy.
S. Sarawagi and A. Bhamidipaty. Interactive Deduplication Using Active Learning. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery in Databases, Edmonton, Canada, Jul. 23-26, 2002.
Chaudhuri Surajit
Ganjam Kris
Ganti Venkatesh
Motwani Rajeev
Gaffin Jeffrey
Hicks Michael J
Microsoft Corporation
Efficient fuzzy match for evaluating data records