Duplicate data elimination system
Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2007-10-23
2007-10-23
Pham, Hung Q (Department: 2168)
C707S793000
active
10453992
ABSTRACT:
A process for finding similar data records from a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group a canonical record is identified based on the similarity of data records to each other within the group.
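The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not name a similarity measure, a grouping rule, or a canonical-selection rule, so the sketch assumes Jaccard overlap of field-tagged tokens, connected components of the thresholded similarity graph, and the record with the highest summed similarity to the rest of its group. All function names and the sample records are hypothetical.

```python
from itertools import combinations


def tokenize(record):
    """Split each attribute value into lowercase tokens tagged with their field."""
    return {(field, tok) for field, value in record.items()
            for tok in value.lower().split()}


def similarity(a, b):
    """Jaccard overlap of field-tagged token sets (an assumed measure)."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def group_records(records, threshold):
    """Return (groups, edges): records are nodes, edges carry similarity scores."""
    n = len(records)
    edges = {}
    for i, j in combinations(range(n), 2):
        score = similarity(records[i], records[j])
        if score > threshold:
            edges[(i, j)] = score

    # Assumed grouping rule: connected components, found with union-find.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values()), edges


def canonical(group, edges):
    """Assumed rule: pick the record with the largest summed edge weight in its group."""
    members = set(group)

    def total(i):
        return sum(score for (a, b), score in edges.items()
                   if i in (a, b) and a in members and b in members)

    return max(group, key=total)


# Hypothetical sample data, for illustration only.
records = [
    {"name": "John Smith", "city": "Seattle"},
    {"name": "John Smith", "city": "Seattle WA"},
    {"name": "Jon Smith", "city": "Seattle WA"},
    {"name": "Mary Jones", "city": "Boston"},
]
groups, edges = group_records(records, threshold=0.5)
canonicals = [canonical(g, edges) for g in groups]
```

With these sample records, the first three rows fall into one group and the fourth stands alone. Connected components is a looser rule than requiring every pair in a group to exceed the threshold; a stricter clustering could be substituted without changing the rest of the sketch.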
REFERENCES:
patent: 5497486 (1996-03-01), Stolfo et al.
patent: 2004/0158562 (2004-08-01), Caulfield et al.
R. Ananthakrishna, S. Chaudhuri and V. Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. In Proceedings of VLDB, Hong Kong, 2002.
A.N. Arslan, O. Egecioglu and P.A. Pevzner. A New Approach to Sequence Comparison: Normalized Local Alignment. Bioinformatics, 17(4):327-337, 2001.
J.A. Aslam, K. Pelekhov and D. Rus. Static and Dynamic Information Organization with Star Clusters. CIKM 1998, pp. 208-217.
J.A. Aslam, K. Pelekhov and D. Rus. A Practical Clustering Algorithm for Static and Dynamic Information Organization. ACM-SIAM Symposium on Discrete Algorithms, 1999.
V. Borkar, K. Deshmukh and S. Sarawagi. Automatic Segmentation of Text Into Structured Records. In Proceedings of ACM SIGMOD Conference, Santa Barbara, CA, May 2001.
A. Broder, S. Glassman, M. Manasse and G. Zweig. Syntactic Clustering of the Web. In Proc. Sixth Int'l. World Wide Web Conference, World Wide Web Consortium, Cambridge, pp. 391-404, 1997.
K. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft. When is “Nearest Neighbor” Meaningful? International Conference on Database Theory, pp. 217-235, Jan. 1999.
S. Chaudhuri, K. Ganjam, V. Ganti and R. Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. In Proceedings of ACM SIGMOD, San Diego, CA, Jun. 2003.
W. Cohen. Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. In Proceedings of ACM SIGMOD, pp. 201-212, Seattle, WA, Jun. 1998.
R. Forino. Data e.quality: A Behind the Scenes Perspective on Data Cleansing. http://www.dmreview.com/, Mar. 2001.
H. Galhardas, D. Florescu, D. Shasha, E. Simon and C. Saita. Declarative Data Cleaning: Language, Model, and Algorithms. In Proceedings of the 27th International Conference on Very Large Databases, pp. 371-380, Roma, Italy, Sep. 11-14, 2001.
H. Galhardas, D. Florescu, D. Shasha and E. Simon. An Extensible Framework for Data Cleaning. In ACM SIGMOD, May 1999.
L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava. Approximate String Joins in a Database (Almost) For Free. In Proceedings of the VLDB 2001.
V. Ganti, J. Gehrke and R. Ramakrishnan. Cactus-Clustering Categorical Data Using Summaries. In Proceedings of the ACM SIGKDD Fifth International Conference on Knowledge Discovery in Databases, pp. 73-83, Aug. 15-18, 1999.
D. Gibson, J. Kleinberg and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. VLDB 1998, New York City, New York, Aug. 24-27.
S. Guha, R. Rastogi and K. Shim. Rock: A Robust Clustering Algorithm for Categorical Attributes. In Proceedings of the IEEE International Conference on Data Engineering, Sydney, Mar. 1999.
Y. Huhtala, J. Karkkainen, P. Porkka and H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. In Proceedings of the 14th International Conference on Data Engineering (ICDE), pp. 392-401, Orlando, Florida, Feb. 1998.
M. Hernandez and S. Stolfo. The Merge/Purge Problem for Large Databases. In Proceedings of the ACM SIGMOD, pp. 127-138, San Jose, CA, May 1995.
J. Madhavan, P. Bernstein and E. Rahm. Generic Schema Matching with Cupid. VLDB 2001, pp. 49-58, Roma, Italy.
A. Monge and C. Elkan. The Field Matching Problem: Algorithms and Applications. In Proceedings of the Second International Conference on Knowledge Discovery and Databases (KDD), 1996.
A. Monge and C. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. In Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, Tucson, Arizona, May 1997.
F. Naumann and C. Rolker. Do Metadata Models Meet IQ Requirements? In Proceedings of the International Conference on Data Quality (IQ), MIT, Cambridge, 1999.
E. Rahm and H. Hai Do. Data Cleaning: Problems and Current Approaches. IEEE Data Engineering Bulletin, 23(4):3-13, Dec. 2000.
V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Data Cleaning System. VLDB 2001, pp. 381-390, Roma, Italy.
S. Sarawagi and A. Bhamidipaty. Interactive Deduplication Using Active Learning. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery in Databases, Edmonton, Canada, Jul. 23-26, 2002.
Chaudhuri Surajit
Ganti Venkatesh
Kapoor Rahul
Microsoft Corporation
Pham Hung Q