Method and system for improving data quality in large...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details

C715S252000, C715S252000

Reexamination Certificate

active

06968331

ABSTRACT:
A computing system and method clean a set of hypertext documents to minimize violations of a Hypertext Information Retrieval (IR) rule set. The system and method then perform an information retrieval operation on the resulting cleaned data. The cleaning process includes decomposing each page of the set of hypertext documents into one or more pagelets, identifying possible templates, and eliminating the templates from the data. Traditional IR search and mining algorithms can then be run on the remaining pagelets, rather than on the original pages, to provide cleaner, more precise results.
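The three cleaning steps named in the abstract (decompose pages into pagelets, identify templates, eliminate them) can be illustrated with a minimal Python sketch. It is not the patented method: the blank-line splitting rule, the SHA-1 fingerprint, the `min_pages` threshold, and all function names are assumptions chosen only to make the idea concrete.

```python
import hashlib
from collections import defaultdict


def split_into_pagelets(page_text):
    """Hypothetical decomposition: each blank-line-separated block is one pagelet."""
    blocks = (block.strip() for block in page_text.split("\n\n"))
    return [block for block in blocks if block]


def fingerprint(pagelet):
    """Hash case- and whitespace-normalized text so identical pagelets collide."""
    normalized = " ".join(pagelet.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def clean_pages(pages, min_pages=2):
    """Remove pagelets whose fingerprint occurs on at least `min_pages` pages.

    `pages` maps a page id to its text. The threshold is an assumed heuristic
    for "possible template", not a value taken from the patent.
    """
    pagelets_by_page = {pid: split_into_pagelets(text) for pid, text in pages.items()}

    # Record which pages each pagelet fingerprint appears on.
    pages_seen = defaultdict(set)
    for pid, pagelets in pagelets_by_page.items():
        for pagelet in pagelets:
            pages_seen[fingerprint(pagelet)].add(pid)

    # A fingerprint shared by enough pages is treated as a template.
    templates = {fp for fp, pids in pages_seen.items() if len(pids) >= min_pages}

    # Keep only the non-template pagelets of each page.
    return {
        pid: [p for p in pagelets if fingerprint(p) not in templates]
        for pid, pagelets in pagelets_by_page.items()
    }


pages = {
    "a": "Home | About | Contact\n\nArticle on shingling.",
    "b": "Home | About | Contact\n\nArticle on link analysis.",
}
cleaned = clean_pages(pages)
```

In this toy input the navigation bar appears on both pages, so it is dropped and only the article text of each page remains. A search or mining step would then index the surviving pagelets instead of the full pages.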

REFERENCES:
patent: 5909677 (1999-06-01), Broder et al.
patent: 6119124 (2000-09-01), Broder et al.
patent: 6138113 (2000-10-01), Dean et al.
patent: 6230155 (2001-05-01), Broder et al.
patent: 6349296 (2002-02-01), Broder et al.
patent: 6614764 (2003-09-01), Rodeheffer et al.
patent: 6615209 (2003-09-01), Gomes et al.
patent: 6658423 (2003-12-01), Pugh et al.
patent: 6665837 (2003-12-01), Dean et al.
Manber, U. “Finding Similar Files in a Large File System”, Technical Report TR 93-33, University of Arizona, Department of Computer Science, Oct. 1993.
Broder, A.Z. “Some Applications of Rabin's Fingerprinting Method”, in R. Capocelli, A. De Santis, U. Vaccaro (eds), “Sequence II: Methods in Communications, Security and Computer Science”, Springer-Verlag, 1993.
Agrawal, R. and R. Srikant “Fast Algorithms for Mining Association Rules”, Proceedings of the 20th VLDB Conference, pp. 487-499, 1994.
Brin, S., J. Davis and H. Garcia-Molina “Copy Detection Mechanisms for Digital Documents”, Proceedings of the ACM SIGMOD Conference, pp. 398-409, May 1995.
Heintze, N. “Scalable Document Fingerprinting (Extended Abstract)”, Proceedings of the 1996 USENIX Workshop on Electronic Commerce, Nov. 1996.
Broder, A.Z. “On the Resemblance and Containment of Documents”, Proceedings of Compression and Complexity of SEQUENCES, p. 21, Jun. 11-13, 1997.
Broder, A.Z., S.C. Glassman, M.S. Manasse and G. Zweig “Syntactic Clustering of the Web”, Proceedings of the 6th International World Wide Web (WWW) Conference (WWW6), pp. 1157-1166, 1997.
Fang, M., N. Shivakumar, H. Garcia-Molina, R. Motwani and J.D. Ullman “Computing Iceberg Queries Effectively”, Proceedings of the 24th VLDB Conference, 1998.
Kumar, R., P. Raghavan, R. Rajagopalan and A. Tomkins “Trawling the Web for Emerging Cyber-Communities”, Proceedings of the 8th International World Wide Web (WWW) Conference (WWW8), pp. 1481-1493, 1999.
W3C “Document Object Model (DOM) Level 2 Core Specification Version 1.0, W3C Recommendation Nov. 13, 2000”, downloaded from www.w3.org.
Davison, B.D. “Recognizing Nepotistic Links on the Web”, Proceedings of the AAAI-2000 Workshop on Artificial Intelligence for Web Search, pp. 23-28, 2000.
Chakrabarti, S., M. Joshi and V. Tawde “Enhanced Topic Distillation Using Text, Markup Tags and Hyperlinks”, Proceedings of the ACM SIGIR Conference, Sep. 9-12, 2001.
Crescenzi, V., G. Mecca and P. Merialdo “RoadRunner: Towards Automatic Data Extraction from Large Web Sites”, Proceedings of the 27th VLDB Conference, 2001.
Bar-Yossef, Z. and S. Rajagopalan “Template Detection via Data Mining and its Applications”, Proceedings of the WWW2002 Conference, pp. 580-591, May 7-11, 2002.
Haveliwala, T.H., A. Gionis, D. Klein and P. Indyk “Evaluating Strategies for Similarity Search on the Web”, Proceedings of the WWW2002 Conference, May 7-11, 2002.
Crescenzi, V., G. Mecca and P. Merialdo “RoadRunner: Automatic Data Extraction from Data-Intensive Web Sites”, Proceedings of the ACM SIGMOD Conference, p. 624, Jun. 4-6, 2002.
Laender, A.H.F., B.A. Ribeiro-Neto, A.S. da Silva and J.S. Teixeira “A Brief Survey of Web Data Extraction Tools”, SIGMOD Record, vol. 31, No. 2, pp. 84-93, Jun. 2002.
Arasu, A. and H. Garcia-Molina “Extracting Structured Data from Web Pages”, Proceedings of the ACM SIGMOD Conference, Jun. 9-12, 2003.
Yi, L., B. Liu and X. Li “Eliminating Noisy Information in Web Pages for Data Mining”, Proceedings of the ACM SIGKDD Conference, Aug. 24-27, 2003.
Ma, L., N. Goharian, A. Chowdhury and M. Chung “Extracting Unstructured Data from Template Generated Web Documents”, Proceedings of the 12th International Conference on Information and Knowledge Management, pp. 512-515, Nov. 3-8, 2003.
Huang, L. “A Survey on Web Information Retrieval Technologies”, Technical Report TR-120, Experimental Computer Systems Lab (ECSL), Department of Computer Science, SUNY Stony Brook, Feb. 2000.
Bharat, K. and A. Broder “Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content”, Proceedings of the 8th International Conference on the World Wide Web (WWW99), May 1999.
Shivakumar, N. and H. Garcia-Molina “SCAM: A Copy Detection Mechanism for Digital Documents”, Proceedings of the 2nd Annual Conference on Theory and Practice of Digital Libraries, Jun. 1995.
Broder, A.Z., Glassman, S.C. and Manasse, M.S., “Syntactic Clustering of the Web,” In Proceedings of the 6th International World Wide Web Conference (WWW6), pp. 1157-1166, 1997.
Bharat, K. and Henzinger, M.R., “Improved Algorithms for Topic Distillation in a Hyperlinked Environment,” In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 104-111, 1998.
Brin, S. and Page, L., “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” In Proceedings of the 7th International World Wide Web Conference (WWW7), pp. 107-117, 1998.
Chakrabarti, S., Dom, B.E., Gibson, D., Kleinberg, J.M., Raghavan, P. and Rajagopalan, S., “Automatic Resource List Compilation by Analyzing Hyperlink Structure and Associated Text,” In Proceedings of the 7th International World Wide Web Conference (WWW7), pp. 65-74, 1998.
Chakrabarti, S., Dom, B.E., Gibson, D., Kleinberg, J.M., Kumar, S.R., Raghavan, P., Rajagopalan, S. and Tomkins, A., “Hypersearching the Web,” Scientific American, Jun. 1999.
Chakrabarti, S., Dom, B. and Indyk, P., “Enhanced Hypertext Categorization Using Hyperlinks,” In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, pp. 307-318, 1998.
Chakrabarti, S., van den Berg, M. and Dom, B.E., “Distributed Hypertext Resource Discovery through Examples,” In Proceedings of the 25th International Conference on Very Large Databases (VLDB), pp. 375-386, 1999.
Chakrabarti, S., van den Berg, M. and Dom, B.E., “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” In Proceedings of the 8th International World Wide Web Conference (WWW8), pp. 1623-1640, 1999.
Davison, B.D., “Recognizing Nepotistic Links on the Web,” In Proceedings of the AAAI-2000 Workshop on Artificial Intelligence for Web Search, pp. 23-28, 2000.
Dean, J. and Henzinger, M.R., “Finding Related Pages in the World Wide Web,” In Proceedings of the 8th International World Wide Web Conference (WWW8), pp. 1467-1479, 1999.
Gibson, D., Kleinberg, J.M. and Raghavan, P., “Inferring Web Communities from Link Topology,” In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, pp. 225-234, 1998.
Google. Google. http://www.google.com.
Kleinberg, J.M., “Authoritative Sources in a Hyperlinked Environment,” Journal of the ACM, pp. 604-632, 1999.
Kumar, R., Raghavan, P., Rajagopalan, S. and Tomkins, A., “Trawling the Web for Emerging Cyber-Communities,” In Proceedings of the 8th International World Wide Web Conference (WWW8), pp. 1481-1493, 1999.
Lempel, R. and Moran, S., “The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect,” In Proceedings of the 9th International World Wide Web Conference (WWW9), pp. 38


Profile ID: LFUS-PAI-O-3509111
