Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2005-07-28
2009-10-20
Vo, Tim T. (Department: 2168)
C707S793000
active
07606816
ABSTRACT:
Techniques for identifying discrete records within a multi-record document are provided. According to one technique, a document is encoded based on some combination of visual tag encoding, text category encoding, and text content encoding that produces hash values based on the contents of portions of the document. According to one technique, repeating candidate patterns are identified in a document so encoded. The candidate patterns may be identified in a “fuzzy” manner that allows for some inconsistencies in the individual pattern instances. According to one technique, the identified candidate patterns are validated based on specified factors to determine a “best” pattern. According to one technique, the boundaries of discrete records in a multi-record document are marked based on the portions of the document that correspond to an identified repeating pattern.
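The steps named in the abstract (encode the document, find a repeating pattern, validate it, mark record boundaries) can be illustrated with a minimal sketch. This is not the patented method: it assumes HTML-like input, matches patterns exactly rather than in the "fuzzy" manner the abstract describes, and validates candidates with a single assumed factor (how many symbols the repeats cover). All names here (`encode`, `text_symbol`, `find_occurrences`, `best_pattern`, `extract_records`) are hypothetical and chosen for illustration only.

```python
import hashlib
import re

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")      # a tag, or a run of text
NUMERIC_RE = re.compile(r"[$€£]?\d[\d.,]*")  # assumed "numeric" text category


def text_symbol(text, use_hash=False):
    """Encode a text run as a coarse category, or as a short content hash."""
    if use_hash:
        return "H" + hashlib.sha1(text.encode("utf-8")).hexdigest()[:6]
    return "NUM" if NUMERIC_RE.fullmatch(text) else "TXT"


def encode(document, use_hash=False):
    """Return parallel lists: one symbol per token, and the raw tokens."""
    symbols, tokens = [], []
    for raw in TOKEN_RE.findall(document):
        text = raw.strip()
        if not text:
            continue  # skip whitespace between tags
        if text.startswith("<"):
            name = re.match(r"</?\w+", text)
            if name is None:
                continue  # comments, doctype, etc. are ignored in this sketch
            symbol = name.group(0).lower() + ">"  # tag name only, no attributes
        else:
            symbol = text_symbol(text, use_hash)
        symbols.append(symbol)
        tokens.append(raw)
    return symbols, tokens


def find_occurrences(symbols, pattern):
    """Start indexes of non-overlapping, exact occurrences of pattern."""
    starts, i, n = [], 0, len(pattern)
    while i + n <= len(symbols):
        if tuple(symbols[i:i + n]) == pattern:
            starts.append(i)
            i += n
        else:
            i += 1
    return starts


def best_pattern(symbols, max_len=12):
    """Pick the repeating n-gram whose repeats cover the most symbols."""
    best, best_score = None, 0
    for n in range(2, min(max_len, len(symbols) // 2) + 1):
        candidates = set(zip(*(symbols[k:] for k in range(n))))
        for pattern in sorted(candidates):  # sorted for deterministic ties
            first = pattern[0]
            if not first.startswith("<") or first.startswith("</"):
                continue  # assume a record starts at an opening tag
            count = len(find_occurrences(symbols, pattern))
            score = count * n
            if count >= 2 and score > best_score:
                best, best_score = pattern, score
    return best


def extract_records(document):
    """Return the source text of each span matching the best pattern."""
    symbols, tokens = encode(document)
    pattern = best_pattern(symbols)
    if pattern is None:
        return []
    n = len(pattern)
    return ["".join(tokens[s:s + n]) for s in find_occurrences(symbols, pattern)]


if __name__ == "__main__":
    page = ("<ul><li><b>Widget</b> $10</li><li><b>Gadget</b> $25</li>"
            "<li><b>Gizmo</b> $7</li></ul>")
    for record in extract_records(page):
        print(record)
```

Counting only non-overlapping occurrences keeps a pattern made of two adjacent records from outscoring the single-record pattern. A fuzzy matcher would relax the equality test in `find_occurrences`, for example by tolerating a bounded number of mismatched symbols.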
REFERENCES:
patent: 5875446 (1999-02-01), Brown et al.
patent: 5977890 (1999-11-01), Rigoutsos et al.
patent: 6657564 (2003-12-01), Malik
patent: 2002/0056041 (2002-05-01), Moskowitz
patent: 2003/0115189 (2003-06-01), Srinivasa et al.
patent: 2005/0097160 (2005-05-01), Stob
patent: 2006/0010109 (2006-01-01), Harrity
patent: 2006/0059173 (2006-03-01), Hirsch et al.
“Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages”, Brigham Young University, Utah, USA, 1999, by Embley et al.
Chang, Chia-Hui, et al., “IEPAD: Information Extraction Based on Pattern Discovery,” WWW10 '01, May 1-5, 2001, Hong Kong, ACM 1-58113-348-0, pp. 681-688. (Text on enclosed CD-Rom).
Cohen, William W., et al., “A Flexible System for Wrapping Tables and Lists in HTML Documents,” Carnegie-Mellon University Department of Computer Science, Sep. 19, 2003, Retrieved from the internet at <www.cs.cmu.edu/People/wcohen/postscript/ws-chap-2002.pdf>, pp. 1-30. (Text on enclosed CD-Rom).
Doorenbos, Robert B., et al., “A Scalable Comparison-Shopping Agent for the World-Wide Web,” Department of Computer Science and Engineering, University of Washington, Seattle, WA., 10 pages. (Text on enclosed CD-Rom).
Eliassi-Rad, Tina, et al., “Using a Trained Text Classifier to Extract Information,” Computer-Sciences Department, University of Wisconsin, located on the internet at: <http://www.cs.wisc.edu/~eliassi/tech_report.pdf#search=‘Using%20a%20Trained%20Text%20Classifier%20to%20Extract%20Information’>, pp. 1-4. (Text on enclosed CD-Rom).
Embley, D.W., et al., “Record-Boundary Discovery in Web Documents,” Department of Computer Science, Brigham Young University, Dec. 1998, 12 pages. (Text on enclosed CD-Rom).
Hsu, Chun-Nan, et al., “Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web,” Information Systems, 1998, vol. 23, No. 8, pp. 521-538. (Text on enclosed CD-Rom).
Kushmerick, Nicholas, et al., “Information Extraction by Text Classification,” Smart Media Institute, Computer Science Department, University College Dublin, located on the internet at <http://www.cs.ucd.ie/staff/nick/home/research/download/kushmerick-atem2001.pdf#search=‘Information%20Extraction%20by%20Text%20Classification’>, pp. 1-7. (Text on enclosed CD-Rom).
Kushmerick, Nicholas, et al., “Wrapper induction: Efficiency and expressiveness,” Artificial Intelligence 118 (2000), pp. 15-68. (Text on enclosed CD-Rom).
Lerman, Kristina, et al., “Automatic Data Extraction from Lists and Tables in Web Sources,” Information Science Institute, University of California, located on the internet at: <http://www.isi.edu/~lerman/papers/lerman-atem2001.pdf#search='Automatic%20Data%20Extraction%20from%20Lists%20and%20Tables%20in%20Web%20Sources>, pp. 1-6. (Text on enclosed CD-Rom).
Muslea, Ion, et al., “A Hierarchical Approach to Wrapper Induction,” University of Southern California, <http://www.ai.sri.com/~muslea/PS/hwi_aa99.pdf#search=‘A%20Hierarchical%20Approach%20to%20Wrapper%20Induction’>, pp. 1-8. (Text on enclosed CD-Rom).
Nigam, Kamal, et al., “Text Classification from Labeled and Unlabeled Documents Using EM,” Machine Learning, located on the internet at: <http://www.kamalnigam.com/papers/emcat-mlj99.pdf#search=‘Text%20Classification%20from%20Labeled%20and%20Unlabeled%20Documents%20Using%20EM’>, pp. 1-34. (Text on enclosed CD-Rom).
“String Matching Algorithms,” Vilnius University, Department of Computer Science, Located on the internet at <www.mif.vu.lt/cs2/courses/ds99fa6.pdf>, pp. 1-25. (Text on enclosed CD-Rom).
Cohen, William W. et al., “A Structured Wrapper Induction System for Extracting Information Semi-Structured Documents,” WhizBang! Labs, 7 pages.
Kukarni, Parashuram, “REBIEX: Record Boundary Identification and Extraction through Pattern Mining,” Yahoo Research and Development Centre, 15 pages.
Hickman Palermo & Truong & Becker LLP
Nicholes Christian A.
Smith Garrett
Vo Tim T.
Yahoo! Inc.
Record boundary identification and extraction through...