Method and apparatus for extracting data from data sources...

Data processing: artificial intelligence – Machine learning

Reexamination Certificate

Details

C706S047000, C706S025000, C707S793000

active

06516308

ABSTRACT:

TECHNICAL FIELD
The present invention is directed to a method and apparatus for extracting data from data sources on a network and, more particularly, to a method and apparatus for learning general data extraction heuristics from known data extraction programs for respective data sources to obtain a general data extraction procedure.
BACKGROUND OF THE INVENTION
Computer networks are widely used to facilitate the exchange of information. A network may be a local area network (LAN), a wide-area network (WAN), a corporate Intranet, or the Internet.
The Internet is a series of interconnected networks. Users connected to the Internet have access to the vast amount of information found on these networks. Online servers and Internet providers allow users to search the World Wide Web (Web), a globally connected network on the Internet, using software programs known as search engines. The Web is a collection of Hypertext Markup Language (HTML) documents on computers that are distributed over the Internet. The collection of Web pages represents one of the largest databases in the world. However, accessing information on individual Web pages is difficult because Web pages are not a structured source of data. There is no standard organization of information provided by a Web page, as there is in traditional databases.
Attempts have been made to address the problem of accessing data from Web pages. For example, information integration systems have been developed to allow a user to query structured information that has been extracted from the Web and stored in a knowledge base. In such systems, information is extracted from Web pages using special-purpose programs or “wrappers”. These special-purpose programs convert Web pages into an appropriate format for the knowledge base. In order to extract data from a particular Web page, a user must write a wrapper that is specific to the format of that Web page. Therefore, a different wrapper must be written for the format of each Web page that is accessed. Because data can be presented in many different formats, and because Web pages frequently change, building and maintaining wrappers and information integration systems is time-consuming and tedious.
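To make the format dependence of a wrapper concrete, the following is a minimal sketch in Python, not taken from the patent. The class name ListItemWrapper, the function wrap, and both sample pages are hypothetical; the sketch assumes a page format in which each record sits in its own <li> element.

```python
from html.parser import HTMLParser


class ListItemWrapper(HTMLParser):
    """Hypothetical wrapper for a page format that puts each record in an <li>."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        # Collect text only while inside an <li> element.
        if self._in_item:
            self.items[-1] += data


def wrap(page):
    """Convert one page into a list of records for a knowledge base."""
    parser = ListItemWrapper()
    parser.feed(page)
    parser.close()
    return [item.strip() for item in parser.items]


# Two made-up pages carrying the same data in different formats.
list_page = "<ul><li>Alice</li><li>Bob</li></ul>"
table_page = "<table><tr><td>Alice</td></tr><tr><td>Bob</td></tr></table>"

print(wrap(list_page))   # ['Alice', 'Bob']
print(wrap(table_page))  # []
```

The second call returns nothing even though the table page carries the same records, which is why a separate wrapper would be needed for each page format.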
A number of proposals have been made for reducing the cost of building wrappers. Data exchange standards such as the Extensible Markup Language (XML) have promise, but such standards are not yet widely used. In addition, Web information sources using legacy formats, like HTML, will be common for some time, and therefore, extraction methods must be able to extract information from these legacy formats. Special languages for writing wrappers and semi-automated tools for wrapper construction have been proposed, as well as systems that allow wrappers to be trained from examples. However, none of these proposals eliminates the human effort involved in creating a wrapper for a Web page. Moreover, the training methods are directed to learning a wrapper for Web pages with a single, specific format. Consequently, a new training process is required for each Web page format.
When a learning system is used, for example, a person must label the samples given to the learning algorithm. More particularly, a user must label the first few items that should be extracted from the particular Web page, starting from the top of the page. These are assumed to be a complete list of the items to be extracted up to that point. That is, it is assumed that any unmarked text preceding the last marked item should not be extracted. The learning system then learns a wrapper from these examples, and uses it to extract data from the remainder of the Web page. The learned wrapper can be used for other Web pages with the same format as the page used in training. Therefore, in the learning system, human input is required to determine the page-specific wrapper.
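The labeling scheme described above can be sketched with a deliberately simple learner. This is an illustrative assumption, not the learning system of any cited work: the hypothetical learn_wrapper function takes the tag immediately before and after each labeled item as delimiters, and rejects them if they would extract any unmarked text preceding the last marked item. The sample page is made up.

```python
import re

TAG = r"<[^<>]+>"  # a single HTML tag


def extract(page, wrapper):
    """Apply a (left, right) delimiter wrapper to a page."""
    left, right = wrapper
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, page, flags=re.DOTALL)


def learn_wrapper(page, labeled_items):
    """Learn tag delimiters from the first few labeled items, or return None."""
    lefts, rights = set(), set()
    pos = 0
    for item in labeled_items:
        start = page.index(item, pos)  # raises ValueError if the item is absent
        end = start + len(item)
        before = re.search(TAG + r"\Z", page[:start])  # tag right before the item
        after = re.match(TAG, page[end:])              # tag right after the item
        if before is None or after is None:
            return None
        lefts.add(before.group())
        rights.add(after.group())
        pos = end
    if len(lefts) != 1 or len(rights) != 1:
        return None  # the labeled items do not share one pair of delimiters
    wrapper = (lefts.pop(), rights.pop())
    # The labels are assumed complete up to the last labeled item, so the
    # wrapper must extract exactly those items from that part of the page.
    labeled_prefix = page[:pos + len(wrapper[1])]
    if extract(labeled_prefix, wrapper) != list(labeled_items):
        return None
    return wrapper


page = ("<h1>Staff</h1><ul><li>Alice</li><li>Bob</li>"
        "<li>Carol</li><li>Dave</li></ul>")
wrapper = learn_wrapper(page, ["Alice", "Bob"])
print(wrapper)                 # ('<li>', '</li>')
print(extract(page, wrapper))  # ['Alice', 'Bob', 'Carol', 'Dave']
```

Labeling only "Bob" would make learn_wrapper return None in this sketch, because the candidate delimiters would also extract the unmarked "Alice" that precedes the last marked item.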
These problems are not limited to retrieving data from HTML documents. These problems exist for documents found on any network.
Therefore, a general, page-independent data extraction procedure was needed to enable a user to easily and accurately extract data from data sources having many different formats. Additionally, an improved format-specific data extraction procedure was needed to accurately extract data from data sources. A procedure was also needed for determining a ranked list of possible data extraction procedures available for accurately extracting data from a data source. The present invention was developed to accomplish these and other objectives.
SUMMARY OF THE INVENTION
In view of the foregoing, it is a principal object of the present invention to provide a method and apparatus that eliminate the deficiencies of the prior art.
It is a further object of the present invention to provide a method and apparatus for learning general data extraction heuristics to generate a general data extraction procedure to enable a user to extract data from a data source on a network, regardless of the format of the data source.
It is another object of the present invention to provide a method and apparatus for learning a general data extraction procedure and for using this procedure to learn a format-specific wrapper.
It is yet a further object of the present invention to provide a method and apparatus for generating a ranked list of wrappers available for accurately extracting data for a particular data source on a network.
These and other objects are achieved by the present invention, which according to one aspect, provides a method and apparatus for learning a general data extraction procedure from a set of working wrappers and the data sources they correctly wrap. New data sources that are correctly wrapped by the learned procedure can be incorporated into a knowledge base.
According to another aspect of the present invention, a method and apparatus are provided for using the general data extraction heuristics learned for the general procedure to learn a specific data extraction procedure for each respective data source.
According to yet another aspect of the present invention, a list of possible wrappers for a data source is generated, where the wrappers in the list are ranked according to performance level.
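As a rough illustration of ranking wrappers by performance level, the sketch below scores each candidate wrapper against a few items known to be correct and sorts the candidates by score. The text above does not define how performance is measured; the F1 measure, the candidate wrappers, the function names, and the sample page are all assumptions made for this example.

```python
import re


def f1_score(extracted, expected):
    """Harmonic mean of precision and recall over sets of extracted items."""
    extracted, expected = set(extracted), set(expected)
    hits = len(extracted & expected)
    if hits == 0:
        return 0.0
    precision = hits / len(extracted)
    recall = hits / len(expected)
    return 2 * precision * recall / (precision + recall)


def rank_wrappers(wrappers, page, expected):
    """Return (name, score) pairs sorted from best to worst score."""
    scored = [(name, f1_score(fn(page), expected)) for name, fn in wrappers.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidate wrappers, each mapping a page to a list of items.
candidates = {
    "bold_text": lambda page: re.findall(r"<b>(.*?)</b>", page),
    "any_tag_text": lambda page: re.findall(r">([^<>]+)<", page),
    "list_items": lambda page: re.findall(r"<li>(.*?)</li>", page),
}

page = "<h1>Staff</h1><ul><li>Alice</li><li>Bob</li></ul><b>Contact us</b>"
ranking = rank_wrappers(candidates, page, expected=["Alice", "Bob"])
for name, score in ranking:
    print(f"{name}: {score:.2f}")
# list_items: 1.00
# any_tag_text: 0.67
# bold_text: 0.00
```

Here the wrapper that extracts every text node finds both expected items but also two unwanted ones, so it ranks below the list-item wrapper and above the wrapper that finds none of them.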


