Wrapper induction by hierarchical data analysis

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

C707S793000

active

06606625

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to information wrapper generating methods and more particularly to machine learning method for wrapper construction that enables easier generation of wrappers.
2. Description of the Related Art
With the expansion of the Web, computer users have gained access to a large variety of comprehensive information repositories. However, the Web is based on a browsing paradigm that makes it difficult to retrieve and integrate data from multiple sources. The most recent generation of information agents (e.g., WHIRL, Ariadne, and Information Manifold) addresses this problem by enabling information from pre-specified sets of Web sites to be accessed via database-like queries. Consider, for instance, the query “What seafood restaurants in L.A. have prices below $20 and accept the Visa credit card?” Assume that there are two information sources that provide information about L.A. restaurants: the Zagat Guide and L.A. Weekly (see FIG. 1). To answer this query, an information agent could use Zagat to identify seafood restaurants under $20 and then use L.A. Weekly to check which of these accept Visa.
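The two-step lookup described above can be sketched as a small join over already-extracted records. This is only an illustration: the restaurant records, field names, and the function `cheap_seafood_accepting` are hypothetical and do not come from either source or from the patent.

```python
# Hypothetical records, as if wrappers had already extracted them
# from the two sources named in the text.
zagat = [
    {"name": "Blue Fin", "cuisine": "seafood", "price": 18},
    {"name": "Harbor House", "cuisine": "seafood", "price": 35},
    {"name": "Casa Verde", "cuisine": "mexican", "price": 12},
]
la_weekly = [
    {"name": "Blue Fin", "cards": {"Visa", "MasterCard"}},
    {"name": "Harbor House", "cards": {"Visa"}},
    {"name": "Casa Verde", "cards": {"Visa"}},
]


def cheap_seafood_accepting(card, max_price):
    """Answer the example query by combining both sources on the restaurant name."""
    # Step 1: use the first source to find seafood restaurants under the price limit.
    candidates = {
        r["name"]
        for r in zagat
        if r["cuisine"] == "seafood" and r["price"] < max_price
    }
    # Step 2: use the second source to keep only those that accept the card.
    return sorted(
        r["name"]
        for r in la_weekly
        if r["name"] in candidates and card in r["cards"]
    )


print(cheap_seafood_accepting("Visa", 20))  # ['Blue Fin']
```

The sketch assumes both sources spell restaurant names identically; a real agent would need some form of name matching before joining.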
Information agents generally rely on “wrappers” to extract information from semistructured Web pages. A page is semistructured if the desired information can be located using a concise, formal grammar. Each wrapper consists of a set of extraction rules and the code required to apply those rules to the semistructured Web pages. Some systems, such as TSIMMIS and ARANEUS, depend on humans to write the necessary grammar rules. However, there are several reasons why this is undesirable. Writing extraction rules is tedious, time-consuming, and requires a high level of expertise. These difficulties are multiplied when an application domain involves a large number of existing sources or the format of the source documents changes over time. It would be much more advantageous to have information agents that could accommodate the flexibility and spontaneous nature of the Web so that desired information can be gleaned from any format of presentation.
Research on learning extraction rules has occurred mainly in two contexts: creating wrappers for information agents and developing general purpose information extraction systems for natural language text. The former are primarily used for semistructured information sources, and their extraction rules rely heavily on the regularities in the structure of the documents; the latter are applied to free text documents and use extraction patterns that are based on syntactic and semantic information.
With the increasing interest in accessing Web-based information sources, a significant number of research projects depend on wrappers to retrieve the relevant data. A wide variety of languages have been developed for manually writing wrappers (i.e., where the extraction rules are written by a human expert), from procedural languages and Perl scripts to pattern matching and LL(k) grammars. Even though these systems offer fairly expressive extraction languages, manual wrapper generation is a tedious, time-consuming task that requires a high level of expertise. Furthermore, the rules for the wrappers have to be rewritten whenever the sources undergo format changes. To help users cope with these difficulties, Ashish and Knoblock proposed an expert system approach that uses a fixed set of heuristics of the type “look for bold or italicized strings”.
The wrapper induction techniques introduced in WIEN (Kushmerick, 1997) are better suited to frequent format changes because they rely on learning techniques to generate the extraction rules. Compared to manual wrapper generation, Kushmerick's approach has the advantage of dramatically reducing both the time and the effort required to wrap a source; however, his extraction language is significantly less expressive than the ones provided by the manual approaches. In fact, the WIEN extraction language is a 1-disjunctive LA (landmark automaton, below) that is interpreted as a SkipTo( ) and does not allow the use of wildcards. There are several other important differences between STALKER (the present invention) and WIEN. First, because WIEN learns the landmarks by searching for common prefixes at the character level, it needs more training examples than STALKER. Second, WIEN cannot wrap sources in which some items are missing or appear in various orders. Last but not least, STALKER can handle EC (embedded catalog) trees of arbitrary depth, while WIEN's approach to nested documents turns out to be prohibitive in terms of CPU time.
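To make the idea of a SkipTo( ) landmark rule concrete, the following minimal sketch scans a token sequence for a fixed landmark and returns the position just past it. It is an illustration only, not WIEN's or STALKER's implementation: the tokenizer, the function names `tokenize`, `skip_to`, and `apply_rule`, and the sample page are all assumptions, and matching is done on tokens without wildcards.

```python
import re


def tokenize(text):
    # Assumed tokenization: HTML tags, runs of word characters, single punctuation marks.
    return re.findall(r"<[^>]+>|\w+|[^\w\s]", text)


def skip_to(tokens, start, landmark):
    """Return the index just past the first match of `landmark` at or after `start`, else None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return None


def apply_rule(tokens, landmarks):
    """Apply a sequence of SkipTo steps; the result is where the item starts."""
    pos = 0
    for landmark in landmarks:
        pos = skip_to(tokens, pos, landmark)
        if pos is None:
            return None  # the rule does not match this page
    return pos


page = "<p>Name: <b>Yala</b><p>Cuisine: Thai<p>Address: <i>4000 Colfax, Phoenix</i>"
tokens = tokenize(page)
start = apply_rule(tokens, [["Address", ":", "<i>"]])
print(start, tokens[start])  # 14 4000
```

A rule with one SkipTo step and literal tokens only, as in the usage above, corresponds to the restricted case the paragraph attributes to WIEN; passing several landmarks to `apply_rule` chains multiple SkipTo steps.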
SoftMealy (Hsu and Dung) uses a wrapper induction algorithm that generates extraction rules expressed as finite transducers. The SoftMealy rules are more general than the WIEN ones because they use wildcards and they can handle both missing items and items appearing in various orders. The SoftMealy extraction language is a k-disjunctive LA, where each disjunct is either a SkipTo( )NextLandmark( ) or a single SkipTo( ). As SoftMealy uses neither multiple SkipTo( )s nor SkipUntil( )s, it follows that its extraction rules are strictly less expressive than STALKER's. Finally, SoftMealy has one additional drawback: in order to deal with missing items and various orderings of items, SoftMealy has to see training examples that include each possible ordering of the items.
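A k-disjunctive rule with wildcards can be sketched as an ordered list of alternatives, where the first alternative that matches decides the result. The code below is a simplified illustration, not SoftMealy's transducer representation: each disjunct is a single SkipTo( ) landmark, and the wildcard names `Number` and `HtmlTag`, the helper names, and the sample pages are assumptions.

```python
# Assumed wildcard classes; any other symbol is matched as a literal token.
WILDCARDS = {
    "Number": lambda tok: tok.isdigit(),
    "HtmlTag": lambda tok: tok.startswith("<") and tok.endswith(">"),
}


def matches(token, symbol):
    test = WILDCARDS.get(symbol)
    return test(token) if test else token == symbol


def skip_to(tokens, start, landmark):
    """Return the index just past the first match of `landmark` at or after `start`, else None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if all(matches(t, s) for t, s in zip(tokens[i:i + n], landmark)):
            return i + n
    return None


def apply_disjunctive(tokens, disjuncts):
    """Try each disjunct (one SkipTo landmark) in order; the first match wins."""
    for landmark in disjuncts:
        pos = skip_to(tokens, 0, landmark)
        if pos is not None:
            return pos
    return None


# Two hypothetical page layouts: the address is italicized in one but not the other.
page_a = ["Name", ":", "<b>", "Yala", "</b>", "Address", ":", "<i>", "4000", "Colfax", "</i>"]
page_b = ["Name", ":", "<b>", "Taco", "</b>", "Address", ":", "523", "Vernon"]

rule = [["<i>"], ["HtmlTag", "Address", ":"]]
print(page_a[apply_disjunctive(page_a, rule)])  # 4000
print(page_b[apply_disjunctive(page_b, rule)])  # 523
```

The second disjunct covers the layout that the first one misses, which is the kind of formatting variation a disjunctive rule is meant to absorb.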
In contrast to information agents, most general purpose information extraction systems are designed for unstructured text, and therefore their extraction techniques are based on linguistic constraints. However, there are three such systems that are somewhat related to STALKER: WHISK, Rapier, and SRV. The extraction rules induced by Rapier and SRV can use the landmarks that immediately precede and/or follow the item to be extracted, while WHISK is capable of using multiple landmarks. However, like STALKER and unlike WHISK, Rapier and SRV extract a particular item independently of the other relevant items. It follows that WHISK has the same drawback as SoftMealy: in order to correctly handle missing items and items that appear in various orders, WHISK must see training examples for each possible ordering of the items. None of these three can handle embedded data, though all use powerful linguistic constraints that are beyond STALKER's capabilities.
SUMMARY OF THE INVENTION
The present invention provides means by which extraction rules for wrappers may be automatically generated once correct examples have been provided. Using a graphical user interface (GUI), a user marks the information that is desired from a realm of similar data collections. For example, if one set of Web pages is marked for addresses, the GUI passes the relevant token sequences identifying the borders, perimeters, and/or prefix/suffix of the indicated portion to a rule-generating program/system denominated herein as STALKER. STALKER treats these collections of token sequences as identifying certain data fields of interest, in this case addresses, and uses the token sequences and derivatives thereof to generate extraction rules for wrappers.
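As a rough illustration of deriving a rule from marked examples, the sketch below proposes a candidate landmark by taking the longest token sequence that immediately precedes the marked item in every example. This longest-common-suffix heuristic is a deliberately simple stand-in and is not STALKER's learning procedure, which this passage does not detail; the function names and the example pages are hypothetical.

```python
def common_suffix(prefixes):
    """Longest token sequence that ends every prefix."""
    suffix = []
    # Walk all prefixes from their last token backwards, in lockstep.
    for column in zip(*(reversed(p) for p in prefixes)):
        if len(set(column)) != 1:
            break
        suffix.append(column[0])
    return suffix[::-1]


def induce_landmark(examples):
    """Each example is (tokens, start), where `start` is the index the user marked."""
    return common_suffix([tokens[:start] for tokens, start in examples])


# Two hypothetical pages with the start of the address marked by the user.
ex1 = (["Name", ":", "Yala", "Address", ":", "<i>", "4000", "Colfax", "</i>"], 6)
ex2 = (["Name", ":", "Kim", "Cuisine", ":", "Thai",
        "Address", ":", "<i>", "12", "Pico", "</i>"], 9)

print(induce_landmark([ex1, ex2]))  # ['Address', ':', '<i>']
```

The resulting token sequence could serve as the landmark of a SkipTo( ) step; a practical learner would also have to check the candidate against the examples and refine it when it matches too early.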
This process of extracting rules for wrappers is highly advantageous as the wrappers are then able to go out to other data collections, such as other Web pages, and extract the address or other desired information. This makes available the coherent, controlled, predictable and facile operation-generation of information agents. Such agents can be unleashed upon data collections to extract the desired information. An information automaton is then achievable that may allow the user to gather information from an identified and semi-structured source. While suffering some limitations, the present invention may provide a stepping stone to an ultimate goal of harvesting information from unpredictable, but stable, information sources such as the Internet itself. The user does no
