Method and system for generating structured data from...

Data processing: presentation processing of document – operator i – Presentation processing of document – Layout

Reexamination Certificate


Details

C715S252000, C707S793000, C707S793000, C707S793000

active

06782505

ABSTRACT:

TECHNICAL FIELD OF THE INVENTION
The present invention relates generally to data acquisition and structuring systems and methods, and more particularly to a system and method for generating structured data outputs from semi-structured data inputs.
BACKGROUND OF THE INVENTION
The general field of this invention relates to generating structured data outputs from semi-structured data inputs. A particular application of the invention is acquiring and structuring data to form virtual internet databases. Virtual internet databases are databases whose content is owned, stored and managed on servers distributed across a computer network.
Recently, internet usage and access have increased markedly. The availability and quantity of information on the internet have also increased. Many software products that can produce printed reports can now produce WEB reports, which may be displayed on a WEB page. This is accomplished by embedding the text of the report within the markup language called HTML. Although posted reports and information appear as data on the WEB page, this HTML representation is not a data representation. Rather, the WEB browser serves as a vehicle to display information, much like a page in a textbook. This presents the problem of incompatibility between the HTML representation and PC desktop and server applications. Ultimately, the current practice of employing WEB browsers has reduced PCs back to “dumb” terminals. The graphics may be exciting, but functionally all the computing power is limited to providing users with little more than a sophisticated data viewing window.
Several methods have been developed to address the problem of moving semi-structured data from the internet to a PC or server application. These methods include ad hoc engineering methods, Graphical User Interface (GUI) methods, and machine learning methods.
Ad hoc methods entail writing specialized parsing programs in a language such as PERL or LEX to extract the necessary information. These types of programs are called wrappers. A wrapper is a software method that converts data such as HTML code into structured data for further processing. Wrappers of this type use regular expressions in the parsing process. Unfortunately, these ad hoc methods are labor intensive. Depending on the skill of the programmer and the complexity of the particular job, a wrapper can take days to develop. Also, these methods are not an option for an average internet user with no formal training or knowledge of HTML and programming methods.
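As an illustration of the kind of hand-written wrapper described above, the following is a minimal sketch in Python rather than PERL or LEX. The HTML fragment, the field names symbol and price, and the function name wrap are hypothetical and do not come from the patent.

```python
import re

# Hypothetical HTML fragment; the page layout and field names are assumptions.
HTML = """
<table>
<tr><td>IBM</td><td>104.50</td></tr>
<tr><td>MSFT</td><td>68.25</td></tr>
</table>
"""

# Hand-written pattern tied to the exact tag layout of the page.
ROW_PATTERN = re.compile(
    r"<tr><td>(?P<symbol>[A-Z]+)</td><td>(?P<price>[\d.]+)</td></tr>"
)

def wrap(html):
    """Convert every matching table row into a structured record."""
    return [
        {"symbol": m.group("symbol"), "price": float(m.group("price"))}
        for m in ROW_PATTERN.finditer(html)
    ]

records = wrap(HTML)
```

Even in this small case the author must know both the page's HTML layout and regular expression syntax, which is the skill barrier the paragraph above describes.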
Due to the tedious nature of custom wrapper design, further methods have been developed that employ GUIs to facilitate wrapper generation. The GUI hides all the engineering details beyond the extracted data pattern definitions. Like the ad hoc methods discussed above, these packages implement regular expression parsing algorithms. In general, these methods require some knowledge of both HTML and regular expressions, so they may not be suitable for some internet users.
Due to the use of regular expressions, both ad hoc methods and GUI methods can result in what are called brittle parses. A brittle parse is one that fails when the format of the HTML page changes. A single format change is not guaranteed to break the parse, but the likelihood is high enough to rule out any guarantee of robust behavior.
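A brittle parse can be reproduced with a short sketch. The pattern and both HTML fragments below are hypothetical; the second fragment carries the same data as the first, but with an added attribute and bold markup of the kind a content provider might introduce.

```python
import re

# Pattern written against the original layout only (an assumed example).
ROW_PATTERN = re.compile(r"<td>([A-Z]+)</td><td>([\d.]+)</td>")

original = "<tr><td>IBM</td><td>104.50</td></tr>"
# Same data after a cosmetic change to the markup.
reformatted = '<tr><td class="sym"><b>IBM</b></td><td>104.50</td></tr>'

def parse(html):
    """Return (symbol, price) string pairs found by the fixed pattern."""
    return ROW_PATTERN.findall(html)

before = parse(original)     # the pattern matches the original layout
after = parse(reformatted)   # the pattern no longer matches anything
```

The data values are unchanged between the two fragments, yet the second parse returns nothing, because the pattern encodes the markup rather than the data.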
Recently, machine learning methods have been developed to address the need for engineering skills in the development of wrappers. Given a set of similar WEB pages and an example of the data to be parsed from each page, these methods automatically generate a wrapper. Unfortunately, these methods require a large number of examples to reliably produce wrappers. An example of such a method can be found in A Hierarchical Approach to Wrapper Induction, Muslea, et al. (1999). This method may require 8-10 examples to produce the wrappers. The generated wrappers are based on regular expression techniques and are brittle. Although these wrappers may work for format changes known prior to wrapper generation, they may fail on empirical format changes, as do the regular expression based methods discussed above.
Ideally, a method would give a user access to semi-structured data for a PC or server application without requiring the user to have previous knowledge of HTML or regular expressions. In addition, it is advantageous if the method does not require the enumeration of examples covering possible format changes.
SUMMARY OF THE INVENTION
The present invention provides a system and method for acquiring and structuring data from semi-structured data sources that substantially eliminates or reduces disadvantages and problems associated with previously developed systems and methods used for developing structured data sources from on-line sources such as the Internet, intranets, or other network systems.
More specifically, the present invention provides a system and method for generating structured data outputs from semi-structured data sources. The steps of this method include generating an example output from an example generator. The example output is generated in response to the acquisition of a sequence of annotated strings. The annotated strings are generated in response to the acquisition and modification of as few as one data example and a corresponding coarse structure from a predetermined input source. Also, a second sequence of annotated strings is generated from input from a semi-structured data source. Both the example output and the second sequence of annotated strings are input to an acquisition engine that implements a grammar layer, incorporating a top-down parsing method, and a comparison layer. The structured data outputs are generated through the cooperation of the comparison layer and the grammar layer.
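The data flow described above can be sketched as follows. This is only an illustrative reading of the summary, not the patented implementation: the annotation scheme (NUMBER, UPPER, TEXT), the function names, and the toy grammar "records := record*, record := field field ..." are all assumptions introduced for the example.

```python
import re
from html.parser import HTMLParser

def annotate(text):
    """Assumed annotation scheme: classify a string by its surface shape."""
    if re.fullmatch(r"\d+(\.\d+)?", text):
        return "NUMBER"
    if re.fullmatch(r"[A-Z]+", text):
        return "UPPER"
    return "TEXT"

class _TextCollector(HTMLParser):
    """Collects non-empty text nodes and ignores the surrounding markup."""
    def __init__(self):
        super().__init__()
        self.strings = []

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.strings.append(data)

def annotated_strings(html):
    """Turn semi-structured input into a sequence of (annotation, string)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return [(annotate(s), s) for s in collector.strings]

def example_output(example, coarse_structure):
    """From a single data example, derive the expected annotation per field."""
    return [(field, annotate(example[field])) for field in coarse_structure]

def matches(expected, token):
    """Comparison layer (assumed): does a token fit the expected annotation?"""
    return token[0] == expected

def parse_record(template, tokens, pos):
    """Grammar layer (assumed): parse one record top-down, field by field."""
    record = {}
    for field, expected in template:
        if pos >= len(tokens) or not matches(expected, tokens[pos]):
            return None, pos
        record[field] = tokens[pos][1]
        pos += 1
    return record, pos

def acquire(template, tokens):
    """Parse records := record*, skipping tokens that cannot start a record."""
    records, pos = [], 0
    while pos < len(tokens):
        record, new_pos = parse_record(template, tokens, pos)
        if record is None:
            pos += 1
        else:
            records.append(record)
            pos = new_pos
    return records

# One data example plus its coarse structure (the ordered field names).
template = example_output({"symbol": "IBM", "price": "104.50"}, ["symbol", "price"])
page = ('<h1>Quotes</h1><tr><td class="s"><b>MSFT</b></td><td>68.25</td></tr>'
        '<p>ORCL <i>31.10</i></p>')
records = acquire(template, annotated_strings(page))
```

In this sketch the markup is discarded before parsing, so the two records on the hypothetical page are recovered even though they are marked up differently from each other and from the single example.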
The present invention provides an important technical advantage in that it does not require the user to have knowledge of HTML or knowledge of pattern matching languages. The graphical interface guides the user through a set-up phase and completely hides all technical details.
The present invention provides another important technical advantage in that it requires only a single data example. Once the set-up process is complete, the acquisition engine can be pointed to related WEB pages, as well as updated versions of the same page, and it will automatically extract data and route it to applications.
The present invention provides yet another technical advantage in that the system is able to cope with format changes in the source pages, including changes in the order of data values. Thus, the technology produces reliable results even when the data sources are re-formatted, updated or amended by the content providers.
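One way tolerance to reordered data values could work is to assign values to fields by what they look like rather than where they appear. The sketch below assumes a simple shape-based annotation (NUMBER, UPPER, TEXT) and a hypothetical template; the summary does not specify how the invention actually achieves this.

```python
import re

def annotate(text):
    """Assumed annotation scheme: classify a string by its surface shape."""
    if re.fullmatch(r"\d+(\.\d+)?", text):
        return "NUMBER"
    if re.fullmatch(r"[A-Z]+", text):
        return "UPPER"
    return "TEXT"

def assign_fields(template, values):
    """Give each value to the first unfilled field whose annotation matches."""
    record = {}
    for value in values:
        kind = annotate(value)
        for field, expected in template.items():
            if expected == kind and field not in record:
                record[field] = value
                break
    return record

# Hypothetical template, as might be derived from a single data example.
template = {"symbol": "UPPER", "price": "NUMBER"}

in_order = assign_fields(template, ["IBM", "104.50"])
swapped = assign_fields(template, ["104.50", "IBM"])
```

Both calls produce the same record. This simple scheme only works when the fields have distinguishable annotations; two numeric fields, for instance, would need additional context to tell apart.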


