System and method of automatic wrapper grammar generation

Data processing: presentation processing of document – operator i – Presentation processing of document – Layout

Reexamination Certificate

Details

C715S252000

active

06792576

ABSTRACT:

FIELD OF THE INVENTION
This invention relates generally to information retrieval and integration systems, and more particularly, to the creation and generation of wrapper grammars for extracting and regenerating information from documents stored in networked repositories.
BACKGROUND OF THE INVENTION
The World Wide Web (the “web” or “WWW”) is an architectural framework for accessing documents (or web pages) stored on a worldwide network of distributed servers called the Internet. An information source is any networked repository, e.g., a corporate database, a WWW site or any other processing service. Documents stored on the Internet are defined as web pages. The architectural framework of the web integrates web pages stored on the Internet using links. Web pages consist of elements that may include text, graphics, images, video and audio. All web pages or documents sent over the Web are prepared using HTML (hypertext markup language) format or structure. An HTML file includes elements describing the document's content as well as numerous markup instructions, which are used to display the document to the user on a display.
Access to online information via the Web is exploding. Search engines must integrate a huge variety of repositories that store this information in heterogeneous formats. While all files sent over the Web are prepared using HTML format, the heterogeneity issue remains in terms of both search query formats and search result formats. Search engines must provide homogeneous access to the underlying heterogeneous information and allow for homogeneous presentation of the information found.
A wrapper is a type of interface or container that is tied to data; it encapsulates and hides the intricacies of a remote information source in accordance with a set of rules known as a grammar or a wrapper grammar, providing two functions to an information broker. First, wrappers are used to translate a client query to a corresponding one that the remote information source will understand. Wrappers are associated with the particular information source. For example, HTTP wrappers interact with HTTP servers and HTML documents; JDBC wrappers work with ODBC-compliant databases; and DMA wrappers work with DMA-compliant document management systems.
Second, wrappers are used by search engines to extract the information stored in the HTML files representing the individual web pages; the wrapper scans the HTML files returned by the search engine, drops the markup instructions and extracts the information related to the query. If an information broker is involved, the wrapper parses (or processes) the results into a form that can be interpreted and filtered by the information broker. The wrapper then takes the search answers, either from the different document repositories or from the information broker, and puts them in a new format that can be viewed by the user. Extraction and parsing are done in accordance with the grammar or rules for the particular type of response file.
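The extraction step described above (scan the HTML file, drop the markup instructions, keep the content) can be illustrated with a minimal Python sketch. It is not the method of this invention; the class name TextExtractor, the function extract_text and the sample page are assumptions made for illustration, and only the standard html.parser module is used.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text content and ignores all markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text found between tags; tags themselves are dropped.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text fragments of an HTML string, in order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Illustrative response page (invented for this sketch).
sample = "<html><body><b>J. Smith</b><i>Wrapping the Web</i></body></html>"
print(extract_text(sample))  # prints ['J. Smith', 'Wrapping the Web']
```

Note that this sketch recovers the strings but not which attribute each string belongs to; assigning strings to attributes is the job of the wrapper grammar.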
Unfortunately, the HTML response files of document repositories and providers are generated for the convenience of visualization rather than information extraction. Moreover, response files from different information providers vary widely both in structure and in format: HTML, ODBC, DMA. Even among HTML providers, the format may vary. For example, some providers may generate HTML tags to separate each attribute of the document (author, title, journal, and date of publication). Other providers may link attributes, such as author and title, together, separating them not by an HTML tag, but by a grammatical separator such as a comma or semicolon.
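The difference between the two provider styles can be sketched as follows. Both record layouts, the tag choices and the function names parse_tagged and parse_delimited are hypothetical; the point is only that each layout needs its own extraction rule.

```python
import re


def parse_tagged(record):
    """Hypothetical provider A: one HTML tag pair per attribute."""
    m = re.fullmatch(r"<b>(.*?)</b><i>(.*?)</i>", record)
    if m is None:
        return None
    return {"author": m.group(1), "title": m.group(2)}


def parse_delimited(record):
    """Hypothetical provider B: author and title share one tag pair and
    are separated by a comma (a grammatical separator)."""
    m = re.fullmatch(r"<b>(.*?),\s*(.*)</b>", record)
    if m is None:
        return None
    return {"author": m.group(1), "title": m.group(2)}


a = parse_tagged("<b>J. Smith</b><i>Wrapping the Web</i>")
b = parse_delimited("<b>J. Smith, Wrapping the Web</b>")
print(a == b)  # prints True: same attributes, two different rules
```

A rule written for one provider returns None on the other provider's record, which is why a separate wrapper grammar is needed per response format.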
As a result, the analysis of response files and the creation of wrapper grammars in most search engines require human intervention. Because Web providers evolve and individual documents may change over time, human intervention is also needed each time the response structure or markup is changed. This makes the process of wrapper grammar creation and maintenance extremely time-consuming and error-prone.
Automatic induction (generation) of wrapper grammars has been studied in the literature. For example, Chidlovskii et al., “Towards Sophisticated Wrapping of Web-based Information Repositories,” Proc. Int'l RIAO '97 Conference, Montreal, pp. 123-135, 1997, describe a semi-automatic approach for wrapping of Web-based information repositories involving high-level text-processing tools based on grammar rules. While this method allows processing of any regular search result by a high-level grammar, it is not HTML oriented and is thus prone to errors or to stopping mid-analysis.
N. Kushmerick, Wrapper Induction for Information Extraction, Ph.D. Dissertation, Dept. Computer Science and Eng., University of Washington, Seattle, Wash., and Wrapper Induction: Efficiency and Expressiveness, AAAI'98 Workshop on AI and Information Integration, AAAI-98, identified some subclasses of HTML wrapper grammars which can be efficiently inferred. These particular subclasses assume a tabular structure of items on the response page. The wrapper grammar inference is therefore reduced to the efficient detection of tag sequences preceding each attribute in such a tabular form.
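To make the idea of delimiter-based extraction from a tabular page concrete, the following Python sketch applies one left and one right delimiter per attribute, row after row. It is an illustration in the spirit of the tabular subclasses mentioned above, not a reproduction of the cited algorithm: the delimiters are written by hand rather than induced, and the function name lr_extract and the sample page are assumptions.

```python
def lr_extract(page, delimiters):
    """Extract rows from a page with a tabular layout.

    delimiters holds one (left, right) string pair per attribute, in the
    order the attributes appear in each row. Hypothetical helper.
    """
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # no further complete row
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows  # unterminated attribute: stop
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


# Illustrative response page (invented for this sketch).
page = (
    "<tr><td><b>Smith</b></td><td><i>Title A</i></td></tr>"
    "<tr><td><b>Jones</b></td><td><i>Title B</i></td></tr>"
)
print(lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")]))
# prints [('Smith', 'Title A'), ('Jones', 'Title B')]
```

Inference in this setting amounts to finding such delimiter pairs from sample pages; the sketch only shows how they are used once known.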
I. Muslea et al., STALKER: Learning Extraction Rules for Semistructured, Web-based Information Sources, AAAI'98 Workshop on AI and Information Integration, 1998, considered a wider set of HTML wrapper grammars. This method goes beyond tables and also induces wrapper grammars in cases when some attributes are missing or their appearance order changes on the response page.
Despite reported success rates of about 65% (N. Kushmerick) and 75% (I. Muslea et al.) on real information providers, both approaches have obvious limitations, for example in treating disjunction (A or B) and “list of lists” or “nested lists” cases.
In addition to the limitations of the above approaches, two main problems affect automatic wrapper grammar generation. The first problem is the ambiguous markup of response pages by some Web-based information providers, which makes automatic wrapper grammar generation difficult at best and in some cases impossible. For example, a Web provider reports a list of answers, with each item containing two string-value attributes t1 and t2, with t2 being optional. Additionally, the provider may use the same format for both attributes. That is, the HTML file structure of a response page from this Web provider looks as follows: (<i>String (t1)</i>(<i>String (t2)</i>)?)+. Assume that the wrapper grammar has been generated, that it has correctly guessed this format, and that it receives the following response from the provider: <i>string1</i><i>string2</i>. While the wrapper grammar uniquely assigns string1 to attribute t1, recognizing string2 is ambiguous; the wrapper grammar may assign string2 to either attribute t1 or attribute t2 (nondeterministic choice). Clearly, such behavior is unacceptable for correct attribute extraction, and t2 should therefore be excluded during the wrapper grammar generation.
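The ambiguity can be checked mechanically. The Python sketch below enumerates every labeling of a response that is consistent with a grammar in which each item is a t1 string optionally followed by a t2 string, both wrapped in the same <i> tag. The function names tokenize and parses are assumptions made for illustration; the sketch is not part of the described method.

```python
import re


def tokenize(response):
    """Return the strings wrapped in <i>...</i>, in order."""
    return re.findall(r"<i>(.*?)</i>", response)


def parses(values):
    """Yield every attribute labeling of values that is consistent with
    a list of items, each item being t1 optionally followed by t2."""

    def go(i, labels):
        if i == len(values):
            yield labels
            return
        # Option 1: the next item consists of t1 alone.
        yield from go(i + 1, labels + ["t1"])
        # Option 2: the next item is t1 followed by the optional t2.
        if i + 1 < len(values):
            yield from go(i + 2, labels + ["t1", "t2"])

    if values:  # the grammar requires at least one item
        yield from go(0, [])


response = "<i>string1</i><i>string2</i>"
for labeling in parses(tokenize(response)):
    print(labeling)
# prints ['t1', 't1'] and then ['t1', 't2']
```

Two labelings survive for the same response, so string2 cannot be assigned deterministically; string1 is labeled t1 in both.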
The second main problem in automatically generated wrapper grammars is over-generalization. For example, a grammar like (<HTML>|</HTML>|<body>|</body>| . . . |String)* will accept any HTML file, but it is incapable of properly assigning tokens (or specific values) to the defined user attributes (Title, Author, etc.). Over-generalization originates from a grammar inference mechanism that detects common fragments in the sample input strings and generalizes them by merging them into a single attribute. In other words, over-generalization results from inadequate or missing control over merges, which produces a grammar so general that it extracts more than what correctly belongs to an attribute.
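The contrast between an over-general grammar and a usable one can be sketched with two regular expressions in Python. Both patterns, the page layout and the attribute names are assumptions made for illustration only.

```python
import re

# Over-general grammar: any sequence of tags and text characters.
# (String is matched one character at a time to keep the pattern free
# of nested quantifiers.)
OVERGENERAL = re.compile(r"(?:</?\w+>|[^<>])*")

# A specific grammar for one hypothetical page layout, with one named
# group per user attribute.
SPECIFIC = re.compile(
    r"<html><body><b>(?P<author>[^<>]+)</b>"
    r"<i>(?P<title>[^<>]+)</i></body></html>"
)


def accepts(pattern, text):
    """True if the whole text is accepted by the pattern."""
    return pattern.fullmatch(text) is not None


def attributes(pattern, text):
    """Attribute assignments produced by the pattern, or None if rejected."""
    m = pattern.fullmatch(text)
    return m.groupdict() if m else None


page = "<html><body><b>J. Smith</b><i>Wrapping the Web</i></body></html>"
unrelated = "<body><i>anything</i></html>at all"

print(accepts(OVERGENERAL, page), accepts(OVERGENERAL, unrelated))
print(attributes(OVERGENERAL, page))
print(attributes(SPECIFIC, page))
```

The over-general pattern accepts both inputs but yields an empty attribute dictionary, whereas the specific pattern rejects the unrelated input and assigns Author and Title on the matching page.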
There is a need for a method of automatic wrapper grammar generation that provides unambiguous attribute assignment. There is a further need for a method of automatic wrapper grammar generation w
