Reexamination Certificate
1997-12-02
2001-10-16
Alam, Hosain T. (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C707S793000
active
06304870
ABSTRACT:
1. FIELD OF THE INVENTION
This invention relates to retrieval of textual information. The information may be available in files stored locally or files that are accessed over public or private networks. A specific application of this invention is in providing assistance in accessing on-line electronic stores by automatically retrieving product descriptions in response to a user product query. Numerous other applications will be apparent.
2. BACKGROUND
The exponential growth of data collections, private intranets and the public Internet has produced a daunting labyrinth of increasingly numerous information sources. Searching these sources is often a chore. For example, almost any type of product is now available somewhere on a communication network, but most users cannot find what they seek, and even expert users waste copious time and effort searching for appropriate on-line stores or other product information sources.
One problem is simply that the number of available sources has grown beyond the comprehension of a single user. A second problem, accompanying this growth in available information, is a commensurate growth in software utilities and methods to manage, access, and present this information. Each utility has a different and often unique interface and set of commands and capabilities, and is appropriate for a different set of users and a different set of information types and sources. Thus, the sheer diversity of available utilities creates problems for users comparable to those created by the information explosion. Users are now faced with the twin problems of which tool to use to inquire at which information source.
In the past efforts have been made to provide users with automatic, computer assisted services that can help solve these twin problems of the network revolution. For example, AI researchers have created several prototype software agents that help users with e-mail and netnews filtering (Pattie Maes et al., 1993, Learning interface agents, Proceedings of AAAI-93), agents that assist with World Wide Web browsing (H. Lieberman, 1995, Letizia: An agent that assists web browsing, Proc. 15th Int. Joint Conf. on A.I., pp. 924-929; Robert Armstrong et al., 1992, Webwatcher: A learning apprentice for the world wide web, Working Notes of the AAAI Spring Symposium: Information Gathering from Heterogeneous, Distributed Environments, pp. 6-12, Stanford University, AAAI Press), agents that schedule meetings (Lisa Dent et al., 1992, A personal learning apprentice, Proc. 10th Nat. Conf. on A.I., pp. 96-103; Pattie Maes, 1994, Agents that reduce work and information overload, Comm. of the ACM 37(7):31-40, 146; Tom Mitchell et al., 1994, Experience with a learning personal assistant, Comm. of the ACM 37(7):81-91), and agents that perform internet-related tasks (O. Etzioni et al., 1994, A softbot-based interface to the internet, CACM 37(7):72-75).
Increasingly, the information such agents need to access is available on the World Wide Web. Unfortunately, even a domain as standardized as the WWW has turned out to pose significant problems for automatic software agents. For one, although Web pages are universally written in Hypertext Markup Language (“HTML”), this language merely defines the format of information display, making no attempt to hint at its meaning or semantic content. Currently, no accepted “semantic markup language” for the Web exists, nor is one likely to be adopted universally. The Internet can be expected to pose even greater problems.
Thus, the advent of intranets, the Internet, and the World Wide Web has posed several fundamental problems for the automatic services or agents designed to assist users to find relevant information. First, no one such service has heretofore provided sufficient additional value to replace the use of a Web browser having access to existing on-line directories or indices such as Yahoo or Lycos. Second, such services have not yet been able to understand and competently parse relevant information from the responses returned from a wide variety of Internet and Web on-line information sources. Third, existing services and agents have not been easy to adapt to the ever-increasing numbers of sources with their ever-changing response formats. This is due to the individualized, hand-coded interface to each Internet service and Web site utilized by existing agents (Yigal Arens et al., 1993, Retrieving and integrating data from multiple information sources, International Journal on Intelligent and Cooperative Information Systems 2(2):127-158; O. Etzioni et al., 1994, A softbot-based interface to the internet, CACM 37(7):72-75; B. Krulwich, 1995, Bargain finder agent prototype, Technical report, Andersen Consulting; Alon Y. Levy et al., 1995, Data model and query evaluation in global information systems, Journal of Intelligent Information Systems, Special Issue on Networked Information Discovery and Retrieval 5(2); Mike Perkowitz et al., 1995, Category translation: Learning to understand information on the internet, Proc. 15th Int. Joint Conf. on A.I.). Preferably, a service or agent should be able to access a new or changed Internet on-line source in order to automatically learn how to retrieve relevant information from the source.
3. SUMMARY OF THE INVENTION
Many Internet sites display their content as a table, with rows indicating various objects, and columns indicating various attributes about the objects. For instance, an on-line store's catalog might display a table containing one row for each product, with three columns: Description, Price, and Manufacturer. At most sites, these tables are displayed using formatting commands such as HTML tags. In addition to such a table, a site's pages usually contain extraneous text such as formatting commands, advertisements or hyper-links to the rest of the store. These pages may be thought of as semi-structured: they are more organized than free-text but not as organized as a database.
For a computer program to use such information, these tables must be identified and their information extracted, while extraneous text is ignored. For instance, to allow automatic shopping at an on-line store, a computer program must extract a page's <Description, Price, Manufacturer> triples, while ignoring any formatting commands or advertisements that might appear.
This invention is concerned with automating such an information extraction process.
It is often straightforward to write a computer program—which is called a wrapper—to perform this process for a particular site. Writing wrappers is straightforward because most sites' pages are generated automatically from a database, and so the pages have a consistent structure. For instance, at a particular store, the Price attributes might always be formatted as:
. . . <TD>$20.95</TD>. . .
One can therefore write a wrapper which scans the pages for occurrences of the delimiter string “<TD>$” (which indicates the start of a Price) and then scans for the delimiter “</TD>” (which indicates the end of the Price). Applying this procedure to the above text fragment would extract the Price “20.95”. A page's entire content can be extracted by applying similar procedures for the Description and Manufacturer attributes. (As discussed below, this description is highly simplified; real wrappers are more complicated. However, the basic idea, scanning the page for specific delimiters, works for numerous actual sites.)
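The delimiter scan just described can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the wrapper construction procedure of the invention: the function name extract_all and the sample page are hypothetical, and only the Price delimiters “<TD>$” and “</TD>” are taken from the text.

```python
from typing import List


def extract_all(page: str, left: str, right: str) -> List[str]:
    """Return every substring of `page` that lies between `left` and `right`.

    Hypothetical helper: scans left to right, looking first for the
    start delimiter and then for the end delimiter that follows it.
    """
    values: List[str] = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break  # no further start delimiter on the page
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break  # start delimiter without a matching end delimiter
        values.append(page[start:end])
        pos = end + len(right)
    return values


# Assumed sample page; the row layout is invented for illustration.
PAGE = (
    "<TR><TD>Widget</TD><TD>$20.95</TD></TR>"
    "<TR><TD>Gadget</TD><TD>$5.00</TD></TR>"
)

prices = extract_all(PAGE, "<TD>$", "</TD>")
print(prices)  # ['20.95', '5.00']
```

The same helper could be called with other delimiter pairs for the Description and Manufacturer attributes, provided a site's pages mark those attributes as consistently as the Price.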
While it is straightforward to write a wrapper for any particular information source, commercial information searching systems such as shopping systems might access hundreds of resources. Writing wrappers in such a setting is tedious and error-prone.
This invention comprises a technique for automatically constructing wrappers for performing information-extraction from sites such as Internet resources that display relevant information, interspersed with extraneous text fragments, such as HTML formatting commands or advertisements. The invention provides a system for learning
Doorenbos Robert B.
Kushmerick Nicholas
Weld Daniel S.
Alam Hosain T.
Corrielus Jean M.
Pennie & Edmonds LLP
The Board of Regents of the University of Washington, Office of
Method and apparatus of automatically generating a procedure...