Integrated retrieval scheme for retrieving semi-structured...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C707S793000

active

06424980

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a retrieval technique for an open network environment that involves a plurality of semi-structured documents and search engines. In particular, the present invention relates to an integrated retrieval scheme that manages location data, document structure data, item data, presentation style data, etc., to provide a unified interface for retrieving required information item by item from a plurality of semi-structured documents, irrespective of differences among their locations, document structures, elements, and search engine input forms.
2. Description of the Prior Art
Increasing performance and decreasing cost of personal computers, improvements in network technology, and the growth of inexpensive network providers are vitalizing open networks, in particular the Internet. Many information providers employ HTML (hypertext markup language), a hypertext description language that makes content creation easy, to transmit various kinds of information to users through open networks. The number of information providers is increasing due to an explosive increase in information consumers. As a result, various kinds of information accumulate in the networks, and each consumer must be efficiently provided with the necessary information from among the accumulated pieces of information.
Consumers want to retrieve desired information collectively from across information sources. This is rarely possible, because information accumulated in open networks is mostly held in HTML documents that have mutually different structures, presentation styles, or search formats, which makes it difficult to retrieve desired information from across different information sources.
Information retrieval apparatuses, so-called search engines, are widely used to retrieve HTML documents scattered over the network. Here, “search engine” is a generic term for a system that retrieves certain information through an input form.
FIG. 1 shows an information retrieval technique according to the prior art using a URL search engine. A URL search engine is a search engine that returns URLs as the search result for a query consisting of keywords or conditional terms. For example, suppose a user is interested in “a PC of 100,000 yen or below.” The user enters keywords into a URL search engine.
FIG. 2 shows an example of a URL search engine according to the prior art. The URL search engine 900 has a keyword index 910 that contains keywords and locations, i.e., URLs of HTML documents spread over networks. The keyword index 910 is registered in advance. A search processor 930 searches the keyword index 910 for the keywords entered by the user and returns a list of URLs and outlines. Each URL indicates the location of an HTML document that contains the entered keywords or their synonyms. Returning to FIG. 1, the user accesses the returned HTML documents one by one to find the necessary information. In this way, when the location of the relevant HTML documents is unknown, the user must first find the locations of HTML documents that may contain the necessary information by a wide document search, and then inspect each HTML document in the obtained URL list for that information. The user must therefore spend much time and labor before obtaining the necessary information. In addition, the prior arts are incapable of collectively retrieving information from across a plurality of HTML documents.
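The lookup performed by a URL search engine of this kind can be illustrated with a minimal sketch. This is not the engine of FIG. 2 itself: the function names, the sample URLs, and the whitespace tokenization are assumptions made for illustration, and synonym handling is omitted.

```python
from collections import defaultdict


def build_keyword_index(documents):
    """Map each lowercase word to the set of URLs whose text contains it.

    `documents` is a hypothetical dict of URL -> plain text, standing in
    for the keyword index that is registered in advance.
    """
    index = defaultdict(set)
    for url, text in documents.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, keywords):
    """Return the sorted URLs of documents containing every keyword."""
    hits = [index.get(keyword.lower(), set()) for keyword in keywords]
    if not hits:
        return []
    return sorted(set.intersection(*hits))


# Hypothetical sample documents (assumed, not from the source).
documents = {
    "http://a.example/pc.html": "PC desktop 98,000 yen",
    "http://b.example/news.html": "PC market news",
}
index = build_keyword_index(documents)
urls = search(index, ["PC"])
```

The result is only a list of locations. A condition such as “100,000 yen or below” cannot be evaluated by the index, so the user still has to open each returned document and look for the price.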
The prior arts can find the locations of HTML documents that contain given keywords and their synonyms, but they are unable to collect information item by item by collectively retrieving the information contained in HTML documents. The prior arts are also unable to set conditions on search results. For example, they are unable to filter search results by date. Moreover, when using URL search engines, each of which provides its own input form as the search interface, users must take the individual form input interface of each URL search engine into account and access the URL search engines one by one.
More particularly, HTML documents employed by on-line shops in electronic commerce frequently present product information such as names and prices as a list description in table or clause style, in which each entry forms one meaningful cluster of data. There are demands to retrieve information collectively from among these HTML documents of on-line shops. For example, a user may want to retrieve information about shops that offer the lowest price for a specific product. In this case, the user enters the name, maker, category, etc., of the product as keywords. Then, the prior art of FIG. 1 provides the user with the locations of HTML documents related to the keywords. The user accesses the HTML documents one by one to check whether they offer the product under preferable conditions. The prior art of FIG. 1, however, searches the full text of each HTML document for the entered keywords without considering the elements that form the HTML document, and therefore tends to retrieve a lot of data that is irrelevant to the user. Accordingly, the user must spend much time and labor to find the necessary information among the HTML documents retrieved by the prior art.
The prior arts are incapable of retrieving information from a given HTML document item by item. For example, they are unable to extract the price, image, maker, etc., of a given product from an HTML document containing a product information table. They are unable to extract the name, phone number, address, etc., of each shop from an HTML document containing shop information in clause style. They are also unable to set conditions, such as a date, to filter the results retrieved from HTML documents.
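To make “item by item” concrete, the following sketch extracts records from a product table and then applies a condition to one item. It is an illustration only and not the scheme of the present invention: the class name, the sample table, and the assumption that the first table row holds the item names are all hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_items(html):
    """Return one dict per data row, keyed by the header row (assumed first)."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical product table (assumed, not from the source).
SAMPLE = """
<table>
<tr><th>name</th><th>maker</th><th>price</th></tr>
<tr><td>PC-A</td><td>MakerX</td><td>98000</td></tr>
<tr><td>PC-B</td><td>MakerY</td><td>128000</td></tr>
</table>
"""
items = extract_items(SAMPLE)
# A condition on a single item: products priced at 100,000 yen or below.
affordable = [item for item in items if int(item["price"]) <= 100000]
```

Because each row becomes a record with named items, a condition such as a price limit can be checked directly, which a full-text keyword search cannot do. The sketch depends on one particular table layout, so a shop that structures its page differently would need different handling.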
There is a conventional technique that creates a hypothetical database by mapping the internal structure of each document and the relationships between documents into unique models, in order to extract itemized pieces of information. This technique was disclosed by N. Ashish and C. A. Knoblock in “Semi-automatic wrapper generation for internet information sources,” Proceedings of Cooperative Information Systems, 1997. This technique treats portions of an HTML document that carry specific tags, such as the TITLE tag, or specific styling, such as size, color, and typestyle (e.g., bold and italic), as meaningful information, and extracts this information automatically. This technique covers the case in which one minimum cluster of information is described in one HTML document and a plurality of such HTML documents are written in the same format. It is effective, for example, when weather information for each region is described in a separate HTML document. However, this technique does not take into account the case in which information is described as a list description, such as a table or clauses, within one HTML document. Accordingly, this technique cannot be applied to that case.
J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo disclosed another technique in “Extracting semistructured information from the web,” Workshop on Management of Semistructured Data, 1997. This technique creates a hypothetical database by employing a unique OEM data model and manages the relationship between the database and various information sources, thereby retrieving information from heterogeneous web sources in an integrated manner. To manage this relationship, the technique employs a template file that depends on the HTML tag description rules of each HTML document. In this technique, however, a modification to an HTML document affects the hypothetical database, and a modification to the hypothetical database in turn affects the application. Accordingly, this technique requires much labor for the management and maintenance of the system.
There are no standards for the HTML descriptions used to provide information such as the products handled by on-line shops. Namely, each on-line shop uses its own style of HTML document. This will be explained.
HTML documents prepared by on-line shops have different document structures. For example, a shop A empl
