System and method for automatically gathering dynamic...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C707S793000, C709S217000, C709S218000, C715S252000

active

06665658

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is related to the area of Internet search technologies and resource gathering using web crawling techniques, and in particular to a method and apparatus for automatically gathering dynamic content and resources on the world wide web by simulating user interaction and managing session information.
2. Description of Related Art
In the early days of the Internet, most web sites served static pages and content. These pages are typically written in HTML (Hypertext Markup Language), and their contents do not change unless modified by the site administrator or provider. Internet search providers use standard web crawling techniques to collect static data from these websites and to summarize and index the data for their search facilities. The trend today is toward dynamically created web pages using scripting technologies on the server side (e.g. Active Server Pages, CGI, etc.). Database content is made available through web gateways. Web gateways process information requests and return the requested page or document to the user. Standard web crawling techniques are not sufficient to gather dynamic content.
Some websites generate dynamic content and require user input/interaction to access the data. These sites are typically shopping or password-protected sites providing personalization features based on specific user input. In order to keep track of user preferences, personal data, and passwords, these sites issue “cookies” to store status information. A “cookie” is data that is stored on a user's machine and is read by the server that sets it. The server reads the cookie when the user returns to a site, and the site is then personalized with a greeting such as “Welcome Back John Doe”. This user will not be able to navigate the site unless that cookie is read from their machine.
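The cookie exchange described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the class name SessionState, its methods, and the sample cookie values are hypothetical, and only the standard-library http.cookies module is used to parse the headers.

```python
from http.cookies import SimpleCookie


class SessionState:
    """Hypothetical holder for cookies a server has set during one session."""

    def __init__(self):
        # Maps cookie name -> cookie value, in the order the server set them.
        self.jar = {}

    def accept(self, set_cookie_header):
        """Store the cookie carried by one Set-Cookie header value."""
        cookie = SimpleCookie()
        cookie.load(set_cookie_header)
        for name, morsel in cookie.items():
            self.jar[name] = morsel.value

    def cookie_header(self):
        """Build the Cookie header value to send back on the next request."""
        return "; ".join(f"{name}={value}" for name, value in self.jar.items())


# Sample values are made up for illustration.
state = SessionState()
state.accept("session-id=abc123; Path=/")
state.accept("user=JohnDoe; Path=/")
header = state.cookie_header()
print(header)
```

A crawler that replays this header on each later request presents the same session to the server that a returning browser would.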
The main problem is that these dynamic web sites provide valuable content and information that cannot be gathered and indexed automatically using existing technologies. However, it would be very valuable if this data were available and indexed for other meta search engines to search. For example, consider a database of books found at the website of “AMAZON.COM”® (http://www.amazon.com). This database contains data on millions of books, which may include the name of the book, the author, as well as an abstract or summary of the book. But more importantly, the database also contains reviews about these books, written by people who actually read the book. This site makes extensive use of personalization features and cookies, which we can describe as an interactive behavior containing session information. When a user or client visits the “AMAZON.COM” site, the “AMAZON.COM” server tries to set a “cookie”, which has to be accepted by the client. Many web browsers have built-in functionality that handles this and asks the user whether to accept or reject the cookie request. The standard web crawler is not able to systematically crawl the site and replicate the database because of the need for user interaction. There is no mechanism to simulate the user's behavior, or interaction, during a typical search session.
There are many more databases of books, such as “BarnesAndNoble.com” and “FatBrain.com.” Essentially, the basic book data they keep is similar; however, any additional information they provide may vary and could provide useful insights to one seeking information on a particular book. Thus, it would be of great benefit for a web browser or crawler to be able to navigate these sites, among others, and automatically retrieve and process the content and information available.
In another example, a domain specific search engine like “jCentral” from IBM (http://www.ibm.com/developer/ibm), which is focused on the programming language “Java”, might be interested in providing a search feature for books about “Java.” It would therefore be a benefit for software developers if “jCentral” could create an index of the data on “Java” which is stored on “AMAZON.COM”, and provide a domain specific search for interested “Java” developers. In order for “jCentral” to be able to perform such a search on a website such as “AMAZON.COM”, it is necessary for “jCentral” to be able to navigate and interact with the dynamic website. However, standard web crawling techniques cannot automatically simulate the user interaction required to navigate the sites and retrieve the desired information and content from the website.
Bearing in mind the problems and deficiencies of the prior art, it is therefore an object of the present invention to provide an apparatus and method to automatically simulate user interaction with a dynamic website.
It is another object of the present invention to provide an apparatus and method for a webcrawler to automatically simulate interactive behavior of a user in order to search and query dynamic websites.
A further object of the invention is to provide an apparatus and method for a webcrawler to automatically simulate interactive behavior of a user in order to gather and extract information from a dynamic website.
Still other objects and advantages of the invention will in part be obvious and will in part be apparent from the specification.
SUMMARY OF THE INVENTION
The above and other objects and advantages, which will be apparent to one of skill in the art, are achieved in the present invention which is directed to, in a first aspect, an automated method of gathering dynamic content and resources on the world wide web by simulating user interaction and managing session information. The method comprises the steps of identifying at least one uniform resource locator (“URL”), a document type definition (“DTD”) for the URL and at least one search topic to be searched on the URL. The URL is queried with the URL, DTD and at least one search topic and the results are returned. In the preferred embodiment, after retrieving at least one result of the query, it is determined if there is another search topic to search the URL with. If so, another query of the URL is performed with the additional search topic, and the results are returned. In the preferred embodiment, these steps are repeated until all search topics have been searched on the site.
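The repeated querying of one site with each search topic can be sketched as a simple loop. This is a hedged illustration, not the claimed implementation: the function gather, the stand-in fake_query, the URL and the file name books.dtd are all hypothetical, and the query step is passed in as a callable so that no network access is assumed.

```python
from typing import Callable, Dict, List


def gather(url: str, dtd: str, topics: List[str],
           query: Callable[[str, str, str], List[str]]) -> Dict[str, List[str]]:
    """Query one site once per search topic and collect the results by topic."""
    results: Dict[str, List[str]] = {}
    for topic in topics:
        # One query per topic; the loop ends when every topic has been searched.
        results[topic] = query(url, dtd, topic)
    return results


def fake_query(url: str, dtd: str, topic: str) -> List[str]:
    """Stand-in for a real HTTP query; returns one made-up result per topic."""
    return [f"{topic} result from {url}"]


gathered = gather("http://books.example.com", "books.dtd", ["Java", "XML"], fake_query)
print(gathered)
```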
In the preferred embodiment, after the step of identifying at least one search topic to be searched, a query template is formed using the URL, DTD and search topic to complete a search query string. The search query string is adapted to be submitted to the URL to perform a hypertext transfer protocol request.
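One way to picture the completion of a search query string is shown below. The sketch assumes the query template can be represented as a mapping of form-field names to values with a "{topic}" placeholder; that representation, the function build_query_url, the field names and the example URL are assumptions made for illustration, not details given in the specification.

```python
from urllib.parse import urlencode


def build_query_url(base_url, template, topic):
    """Fill the topic into a hypothetical query template and encode it for an HTTP GET."""
    params = {name: value.replace("{topic}", topic) for name, value in template.items()}
    return f"{base_url}?{urlencode(params)}"


# Field names and URL are invented; a real template would be derived from the site.
template = {"field-keywords": "{topic}", "mode": "books"}
query_url = build_query_url("http://books.example.com/search", template, "Java programming")
print(query_url)
```

The resulting string can then be submitted as the target of a hypertext transfer protocol request.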
After the step of retrieving at least one search result, it is also preferred to determine if additional search results are available, and if so, to perform a page navigation to retrieve the additional search results. This page navigation may be repeated until all search results have been retrieved.
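The page navigation step can be sketched as a loop that follows a "next page" link until none remains. The function collect_all_results, the Page tuple shape, the max_pages safety limit and the in-memory FAKE_SITE are hypothetical; a real crawler would fetch and parse each page over HTTP.

```python
from typing import Callable, List, Optional, Tuple

# (results found on this page, link to the next page or None if this is the last page)
Page = Tuple[List[str], Optional[str]]


def collect_all_results(first_url: str, fetch_page: Callable[[str], Page],
                        max_pages: int = 100) -> List[str]:
    """Follow next-page links until no further results are available."""
    results: List[str] = []
    url: Optional[str] = first_url
    pages = 0
    # max_pages guards against a site whose next-page links form a cycle.
    while url is not None and pages < max_pages:
        items, url = fetch_page(url)
        results.extend(items)
        pages += 1
    return results


# In-memory stand-in for a paginated result listing.
FAKE_SITE = {
    "/search?page=1": (["Book A", "Book B"], "/search?page=2"),
    "/search?page=2": (["Book C"], None),
}

all_results = collect_all_results("/search?page=1", FAKE_SITE.__getitem__)
print(all_results)
```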
In another aspect, the present invention is directed to an article of manufacture comprising a computer usable medium having computer readable program code means for automatically gathering dynamic content and resources on the world wide web by simulating user interaction and managing session information. The computer readable program code means in the article of manufacture comprises computer readable program code means to identify a URL for a website to be queried, computer readable program code means to identify a data type definition for the URL, computer readable program code means to identify at least one search topic to be searched on the URL, and computer readable program code means to query the URL with the DTD and at least one search topic, and computer readable program code means to retrieve the results of the query.
In the preferred embodiment, the article further comprises computer readable program code means to determine if the URL is to be searched with additional search topics and computer readable program code means to perform additional queries of the URL until all topics have been searched, and computer readable program c
