Automated data extraction and reformatting
United States Patent 6,732,102 (Reexamination Certificate, active)
Filed: 2000-11-16; Issued: 2004-05-04
Examiner: Breene, John (Department 2177)
Classification: Data processing: database and file management or data structures – Database design – Data structure types
U.S. Classes: C715S252000, C345S215000
FIELD OF THE INVENTION
The present invention relates to a method and system for automated browsing and data extraction of data found at global communication network sites such as Web sites that include HTML or XML data.
BACKGROUND OF THE INVENTION
The Internet is becoming the de facto network connecting people and computers to each other because of its truly global reach and its free nature. HTML (HyperText Markup Language) is the widely accepted standard for human interaction with the Internet, and particularly the World Wide Web (the “Web”). HTML, in conjunction with a browser, allows people to communicate with other people or computers at the other end of the network.
The Internet can also be viewed as a global database. A large amount of valuable business information is present on the Internet as HTML pages. However, HTML pages are meant for human eyes, not for computers to read, which poses serious limitations on how that information can be used in an automated manner.
HTML Web pages are built as HTML tags nested within other tags, in effect forming a “tree”. A browser interprets the hierarchy and type of the tags and renders a visual picture of the HTML for the user to view. HTML data-capture technology currently available follows a paradigm of “design” and “run” modes. In design mode, a user (e.g., a designer) uses software to locate Web sites and extract data from those sites by way of an “example”. The software program saves the example data, and in “run” mode it automatically repeats the example on new data. However, most Web pages can, and do, change as frequently and as much as their Webmaster desires, sometimes changing the tree hierarchy completely between design time and run time. As a result, reliable extraction of data, including business data, from an HTML page becomes a challenging task.
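To make the fragility concrete, the following is a minimal sketch, in Python, of the design/run paradigm just described: at design time the software remembers the tree path to an example value, and at run time it replays that path against a fresh copy of the page. The sample page, the helper names and the use of xml.etree.ElementTree (which requires a well-formed page fragment) are illustrative assumptions, not taken from any product discussed here.

    import xml.etree.ElementTree as ET

    DESIGN_TIME_PAGE = """
    <html><body>
      <table>
        <tr><td>Ticker</td><td>Price</td></tr>
        <tr><td>ACME</td><td>85</td></tr>
      </table>
    </body></html>
    """

    def record_path(root, target_text):
        """Design mode: depth-first search for the example value; remember
        the child-index path from the root down to it."""
        def walk(node, path):
            if (node.text or "").strip() == target_text:
                return path
            for i, child in enumerate(node):
                found = walk(child, path + [i])
                if found is not None:
                    return found
            return None
        return walk(root, [])

    def replay_path(root, path):
        """Run mode: follow the remembered child indices and read the text."""
        node = root
        for i in path:
            node = node[i]
        return (node.text or "").strip()

    design_root = ET.fromstring(DESIGN_TIME_PAGE)
    path = record_path(design_root, "85")        # [0, 0, 1, 1]
    print(replay_path(design_root, path))        # "85" -- same layout, works

    # The Webmaster inserts one banner row; the remembered path now lands
    # on the wrong cell, so run mode silently extracts the wrong value:
    RUN_TIME_PAGE = DESIGN_TIME_PAGE.replace(
        "<table>", "<table><tr><td>Banner ad</td><td>--</td></tr>")
    print(replay_path(ET.fromstring(RUN_TIME_PAGE), path))  # "Price", not "85"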
There are certain known methods for extracting this information. For example, OnDisplay Inc. of San Ramon, Calif. has a “CenterStage eContent” product that can access, integrate and transform data from multiple HTML pages. OnDisplay's HTML data recognition algorithm works by remembering the depth and location of the required business information within the HTML “tree” between the design and run modes.
As another example, Neptunet Inc. of San Jose, Calif., provides a system in which, after the Web data is retrieved, all further processing of that data must be programmatically specified. Like OnDisplay's, Neptunet's HTML data recognition algorithm works by remembering the depth and location of the required business information within the HTML “tree” between the design and run modes.
Other HTML data capture mechanisms perform extraction by specifying (i.e., hard coding) the exact HTML tag number of the data to be extracted, using a programming language such as Visual Basic or Visual C++. The drawback of these methods is that the slightest change in the appearance of the Web page forces the program itself to be changed, making them impractical for reliable data processing solutions.
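The same brittleness can be shown even more directly. The paragraph above mentions Visual Basic or Visual C++; the hypothetical page and hard-coded index below are sketched in Python instead, purely for illustration:

    import xml.etree.ElementTree as ET

    page = ("<html><body><table><tr><td>Ticker</td><td>ACME</td>"
            "<td>Price</td><td>85</td></tr></table></body></html>")

    # "Hard coding the exact HTML tag number": take the 4th <td> on the page.
    cells = [td.text for td in ET.fromstring(page).iter("td")]
    print(cells[3])          # "85" -- correct, until the layout changes

    # One cosmetic cell added by the Webmaster shifts every later index:
    page2 = page.replace("<td>Ticker</td>", "<td>Logo</td><td>Ticker</td>")
    print([td.text for td in ET.fromstring(page2).iter("td")][3])  # "Price"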
HTML is a very useful information presentation protocol. It allows visually pleasing formatting and colors to be set for data being presented to make it more understandable. For example, a stock price change can be shown in green color if the stock is going up and in red if the stock is going down, making the change visually and intuitively more understandable.
But more and more, the Internet is also being used for machine-to-machine (i.e., computer-to-computer) communication. While HTML is a wonderful mechanism for human interaction, it is not ideally suited for computer-to-computer communication. Its main disadvantage for this purpose is that there is no way for the data being sent to be described in terms of “what” it is supposed to represent. For example, a number “85” appearing on a Web stock trading screen in the browser may be the stock price or the share quantity. The data simply gets shown in the browser, and it is the human being looking at the browser who knows which numbers mean what, from the casual context information shown around the data. But in machine-to-machine communication, the receiving computer lacks that context-resolving intelligence and has to be told very specifically that the number “85” is the stock price and not the share quantity.
The need for a correct and specific understanding of the data at the receiving computer's end has conventionally been satisfied via EDI (Electronic Data Interchange), in which the sending and receiving computers must be synchronized to agree on the sequence, length and format of the data elements that can be sent as a complete message. This mechanism, while it works, is cumbersome because of the prior agreement required between the two computers, and hence can be used effectively only in a network of relatively few computers in communication with one another. It does not work in an extremely large network like the Internet.
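As a rough illustration of the prior agreement EDI demands, both computers must share an out-of-band record layout before a single message can be decoded. The fixed-width layout and field names below are hypothetical:

    # Layout both ends must agree on in advance (positions are arbitrary here):
    # cols 0-5 ticker, cols 6-13 price, cols 14-19 quantity.
    LAYOUT = [("ticker", 0, 6), ("price", 6, 14), ("quantity", 14, 20)]

    def decode(message):
        """Receiver side: the bytes are meaningless without LAYOUT."""
        return {name: message[start:end].strip() for name, start, end in LAYOUT}

    print(decode("ACME     85.00   100"))
    # {'ticker': 'ACME', 'price': '85.00', 'quantity': '100'}

Any change in sequence or length on one side breaks the other, which is why EDI works only among relatively few, tightly coordinated computers.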
This void of clear data definition in a large network is being filled today by a newer Internet protocol called XML (Extensible Markup Language). XML provides a clean solution for specifying explicitly and unambiguously what each value reaching the receiving computer is supposed to be. XML has a feature called “tags”, which travel with the data and describe what the data is supposed to be. For example, the stock price can be sent in an XML stream as:
<StockPrice> 85 </StockPrice>
The “/” in the second tag signifies that the data description for that data element is complete. Other tag pairs may follow, describing and giving the values of other data elements. This allows computer-to-computer data exchange without a prior agreement between the computers about how the data is formatted or sequenced. Additionally, XML can express relationships between pieces of data using a “tree”, or hierarchical, structure.
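A receiving program can then read the meaning directly from the tags rather than from position. A minimal Python sketch, using the stock-price example above (the enclosing <Quote> element and the ShareQuantity tag are illustrative additions):

    import xml.etree.ElementTree as ET

    stream = ("<Quote><StockPrice>85</StockPrice>"
              "<ShareQuantity>100</ShareQuantity></Quote>")

    quote = ET.fromstring(stream)
    print(quote.findtext("StockPrice"))     # "85"  -- unambiguously the price
    print(quote.findtext("ShareQuantity"))  # "100" -- unambiguously the quantity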
But XML has its own unique problems. While useful as a data definition mechanism, XML tree structures cannot be fed directly to existing data manipulation mechanisms that operate on relational (tabular) data formats using well-known languages like SQL.
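The gap described here is the classic tree-to-table mismatch: before SQL-style tools can operate on XML, its hierarchy must be flattened into rows. A hedged sketch of one such flattening follows; the document structure and column names are assumptions:

    import sqlite3
    import xml.etree.ElementTree as ET

    xml_doc = """
    <Portfolio owner="alice">
      <Holding><Ticker>ACME</Ticker><Price>85</Price></Holding>
      <Holding><Ticker>XYZ</Ticker><Price>12</Price></Holding>
    </Portfolio>
    """

    # Flatten the hierarchy into (owner, ticker, price) tuples...
    root = ET.fromstring(xml_doc)
    rows = [(root.get("owner"), h.findtext("Ticker"), float(h.findtext("Price")))
            for h in root.iter("Holding")]

    # ...after which ordinary relational tools such as SQL apply:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE holdings (owner TEXT, ticker TEXT, price REAL)")
    db.executemany("INSERT INTO holdings VALUES (?, ?, ?)", rows)
    print(db.execute("SELECT ticker FROM holdings WHERE price > 50").fetchall())
    # [('ACME',)]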
It is believed that OnDisplay, Neptunet and WebMethods are companies allowing a fairly user-friendly design-time specification of XML data interchange between computers, saving the specifications and reapplying them at a later point in time to new data. Several companies offer point-and-click programming environments with varying capabilities. Some are used to generate source code in other programming languages, while others execute the language directly. Examples are Visual Flowcoder by FlowLynx, Software-through-pictures by Aonix, ProGraph by Pictorius, LabView by National Instruments and Sanscript by Northwoods Software. All of these methods lack the critical built-in ability to capture and use Web-based (HTML/XML) real-time data.
SUMMARY OF THE INVENTION
One aspect of the present invention provides a computer-implemented method for automated data extraction from a Web site. The method comprises: navigating to a Web site during a design phase; extracting data elements associated with the Web site and producing a visible display corresponding to the extracted data elements; selecting, from the data elements, and storing at least one Page ID data element in the display; selecting and storing one or more Extraction data elements in the display; selecting and storing at least one Base ID data element having an offset distance from the Extraction elements; setting a tolerance for possible deviation from the offset distance; and renavigating to the Web site during a playback phase and extracting data from the Extraction data elements if the Page ID data element is located in the Web site and if the offset distance of the Base ID data element has not changed by more than the tolerance.
Preferably, user-specific information is entered into the Web site and used in connection with producing the data to be extracted from the Extraction data elements.
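The following is a minimal sketch of one plausible reading of the claimed method, modeling a rendered page as an ordered list of text elements; a real implementation would operate on the HTML tree, and the matching details (including the nearby predicate) are assumptions for illustration only:

    def design(elements, page_id, base_id, extraction_positions, tolerance=2):
        """Design phase: store the Page ID text, the Base ID text, the offset
        (in element positions) from the Base ID to each Extraction element,
        and the tolerance allowed for later deviation of that offset."""
        base = elements.index(base_id)
        return {"page_id": page_id,
                "base_id": base_id,
                "offsets": [p - base for p in extraction_positions],
                "tolerance": tolerance}

    def playback(elements, spec, nearby):
        """Playback phase: extract only if (a) the Page ID is still on the
        page and (b) each expected element is found within `tolerance`
        positions of its recorded offset from the Base ID. `nearby` is a
        predicate for what an extracted value should look like."""
        if spec["page_id"] not in elements or spec["base_id"] not in elements:
            return None                          # wrong or unrecognizable page
        base = elements.index(spec["base_id"])
        out = []
        for off in spec["offsets"]:
            expected = base + off
            lo = max(0, expected - spec["tolerance"])
            hits = [e for e in elements[lo:expected + spec["tolerance"] + 1]
                    if nearby(e)]
            if not hits:
                return None                      # drift exceeded the tolerance
            out.append(hits[0])
        return out

    # Design against the original page...
    page_v1 = ["Acme Stock Quote", "Ticker", "ACME", "Price", "85"]
    spec = design(page_v1, page_id="Acme Stock Quote", base_id="Price",
                  extraction_positions=[4])      # position of "85"

    # ...then play back against a later, slightly rearranged page.
    page_v2 = ["Acme Stock Quote", "Banner", "Ticker", "ACME", "Price", "85"]
    print(playback(page_v2, spec, nearby=str.isdigit))   # ['85']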
Inventor: Ali, Mohammad
Assignee: InstaKnow.Com Inc.
Attorney: Lerner David Littenberg Krumholz & Mentlik LLP