Method for processing a file to generate a database

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

active

06247018

ABSTRACT:

REFERENCE TO PAPER APPENDIX
The present application includes a paper appendix attached hereto setting forth an exemplary implementation of an embodiment of the method according to the present invention. A portion of the disclosure of the present application contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
FIELD OF THE INVENTION
The present invention relates to a method for automatically processing an HTML document or an ASCII file in order to treat the data contained within the HTML document or ASCII file as a database with one or more database management system tables.
BACKGROUND INFORMATION
The Internet provides tremendous amounts of information that is made available to the public via web sites of various organizations. A web site includes one or more web pages providing information from the sponsoring organization. For example, the United States Congress makes information on pending legislation and votes available to the public on its website. As is known in the art, Hypertext Markup Language (HTML) is the authoring language used on the Internet for creating web pages. HTML includes ASCII text surrounded by HTML commands in angle brackets which can be interpreted by an Internet web browser.
While a tremendous amount of information is available to the public via the Internet, the information is generally organized and presented in a manner selected by the owner or sponsor of the website. A user of the website, however, may need to manipulate the data in a different manner. For example, a user may want to reorganize the data so that it can be manipulated using standard SQL queries. Current web browsers, such as MICROSOFT INTERNET EXPLORER or NETSCAPE NAVIGATOR, do not provide the capability to mine web pages and extract the data they contain so that it can be manipulated in the manner the user desires, free of the format imposed by the web site.
SUMMARY OF THE INVENTION
According to an exemplary embodiment of the present invention, a method for automatically processing a file, such as a web page or an ASCII file, is provided to treat the file as a database with one or more database tables. For example, the method according to an embodiment of the present invention automatically processes an HTML page or group of related HTML pages in an HTML frameset in order to identify data and treat the data contained within the HTML file(s) as a database with one or more database management system tables. In another embodiment of the method according to the present invention, an ASCII file can be processed to identify data and treat the data contained within the ASCII file as a database with one or more database management system tables.
The method according to the present invention can, for example, retrieve an HTML page or a group of related HTML pages in an HTML frameset from a user specified URL or from a disk file. If the source HTML document contains an HTML <FRAMESET> (e.g., a group of HTML pages each loaded into a separate frame in the browser), the method according to the present invention retrieves the HTML page associated with each frame and thus treats the entire frameset as a single database. The method according to an embodiment of the present invention scans each HTML page for any HTML tables and translates each HTML table into a database table in a database representation of the HTML page (e.g., as a DB2 database management system representation of the HTML page).
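The retrieval step described above can be sketched as follows. This is a minimal illustration, not the implementation from the paper appendix: the names FrameFinder, fetch_url and collect_pages are hypothetical, pages are assumed to be UTF-8, and following nested framesets recursively is an assumption the text does not state.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class FrameFinder(HTMLParser):
    """Records the SRC attribute of every <FRAME> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def fetch_url(url):
    """Download one page; decoding as UTF-8 is an assumption."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_pages(url, fetch=fetch_url, seen=None):
    """Return [(url, html), ...] for a page plus every frame it references."""
    seen = set() if seen is None else seen
    if url in seen:  # guard against framesets that reference each other
        return []
    seen.add(url)
    html = fetch(url)
    finder = FrameFinder()
    finder.feed(html)
    pages = [(url, html)]
    for src in finder.sources:
        # Frame sources are often relative, so resolve them against the parent.
        pages.extend(collect_pages(urljoin(url, src), fetch, seen))
    return pages
```

Passing a different fetch callable (for example, one that reads from disk) covers the disk-file case; all pages returned for one starting URL would then feed a single database.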
According to an embodiment of the present invention, processing is performed on each HTML table identified in an HTML page so that the HTML table can be used in a database representation. For example, if the HTML table contains an HTML <CAPTION> tag, then the caption text is used to generate the database table name. If the HTML table contains HTML <TH> tags (e.g., table header tags), then the table header text is used to generate the database table column names. If a cell in the HTML table carries a ROWSPAN or COLSPAN attribute (e.g., a label applied to multiple rows or columns as a category label), then the text value of the cell is replicated over the spanned rows or columns to create tables which are consistent with relational database tables. All HTML escape sequences are translated to their corresponding ASCII representations. Any carriage returns and/or line feeds are, for example, removed from the data in the HTML table. Also, all HTML tags are removed from the data except for the <BR> (e.g., break) tag, which is translated into, for example, a <CR> <LF> line break in the data. Leading and trailing white spaces are removed from the data in the HTML table and all internal white spaces are compressed into a single space. As a result of the processing of the data in the HTML table, the underlying data in the HTML table can be identified and extracted for inclusion in a database representing the underlying data.
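The table clean-up rules above can be illustrated with Python's standard html.parser module. This is a sketch under stated assumptions, not the method of the paper appendix: TableExtractor, clean and extract_tables are hypothetical names, the markup is assumed to be well formed with explicit closing tags and no nested tables, the first row containing <TH> cells is taken as the column names, and source carriage returns and line feeds are replaced by a space rather than deleted outright.

```python
import re
from html.parser import HTMLParser


def clean(fragments):
    """Trim each <BR>-separated piece and collapse inner whitespace runs."""
    pieces = "".join(fragments).split("\r\n")
    return "\r\n".join(" ".join(p.split()) for p in pieces)


class TableExtractor(HTMLParser):
    """Collects each <TABLE> as {'name', 'columns', 'rows'} (hypothetical layout)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # escape sequences become characters
        self.tables = []
        self._table = None      # table being read, or None outside a table
        self._text = None       # text fragments of the open cell or caption
        self._row = []
        self._row_has_th = False
        self._span = (1, 1)     # (rowspan, colspan) of the open cell
        self._pending = {}      # column -> (rows still owed, text) from ROWSPAN

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = {"name": None, "columns": [], "rows": []}
            self._pending = {}
        elif self._table is None:
            return
        elif tag == "tr":
            self._row, self._row_has_th = [], False
        elif tag in ("td", "th", "caption"):
            a = dict(attrs)
            self._text = []
            self._span = (int(a.get("rowspan") or 1), int(a.get("colspan") or 1))
            self._row_has_th |= tag == "th"
        elif tag == "br" and self._text is not None:
            self._text.append("\r\n")  # the only tag that survives, as CR LF

    def handle_data(self, data):
        if self._text is not None:
            # Source CR/LF and other whitespace runs become one space (assumption).
            self._text.append(re.sub(r"\s+", " ", data))

    def _fill_spanned(self):
        """Copy down values owed to this row by earlier ROWSPAN cells."""
        while len(self._row) in self._pending:
            col = len(self._row)
            left, text = self._pending.pop(col)
            self._row.append(text)
            if left > 1:
                self._pending[col] = (left - 1, text)

    def handle_endtag(self, tag):
        if self._table is None:
            return
        if tag == "caption" and self._text is not None:
            self._table["name"] = clean(self._text)
            self._text = None
        elif tag in ("td", "th") and self._text is not None:
            text, (rows, cols) = clean(self._text), self._span
            self._text = None
            self._fill_spanned()
            for _ in range(cols):  # replicate the value across COLSPAN columns
                col = len(self._row)
                self._row.append(text)
                if rows > 1:
                    self._pending[col] = (rows - 1, text)
        elif tag == "tr":
            self._fill_spanned()
            if self._row_has_th and not self._table["columns"]:
                self._table["columns"] = self._row
            elif self._row:
                self._table["rows"].append(self._row)
        elif tag == "table":
            self.tables.append(self._table)
            self._table = None


def extract_tables(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Calling extract_tables on a page returns one dictionary per HTML table; a table used purely for layout would typically come back with no caption and no header row.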
The method according to an embodiment of the present invention can also identify data to be translated into a database table that is contained in a web page (e.g., an HTML document) but is not contained in an HTML table. For example, in an embodiment of the method of the present invention, each HTML page is scanned for any blocks of fixed length lines. A block is defined, for example, by five or more lines of the same length that do not contain a separator line. A separator line is defined, for example, as a line containing the same repeating character (e.g., “--------------”). For each block identified by the method according to an embodiment of the present invention, the method also identifies field breaks (e.g., column breaks) in the block. For every set of, for example, five or more contiguous lines within the block for which two or more columns are identified, a database table is created.
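The block scan above can be sketched as follows. The function names are hypothetical, and two points are assumptions rather than statements from the text: a field break is taken to be any character position that is a space on every line of the block, and one table is produced per block rather than per five-line subset.

```python
def is_separator(line):
    """A separator line repeats a single character, e.g. '--------------'."""
    stripped = line.strip()
    return len(stripped) > 1 and len(set(stripped)) == 1


def find_blocks(lines, min_lines=5):
    """Group consecutive equal-length, non-separator lines into blocks."""
    blocks, run = [], []
    for line in list(lines) + [None]:  # the trailing None flushes the last run
        ok = line is not None and line != "" and not is_separator(line)
        if ok and (not run or len(line) == len(run[0])):
            run.append(line)
            continue
        if len(run) >= min_lines:
            blocks.append(run)
        run = [line] if ok else []
    return blocks


def field_breaks(block):
    """Return (start, end) slices for each field in a block of equal-length lines."""
    width = len(block[0])
    blank = [all(line[i] == " " for line in block) for i in range(width)]
    fields, start = [], None
    for i, is_blank in enumerate(blank + [True]):  # sentinel closes the last field
        if not is_blank and start is None:
            start = i
        elif is_blank and start is not None:
            fields.append((start, i))
            start = None
    return fields


def block_to_rows(block):
    """Split a block into rows of trimmed cells, or None if under two columns."""
    fields = field_breaks(block)
    if len(fields) < 2:
        return None
    return [[line[a:b].strip() for a, b in fields] for line in block]
```

A position-based break test like this will split a field whose values all happen to contain a space at the same offset, so a real implementation would likely need a stricter rule.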
Similar to the processing of HTML tables, processing is performed on the text tables so that the data contained therein can be used in a database representation. For example, all HTML escape sequences are translated to their corresponding ASCII representation. Any carriage returns and/or line feeds are removed from the data. All HTML tags are removed from the data except for <BR> tags, which are translated into a carriage return/line feed (<CR> <LF>) line break in the data. Also, leading and trailing white spaces are removed from the data and all internal white spaces are compressed to a single space. The method according to an embodiment of the present invention also provides a user-definable HeaderRows=n parameter which allows the user to specify that the top n rows of the text block are treated as column titles and are used to create the database table column names (e.g., a default behavior could be to use all of the text block as data, i.e., HeaderRows=0). Similarly, a user-definable SkipBlankRow=Yes parameter allows a user to specify whether blank rows are to be removed from the data (e.g., a default behavior could be that blank rows are skipped, i.e., SkipBlankRow=Yes). After a block has been processed, scanning resumes on the next line past the block and the process continues until the entire file has been scanned.
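The effect of the HeaderRows and SkipBlankRow parameters can be sketched as below. The function name and its Python argument names are hypothetical. The text does not say how several header rows combine into one column name or what the columns are called when HeaderRows=0, so joining the header cells with a space and returning an empty list are assumptions.

```python
def apply_text_table_options(rows, header_rows=0, skip_blank_row=True):
    """Split parsed rows into (column_names, data_rows).

    header_rows    mirrors HeaderRows=n: the top n rows become column titles.
    skip_blank_row mirrors SkipBlankRow=Yes: rows with only empty cells are dropped.
    """
    header, data = rows[:header_rows], rows[header_rows:]
    width = max((len(r) for r in rows), default=0)
    if header:
        # Assumption: stack the header cells of each column, joined by a space.
        columns = [
            " ".join(r[i] for r in header if i < len(r) and r[i])
            for i in range(width)
        ]
    else:
        columns = []  # naming of untitled columns is not specified in the text
    if skip_blank_row:
        data = [r for r in data if any(cell.strip() for cell in r)]
    return columns, data
```

With the defaults, every non-blank row is returned as data, which matches the default behavior suggested in the text.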
The entire database created by the method according to an embodiment of the present invention (e.g., based on the HTML tables and text tables contained in the source file) is then reviewed and any blank tables are removed from the database. Blank tables can occur, for example, because HTML tables are often used as a page layout mechanism and not as a columnar data delivery mechanism. All of the table names in the database are also reviewed to make sure they are unique, without regard to case. If there are conflicting names, they are modified to make them unique (e.g., if there are two tables na
