System and method for discovering schematic structure in...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C707S793000, C709S246000

active

06738767

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of Invention
The present invention generally relates to document processing. The method and apparatus of the present invention have particular application to extracting schematic information from a set of documents.
2. Discussion of Prior Art
The world wide web (throughout this specification, web, www, and world wide web are used interchangeably) is presently growing at an average of 1 million pages per day and is an amazing source of information. All of this information is buried in HTML documents authored by a wide variety of people with differing skills, cultures, and purposes, using a wide variety of authoring tools. HTML does give structure to the authored documents, but mainly for viewing purposes. HTML has a fixed set of tags, which are mostly used to enhance the visual appeal of the documents. Thus, it often happens that HTML pages written for the same purpose use different sets of tags. For instance, people often mark up their resumes in HTML on their home pages. Depending on the author's style, these resume documents look significantly different from each other. This difference is acceptable for viewing purposes, but presents great difficulties to automatic programs that try to extract pertinent information from them.
The web is no longer used just for browsing; automatic programs, such as web crawlers, visit web sites and extract information to serve search engines or push engines. Comparison shopping engines visit web sites describing similar information, such as prices, and extract semantic information from these sites. Given the format variances possible between topically related web pages, the retrieved data is often unhelpful, unrelated, or difficult to extract.
The present invention addresses this difficulty, making it possible to build search engines that allow users to formulate structural queries like “find a student with a Master's degree and a GPA of 3.5 or more and skills in Java.” The present invention allows the extraction of structural information buried in HTML pages that cater to the same topic but are authored in significantly different styles.
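As an illustration of what such a structural query could look like once resumes share a common structure, the following is a minimal Python sketch. The XML tag names (resume, name, education, degree, gpa, skill), the sample data, and the function find_candidates are assumptions made for this example only; they are not taken from the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical resumes that already follow one common schema.
# All tag names and values here are illustrative assumptions.
RESUMES_XML = """
<resumes>
  <resume>
    <name>A. Student</name>
    <education><degree>Master</degree><gpa>3.7</gpa></education>
    <skills><skill>Java</skill><skill>SQL</skill></skills>
  </resume>
  <resume>
    <name>B. Student</name>
    <education><degree>Bachelor</degree><gpa>3.9</gpa></education>
    <skills><skill>Java</skill></skills>
  </resume>
  <resume>
    <name>C. Student</name>
    <education><degree>Master</degree><gpa>3.2</gpa></education>
    <skills><skill>Java</skill></skills>
  </resume>
</resumes>
"""


def find_candidates(xml_text, degree, min_gpa, skill):
    """Return names of resumes matching all three structural conditions."""
    root = ET.fromstring(xml_text)
    matches = []
    for resume in root.iter("resume"):
        # The degree and the GPA must appear in the same education entry.
        has_degree = any(
            edu.findtext("degree") == degree
            and float(edu.findtext("gpa", default="0")) >= min_gpa
            for edu in resume.iter("education")
        )
        has_skill = any(s.text == skill for s in resume.iter("skill"))
        if has_degree and has_skill:
            matches.append(resume.findtext("name"))
    return matches


print(find_candidates(RESUMES_XML, "Master", 3.5, "Java"))
```

A plain keyword search for "Master", "3.5" and "Java" could not express the condition that the GPA belongs to the Master's degree entry; the common structure is what makes that condition checkable.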
Some specific prior art related to the present invention is discussed below. These references describe methods of investigating the structure of documents and retrieving documents from large databases in response to user queries.
Two articles which describe attempts to discover structure from semi-structured data are “Identifying Aggregates in Hypertext Structures”, Proceedings of ACM Hypertext '91, pp. 63-74, and “Structural Properties of Hypertext”, Proceedings of the Ninth ACM Conference on Hypertext, pp. 180-187, 1998. These attempts, however, focus on the organization of a set of hypertext documents by following their links rather than considering the schematic nature of the individual documents.
Three other articles describing related investigations are: “Inferring Structure in Semistructured Data”, Workshop on Management of Semistructured Data, 1997; “Extracting Schema from Semistructured Data”, SIGMOD98, pp. 295-306; and “Discovering Association of Structure from Semistructured Objects”, IEEE Trans. on Knowledge and Data Engineering, 1999. However, these articles do not consider the schematic structure of individual documents or documents which have different schemas.
The patent to Driscoll (U.S. Pat. No. 5,694,592) teaches a method of querying and retrieving documents from a database using semantic knowledge about the query string to determine document relevancy.
The patent to Ishikawa (U.S. Pat. No. 5,848,407) describes a method of presenting potentially related hypertext document summaries to a user who is using a search engine that indexes a plurality of hypertext documents.
Whatever the precise merits and features of the prior art in this field, the earlier art does not achieve or fulfill the purposes of the present invention. The prior art does not provide for automatically identifying schematic structural and tag information from HTML documents and then converting these documents according to the extracted information.
SUMMARY OF THE INVENTION
The present invention describes a system and method that extracts keywords and structural information from hypertext or mark-up language documents (e.g. HTML) and then reformulates them as documents with a common structure and a common set of tags. One underlying goal is to convert a collection of HTML documents, written in different styles, into XML documents following a common schema. XML (eXtensible Markup Language) is a web standard that allows schemas to be described for different domains. For instance, one domain might be resumes, and a schema can be defined for describing all resumes. All resume documents can then be written using the structure and tags described by this schema. Thereafter, keyword based search engines will be able to support queries and retrieve documents that are schematically and semantically closer to the information users are looking for. Using a five-stage process, the common schematic structures are discovered for the set of HTML documents authored in various styles. Prior domain knowledge regarding punctuation, keywords, synonyms and HTML tags is used to 1) break a document up into separate objects, 2) identify the objects corresponding to keywords, 3) regroup objects into hierarchical layers of abstraction, 4) logically order objects at the same level of abstraction, and finally 5) remove any non-keyword related information from the document's discovered schematic structure. The discovered schema supports structural queries from search engines, which locate data that is more semantically related to the requested information than data located by simple keyword searching.
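The five stages listed above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the method claimed in the patent: the synonym table, the two-level concept hierarchy, the sibling order, and every function name are hypothetical, and the real stages are not specified in this section in enough detail to reproduce.

```python
import re
from html.parser import HTMLParser

# Hypothetical domain knowledge for resumes (all entries are assumptions).
SYNONYMS = {"education": "education", "academic background": "education",
            "skills": "skills", "technical skills": "skills",
            "degree": "degree", "gpa": "gpa"}
LEVEL = {"education": 1, "skills": 1, "degree": 2, "gpa": 2}  # 1 = section
ORDER = ["education", "skills", "degree", "gpa"]  # preferred sibling order


class TextCollector(HTMLParser):
    """Collects text chunks, treating every HTML tag as a separator."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def tokenize(html):
    """Stage 1: break the document into objects at tags and punctuation."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    objects = []
    for chunk in parser.chunks:
        objects += [p.strip() for p in re.split(r"[:;,]", chunk) if p.strip()]
    return objects


def label(objects):
    """Stage 2: map each object to a concept keyword, or None."""
    return [SYNONYMS.get(obj.lower()) for obj in objects]


def regroup(concepts):
    """Stage 3: nest each object under the most recent section concept."""
    root = {"concept": "root", "children": []}
    section = root
    for concept in concepts:
        node = {"concept": concept, "children": []}
        if concept is not None and LEVEL[concept] == 1:
            root["children"].append(node)
            section = node
        else:
            section["children"].append(node)
    return root


def order(node):
    """Stage 4: sort siblings by ORDER; non-keyword objects sort last."""
    def rank(n):
        return ORDER.index(n["concept"]) if n["concept"] in ORDER else len(ORDER)
    node["children"].sort(key=rank)
    for child in node["children"]:
        order(child)
    return node


def prune(node):
    """Stage 5: drop every object that is not tied to a keyword."""
    node["children"] = [prune(c) for c in node["children"]
                        if c["concept"] is not None]
    return node


def to_schema(node):
    """Render the tree as nested dicts of concept names."""
    return {c["concept"]: to_schema(c) for c in node["children"]}


def discover_schema(html):
    return to_schema(prune(order(regroup(label(tokenize(html))))))


html = ("<h2>Technical Skills</h2><p>Java, SQL</p>"
        "<h2>Academic Background</h2><p>Degree: Master; GPA: 3.7</p>")
print(discover_schema(html))
```

In this sketch the values themselves (Java, Master, 3.7) are discarded in stage 5 because only the schematic structure is being discovered; a second page that used different headings and tags for the same sections would map to the same nested structure through the synonym table.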


REFERENCES:
patent: 5317686 (1994-05-01), Salas et al.
patent: 5694592 (1997-12-01), Driscoll
patent: 5848407 (1998-12-01), Ishikawa et al.
patent: 5970490 (1999-10-01), Morgenstern
patent: 6336124 (2002-01-01), Alam et al.
patent: 6519617 (2003-02-01), Wanderski et al.
patent: 6523022 (2003-02-01), Hobbs
“Identifying Aggregates in Hypertext Structures”, ACM Conference on Hypertext, 3rd, San Antonio, Dec. 15-18, 1991, pp. 63-74.
“Structural Properties of Hypertext”, ACM Conference on Hypertext and Hypermedia, 9th, Pittsburgh, Jun. 20-24, 1998, Proceedings, pp. 180-187.
“Inferring Structure in Semistructured Data”, SIGMOD Record, vol. 26, issue 4, Dec. 1997, pp. 39-43.
“Extracting Schema from Semistructured Data”, SIGMOD Record, vol. 27, issue 2, Jun. 1998, pp. 295-306.
