Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-06-30
2001-03-06
Feild, Joseph H. (Department: 2776)
C707S793000
active
06199081
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to the field of software and, in particular, to methods and systems for retrieving data from network sites and processing that data according to its content.
BACKGROUND OF THE INVENTION
In recent years, there has been a tremendous proliferation of computers connected to a global network known as the Internet. A “client” computer connected to the Internet can download digital information from “server” computers connected to the Internet. Client application software executing on client computers typically accepts commands from a user and obtains data and services by sending requests to server applications running on server computers connected to the Internet. A number of protocols are used to exchange commands and data between computers connected to the Internet. The protocols include the File Transfer Protocol (FTP), the Hypertext Transfer Protocol (HTTP), the Simple Mail Transfer Protocol (SMTP), and the “Gopher” document protocol.
The HTTP protocol is used to access data on the World Wide Web, often referred to as “the Web.” The World Wide Web is an information service on the Internet providing documents and links between documents. The World Wide Web is made up of numerous Web sites around the world that maintain and distribute Web documents. A Web site may use one or more Web server computers that store and distribute documents in one of a number of formats including the Hypertext Markup Language (HTML).
An HTML document contains text and tags. HTML documents may also contain metadata and metatags. Metadata is data about data, and metatags define the metadata. Examples of metatags that identify metadata are “author,” “language,” and “character set.” HTML documents may also include tags that contain embedded “links” or “hyperlinks” that reference other data or documents located on the same or another Web server computer. The HTML documents and the documents referenced in the hyperlinks may include text, graphics, audio, or video in various formats.
A Web browser is a client application that communicates with server computers via HTTP, FTP, and Gopher protocols. Web browsers receive Web documents from the network and present them to a user. Internet Explorer, available from Microsoft Corporation, Redmond, Wash., is an example of a popular Web browser application.
An intranet is a local area network containing Web servers and client computers operating in a manner similar to that of the World Wide Web described above. Typically, all of the computers on an intranet are contained within a company or organization.
Web crawlers are computer programs that automatically retrieve numerous Web documents from one or more Web sites. A Web crawler processes the received data, preparing the data to be subsequently processed by other programs. For example, a Web crawler may use the retrieved data to create an index of documents available over the Internet or an intranet. A “search engine” can later use the index to locate Web documents that satisfy specified search criteria.
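As a rough illustration of the index and search step described above, the following Python sketch builds a simple inverted index from already-retrieved document text and answers a multi-word query. The function names, the example URLs, and the whitespace-based word splitting are assumptions made for illustration only; they are not taken from the patent.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical crawl results: URL -> extracted text.
docs = {
    "http://example.com/a": "web crawlers retrieve documents",
    "http://example.com/b": "search engines use an index of documents",
}
index = build_index(docs)
matches = search(index, "index documents")
```

A production index would normally also handle punctuation, stemming, and ranking, none of which this sketch attempts.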
It is desirable to have a mechanism in the crawler that allows the crawler to feed to client applications, like an indexing engine, a stream of data not directly present in the “crawled” documents. Preferably, such a mechanism would have the ability to modify data retrieved from Web documents with active components in order to allow the retrieved data to be processed more efficiently and accurately by the client application. The mechanism of the invention would also preferably have the ability to exclude a document from being indexed based on its content and properties. The present invention is directed to providing such a mechanism.
SUMMARY OF THE INVENTION
The present invention discloses a method and system for modifying a document data stream obtained by a gatherer process when an electronic document is retrieved from a computer. The gatherer process retrieves Web documents from Web servers that are connected to a computer network commonly known as the World Wide Web. Preferably, the Web crawler employs a filtering process to retrieve the document and to parse the document into a document data stream comprising contents and properties. For instance, when an HTML document is retrieved, the filtering process converts the document's text and tags to a uniform representation of the document's contents and properties. The document retrieval performed by the present invention is not limited to HTML documents. Many different document formats may be filtered to produce a uniform representation of contents and properties that are processed by the invention in the manner described below.
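The filtering step for an HTML document might be sketched as follows, using Python's standard html.parser module. The class name HtmlFilter, the function filter_html, and the dictionary layout with "contents" and "properties" keys are hypothetical choices for this sketch; the patent does not specify the format of the uniform representation.

```python
from html.parser import HTMLParser


class HtmlFilter(HTMLParser):
    """Collects visible text as contents and meta tags/title as properties."""

    def __init__(self):
        super().__init__()
        self.contents = []
        self.properties = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        content = attrs.get("content")
        if tag == "meta" and name and content is not None:
            # Meta tags such as "author" become named properties.
            self.properties[name.lower()] = content
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.properties["title"] = text
        else:
            self.contents.append(text)


def filter_html(html):
    """Convert HTML text into an assumed uniform contents/properties form."""
    parser = HtmlFilter()
    parser.feed(html)
    parser.close()
    return {"contents": " ".join(parser.contents),
            "properties": parser.properties}


page = ('<html><head><title>Report</title>'
        '<meta name="Author" content="Jane Doe"></head>'
        '<body><p>Hello <b>world</b></p></body></html>')
stream = filter_html(page)
```

Filters for other document formats could return the same dictionary shape, which is what would let later stages treat all documents alike.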
In accordance with the present invention, the retrieved contents and properties of a document are contained in a document data stream that is sequentially piped through one or more active plug-in components. The active plug-in components modify the document data stream by adding, deleting, or modifying the contents and properties of the document data stream. Active plug-ins are modeled in the invention as modular components, or “plug-ins,” that in an actual embodiment of the invention are software objects that can be plugged-in to a configuration entity called a gathering project. After the document data stream has been modified by the active plug-ins, the modified document data stream is piped to one or more consumer plug-ins. A consumer plug-in is an application that processes the modified document data stream. The processing conducted by the consumer plug-in may be influenced by the modifications made to the original document data stream by the active plug-ins.
Both active plug-ins and consumer plug-ins can be mixed and matched and plugged-in to the gathering project according to the goals of the project. Active plug-ins are inserted before any consumer plug-ins so that they may modify the original document data stream in a way that makes the document data stream more useful to the consumer plug-ins that follow the active plug-in in the gathering project. The gathering project can also be configured not to use any active plug-ins, in which case all data contained in the original document data stream will be piped directly to the consumer plug-ins that are plugged-in to the project.
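The ordering described above, with active plug-ins first and consumer plug-ins second, can be sketched in Python as shown below. This is a minimal sketch under stated assumptions: the document data stream is modeled as a dictionary, active plug-ins are plain functions, and an active plug-in signals exclusion by returning None. All names (tag_language, exclude_drafts, IndexingConsumer, run_project) and the None convention are hypothetical and are not drawn from the patent.

```python
def tag_language(doc):
    """Active plug-in: add a 'language' property when none is present."""
    props = dict(doc["properties"])
    props.setdefault("language", "en")
    return {"contents": doc["contents"], "properties": props}


def exclude_drafts(doc):
    """Active plug-in: exclude documents whose contents mark them as drafts."""
    if "DRAFT" in doc["contents"]:
        return None  # assumed convention: None means "do not pass on"
    return doc


class IndexingConsumer:
    """Consumer plug-in stand-in that just records what it receives."""

    def __init__(self):
        self.received = []

    def consume(self, doc):
        self.received.append(doc)


def run_project(doc, active_plugins, consumer_plugins):
    """Pipe doc through active plug-ins in order, then to each consumer.

    Returns True if the document reached the consumers, False if excluded.
    """
    for plugin in active_plugins:
        doc = plugin(doc)
        if doc is None:
            return False
    for consumer in consumer_plugins:
        consumer.consume(doc)
    return True


consumer = IndexingConsumer()
actives = [tag_language, exclude_drafts]
kept = run_project({"contents": "Final report", "properties": {}},
                   actives, [consumer])
dropped = run_project({"contents": "DRAFT notes", "properties": {}},
                      actives, [consumer])
```

Passing an empty list of active plug-ins sends the original stream straight to the consumers, which corresponds to the configuration without active plug-ins mentioned above.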
In accordance with other aspects of this invention, the gatherer process is an enhanced Web crawler that has one or more configuration entities called gathering projects. Each gathering project has its own transaction log, history map, plug-in list, and crawl restriction rules that the gatherer process uses to “crawl” Web documents that are stored on a plurality of Web servers connected to the World Wide Web. When the gatherer process retrieves a document, the gatherer process receives a copy of the content of the document, which may include data such as text, images, sound, and embedded properties.
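One way to picture a gathering project as a configuration entity is the Python data class below. The field names mirror the items listed above, but their types and behavior are assumptions made for illustration: the crawl restriction rules are reduced to a set of allowed host names, the transaction log to a list of (URL, status) pairs, and the history map to a dictionary from URL to last status. The patent text in this section does not define these structures.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class GatheringProject:
    """Hypothetical per-project state for a gatherer process."""

    name: str
    allowed_hosts: set = field(default_factory=set)       # simplified crawl restriction rules
    plugin_list: list = field(default_factory=list)       # active plug-ins, then consumers
    transaction_log: list = field(default_factory=list)   # (url, status) entries, in order
    history_map: dict = field(default_factory=dict)       # url -> most recent status

    def may_crawl(self, url):
        """Apply the (assumed) host-based crawl restriction rule."""
        return urlparse(url).hostname in self.allowed_hosts

    def record(self, url, status):
        """Log a crawl attempt and remember the latest status for the URL."""
        self.transaction_log.append((url, status))
        self.history_map[url] = status


project = GatheringProject(name="intranet", allowed_hosts={"example.com"})
for url in ["http://example.com/a", "http://other.test/b"]:
    if project.may_crawl(url):
        project.record(url, "retrieved")
    else:
        project.record(url, "skipped")
```

Keeping this state per project is what would allow several projects with different rules and plug-in lists to share a single gatherer process.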
An example of a client application that makes use of embedded properties is a Web browser that reads HTML tags embedded in a Web document to format the document and to specify hyperlinks to other Web documents. In addition to tags that provide formatting information, the document may also contain meta-tags, which are used to define meta-data in the document. For instance, a meta-tag “Author” may identify meta-data in the document that identifies the author of the document. Tags may either conform to “markup languages” such as HTML, SGML, XML and VRML, which are widely known to those skilled in the art, or tags can be defined as “extensions” to a markup language and embedded in documents for the use of specific client applications. An example of a client application that recognizes an extended set of property definitions is the Internet Explorer, a Web browser available from Microsoft Corporation, Redmond, Wash.
When the gatherer process retrieves a Web document, it first uses a filtering process to retrieve the Web document according to the appropriate protocol. The filter process then converts the text and tags retrieved from the document into a uniform representation of the document's conten
Meyerzon Dmitriy
Nichols William G.
Christensen O'Connor Johnson & Kindness PLLC
Feild Joseph H.
Microsoft Corporation
Automatic tagging of documents and exclusion by content