Reexamination Certificate
2000-02-25
2004-11-23
Hong, Stephen S. (Department: 2178)
Data processing: presentation processing of document, operator i
Presentation processing of document
Layout
C715S252000, C715S252000
active
06823492
ABSTRACT:
BACKGROUND
The present invention relates to indexing structures to facilitate computerized searches through data. More specifically, the present invention relates to a method and an apparatus for generating an index to facilitate searching through data within a document based upon a predefined index stylesheet associated with the document that contains instructions for creating an index for the document.
The explosive growth of the Internet has been strongly tied to the development of search engines that allow users to rapidly search through large volumes of textual data from thousands and even millions of different web sites. A user who is interested in a particular topic merely has to enter a number of keywords into a search engine in order to receive links to different web pages containing those keywords.
Search engines typically create an “index” of documents (such as web pages) that are available on the world wide web. An index typically stores individual words (or other meaning-carrying textual strings) in a more compact and easily searchable form known as “tokens.”
The process of building an effective index can be greatly complicated by the fact that documents can exist in a wide variety of different forms, which need to be indexed differently. For example, an efficient index for a technical paper might contain the abstract and title of the technical paper, but not the body of the technical paper, whereas an efficient index for a television schedule might contain ratings for individual television programs.
The process of creating an index is also complicated by the fact that for common document formats, such as the Hypertext Markup Language (HTML) or the Extensible Markup Language (XML), much of the important information for search purposes is stored within attribute fields, and is not within the normal text of the document.
Furthermore, the structure of a document may change over time, which can require the structure of the index to change. For example, suppose the structure of a product catalog is updated to include consumer reviews for individual products. This change may require the index to change to include these consumer reviews.
Existing systems create indexes for documents using ad hoc rules. For example, one ad hoc rule is to create an index for all textual information that is not within attribute fields. Unfortunately, such ad hoc rules often include much unimportant information in the index, and often exclude important information.
A similar problem exists in converting the document into tokens (tokenizing the document) during the index creation process. During the index creation process, relevant portions of a document are converted into tokens associated with individual meaning-carrying units of text, such as wordforms or numbers. In the English language, wordforms are typically delineated by white spaces and punctuation marks. Hence, the tokenizing process is relatively easy. In contrast, languages such as Japanese have no such delineation. Consequently, the tokenization process depends on contextual information and can be very complicated.
The tokenization process can also be domain dependent. For example, periods within an email address, such as “person.dept@companyx.com,” are linking elements, whereas periods within other textual information typically delineate word and sentence boundaries.
Hence, the tokenization process varies between languages and between domains.
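The email example can be illustrated with a minimal sketch of a domain-aware tokenizer for English text. The regular expressions and the function name tokenize are assumptions made for illustration and are not taken from the patent; the email pattern is deliberately simplified.

```python
import re

# Assumed, simplified pattern: an email address is kept as one token,
# so the periods inside it act as linking elements.
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
# Fallback: a run of letters, digits, or apostrophes is one wordform.
WORD = r"[A-Za-z0-9']+"
# The email alternative is listed first so it is tried before the word rule.
TOKEN_RE = re.compile(f"{EMAIL}|{WORD}")


def tokenize(text):
    """Split English text into lowercase tokens, keeping email addresses whole."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


print(tokenize("Write to person.dept@companyx.com. Thanks."))
# ['write', 'to', 'person.dept@companyx.com', 'thanks']
```

The final period after the address is still treated as a sentence boundary, because the email pattern requires at least one character after each period it consumes. A language such as Japanese would need a different tokenizer altogether, which is the point of making the tokenizer selectable.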
SUMMARY
One embodiment of the present invention provides a system that generates an index to facilitate searching through text within a document based upon an index stylesheet associated with the document. The system operates by receiving a document to be indexed and then parsing the document to produce a parsed document. The system also retrieves instructions for creating the index for the document from an index stylesheet associated with the document. The system creates the index for the document by transforming the parsed document in a manner that is specified by the instructions retrieved from the index stylesheet.
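The three steps described above (parse the document, retrieve instructions from the stylesheet, transform the parsed document into an index) can be sketched as follows. This is only an illustration under stated assumptions: the index-stylesheet XML format, the sample document, and the function names load_instructions and create_index are hypothetical and are not the format defined by the patent.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<paper>
  <title>Index Stylesheets</title>
  <abstract>Rules that drive indexing.</abstract>
  <body>Long text that should stay out of the index.</body>
</paper>"""

# Hypothetical stylesheet format: each <index> rule names an element
# whose text should be placed in the index.
STYLESHEET = """<index-stylesheet>
  <index element="title"/>
  <index element="abstract"/>
</index-stylesheet>"""


def load_instructions(stylesheet_xml):
    """Return the element names that the stylesheet says to index."""
    root = ET.fromstring(stylesheet_xml)
    return [rule.get("element") for rule in root.findall("index")]


def create_index(document_xml, stylesheet_xml):
    """Map each token to the set of element names in which it occurs."""
    parsed = ET.fromstring(document_xml)           # step 1: parse the document
    elements = load_instructions(stylesheet_xml)   # step 2: retrieve instructions
    index = {}
    for name in elements:                          # step 3: transform into an index
        for node in parsed.iter(name):
            for raw in (node.text or "").lower().split():
                token = raw.strip(".,;:")
                if token:
                    index.setdefault(token, set()).add(name)
    return index


index = create_index(DOCUMENT, STYLESHEET)
print(sorted(index))
```

Because the body element is not named in the stylesheet, none of its words reach the index, which mirrors the technical-paper example in the background section.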
In one embodiment of the present invention, retrieving the index stylesheet involves retrieving the index stylesheet across a network from a remote address.
In one embodiment of the present invention, the index stylesheet is appended to the document.
In one embodiment of the present invention, the system additionally makes the index available to a search engine so that the search engine is able to scan through the index.
In one embodiment of the present invention, the index stylesheet specifies sections of the document to skip in creating the index for the document.
In one embodiment of the present invention, the index stylesheet specifies attributes of the document that are to be included in the index.
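The two preceding embodiments, skipping sections and including attributes, can be sketched together. The rule sets SKIP_ELEMENTS and INDEX_ATTRIBUTES stand in for what an index stylesheet might specify; they, the sample page, and the function collect_text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules that an index stylesheet might express.
SKIP_ELEMENTS = {"nav", "footer"}      # sections to leave out of the index
INDEX_ATTRIBUTES = {"alt", "title"}    # attribute values to include


def collect_text(node, out):
    """Append indexable text fragments under node to out, in document order."""
    if node.tag in SKIP_ELEMENTS:
        return  # prune the whole subtree
    for name, value in node.attrib.items():
        if name in INDEX_ATTRIBUTES:
            out.append(value)
    if node.text and node.text.strip():
        out.append(node.text.strip())
    for child in node:
        collect_text(child, out)
        # Tail text follows the child but belongs to the parent element,
        # so it is kept even when the child itself was skipped.
        if child.tail and child.tail.strip():
            out.append(child.tail.strip())


PAGE = """<page>
  <nav>Home | About</nav>
  <p>Schedule for <img src="tv.png" alt="television"/> tonight</p>
  <footer>Copyright notice</footer>
</page>"""

parts = []
collect_text(ET.fromstring(PAGE), parts)
print(parts)
# ['Schedule for', 'television', 'tonight']
```

The alt attribute value is indexed even though it never appears in the normal text of the page, while the src attribute and the navigation and footer sections are ignored.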
In one embodiment of the present invention, the system receives additional documents to be indexed, and creates indexes for the additional documents using the index stylesheet.
In one embodiment of the present invention, creating the index for the document involves tokenizing the document by partitioning text within the document into individual meaning-carrying units of text.
In one embodiment of the present invention, prior to receiving the document, the system downloads and parses an index configuration file which specifies the index stylesheet to be used in creating the index.
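As a rough illustration of this embodiment, the sketch below parses a configuration once, before any document arrives, and then uses it to choose a stylesheet. The JSON layout, the document-type keys, the file paths, and the function stylesheet_for are all hypothetical; the source does not describe the format of the index configuration file.

```python
import json

# Hypothetical configuration format: maps a document type to the
# index stylesheet that should be used for it.
CONFIG_TEXT = """{
  "default": "stylesheets/generic.xsl",
  "stylesheets": {
    "catalog": "stylesheets/catalog.xsl",
    "tv-schedule": "stylesheets/tv.xsl"
  }
}"""


def stylesheet_for(config, doc_type):
    """Return the stylesheet path configured for doc_type, or the default."""
    return config["stylesheets"].get(doc_type, config["default"])


# Parsed once, prior to receiving any document.
config = json.loads(CONFIG_TEXT)

print(stylesheet_for(config, "catalog"))
print(stylesheet_for(config, "memo"))
```

Keeping the mapping in a configuration file means that a change to a document's structure, such as adding consumer reviews to a catalog, only requires pointing the relevant entry at an updated stylesheet.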
In one embodiment of the present invention, the system receives the document from a client at an indexing server that creates the index for the client.
REFERENCES:
patent: 5471677 (1995-11-01), Imanaka
patent: 5710978 (1998-01-01), Swift
patent: 5819273 (1998-10-01), Vora et al.
patent: 5899975 (1999-05-01), Nielsen
patent: 5931940 (1999-08-01), Shelton et al.
patent: 5983248 (1999-11-01), DeRose et al.
patent: 6067543 (2000-05-01), Burrows
patent: 6067618 (2000-05-01), Weber
patent: 6076051 (2000-06-01), Messerly et al.
patent: 6119120 (2000-09-01), Miller
patent: 6154738 (2000-11-01), Call
patent: 6263332 (2001-07-01), Nasr et al.
patent: 6336117 (2002-01-01), Massarani
patent: 6587547 (2003-07-01), Zirngibl et al.
patent: 6591271 (2003-07-01), Ceri et al.
patent: 6675354 (2004-01-01), Claussen et al.
patent: 0 964 344 (1999-12-01), None
Publication, entitled “XSLT in document indexing,” by Jacek Ambroziak, XP-002165125, pp. 1-14.
Publication, entitled “Managing tokenizers in XML search,” by Jacek Ambroziak, XP-002165124, pp. 1-6.
Publication, entitled “XML tools and architecture for Named Entity Recognition,” by Andrei Mikheev, et al., XP-000863186, 1999, pp. 89-113.
Publication, entitled “Conceptually Assisted Web Browsing,” by Jacek Ambroziak, XP-002165122, pp. 1-7.
Publication, entitled “Acoi: A System for Indexing Multimedia Objects,” by Menzo Windhouwer, et al., XP-002165123, pp. 1-10.
Hong Stephen S.
Park Vaughan & Fleming LLP
Schlaifer Jonathan
Sun Microsystems Inc.