Data processing: presentation processing of document – operator i – Presentation processing of document – Layout
Reexamination Certificate
1998-08-10
2003-12-16
Hong, Stephen S. (Department: 2178)
Data processing: presentation processing of document, operator i
Presentation processing of document
Layout
C715S252000, C707S793000
Reexamination Certificate
active
06665837
ABSTRACT:
FIELD OF THE INVENTION
This invention relates generally to computerized information retrieval, and more particularly to identifying related pages in a hyperlinked database environment such as the World Wide Web.
BACKGROUND OF THE INVENTION
It has become common for users of host computers connected to the World Wide Web (the “Web”) to employ Web browsers and search engines to locate Web pages having specific content of interest to users. A search engine, such as Digital Equipment Corporation's AltaVista search engine, indexes hundreds of millions of Web pages maintained by computers all over the world. The users of the hosts compose queries, and the search engine identifies pages that match the queries, e.g., pages that include key words of the queries. These pages are known as a “result set.”
In many cases, particularly when a query is short or not well defined, the result set can be quite large, for example, thousands of pages. The pages in the result set may or may not satisfy the user's actual information needs. Therefore, techniques have been developed to identify a smaller set of related pages.
In one prior art technique used by the Excite search engine, please see “http://www.excite.com,” users first form an initial query, using the standard query syntax for the Excite search engine that attempts to specify a topic of interest. After the result set has been returned, the user can use a “Find Similar” option to locate related pages. However, there the finding of the related pages is not fully automatic because the user first is required to form a query, before related pages can be identified. In addition, that technique only works on the Excite search engine and for the specific subset of Web pages, it provides related pages that are indexed by the Excite search engine.
In another prior art technique, an algorithm for connectivity analysis of a neighborhood graph (n-graph) is described by Kleinberg in “Authoritative Sources in a Hyperlinked Environment,” Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998, and also in IBM Research Report RJ 10076, May 1997, see, “http://www.cs.cornell.edu/Info/People/kleinber/auth.ps”. The Kleinberg algorithm analyzes the link structure, or connectivity of Web pages “in the vicinity” of the result set to suggest useful pages in the context of the search that was performed.
The vicinity of a Web page is defined by the hyperlinks that connect the page to others. A Web page can point to other pages, and the page can be pointed to by other pages. Close pages are directly linked, farther pages are indirectly linked via intermediate pages. This connectivity can be expressed as a graph where nodes represent the pages, and the directed edges represent the links. The vicinity of all the pages in the result set, up to a certain distance, is called the neighborhood graph.
Specifically, the Kleinberg algorithm attempts to identify “hub” pages and “authority” pages in the neighborhood graph for a user query. Hubs and authorities exhibit a mutually reinforcing relationship.
The Kleinberg paper cited above also describes an algorithm that can be used to determine related pages by starting with a single page. The algorithm works by first finding a set of pages that point to the page, and then running the base algorithm on the resulting graph. However, this algorithm for finding related pages differs from our invention in that it does not deal with popular URLs, with neighborhood graphs containing duplicate pages, or with cases where the computation is totally dominated by a single “hub” page, nor does the algorithm include an analysis of the contents of pages when it is computing the most related pages.
The CLEVER Algorithm is a set of extensions to Kleinberg's algorithm, see S.Chakrabarti et al, “Experiments in Topic Distillation,” ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, Melbourne, Australia, 1998. The goal of the CLEVER algorithm is to distill the most important sources of information from a collection of pages about a topic.
In U.S. patent application Ser. No. 09/007,635 “Method for Ranking Pages Using Connectivity and Content Analysis” filed by Bharat et al. on Jan. 15, 1998, a method is described that examines both the connectivity and the content of pages to identify useful pages. However, the method is relatively slow because all pages in the neighborhood graph are fetched in order to determine their relevance to the query topic. This is necessary to reduce the effect of non-relevant pages in the subsequent connectivity analysis phase.
In U.S. patent application Ser. No. 09/058,577 “Method for Ranking Documents in a Hyperlinked Environment using Connectivity and Selective Content Analysis” filed by Bharat et al. on Apr. 9, 1998, now U.S. Pat. No. 6,112,203, a method is described which performs content analysis on only a small subset of the pages in the neighborhood graph to determine relevance weights, and pages with low relevance weights are pruned from the graph. Then, the pruned graphed is ranked according to a connectivity analysis. This method still requires the result set of a query to form a query topic.
Therefore, there is a need for a method for identifying related pages in a linked database that does not require a query and the fetching of many unrelated pages.
SUMMARY OF THE INVENTION
Provided is a method for identifying related pages among a plurality of pages in a linked database such as the World Wide Web. An initial page is selected from the plurality of pages by specifying the URL of the page or clicking on the page using a Web browser in a convenient manner.
Pages linked directly or indirectly to the initial page are represented as a neighborhood graph in a memory. The pages represented in the graph are scored on content using a similarity measurement using a topic extracted from a chosen subset of the represented pages.
A set of pages is selected from the pages in the graph, the selected set of pages having scores greater than a first predetermined threshold and do not belong to a predetermined list of “stop URLs.” Stop URLs are highly popular, general purpose sites such as search engines. The selected set of pages is then scored on connectivity, and a subset of the set of pages that have scores greater than a second predetermined threshold are selected as related pages. Finally, during an optional pass, content analysis can be done on highly ranked pages to determine which pages have high content scores.
REFERENCES:
patent: 5418948 (1995-05-01), Turle
patent: 5594897 (1997-01-01), Goffman
patent: 5724567 (1998-03-01), Rose et al.
patent: 5855015 (1998-12-01), Shoham
patent: 5895470 (1999-04-01), Pirolli et al.
patent: 5905863 (1999-05-01), Knowles et al.
patent: 5933823 (1999-08-01), Cullen et al.
patent: 5991713 (1999-11-01), Unger et al.
patent: 6073135 (2000-06-01), Broder et al.
patent: 6112202 (2000-08-01), Kleinberg
patent: 6112203 (2000-08-01), Bharat et al.
patent: 6115718 (2000-09-01), Huberman et al.
patent: 6138113 (2000-10-01), Dean et al.
patent: 6334145 (2001-12-01), Adams et al.
Guinan et al., Infromation Retrieval from Hypertext Using Dynamically Planned Guided Tours, ACM 1992, pp. 122-130.*
Salton et al., Selective Text Utilization and Text Traversal, ACM 1993, pp. 131-144.*
Chekuri et al, Web Search Using Automatic Classification, Google, Dec. 1996, pp. 1-11.*
Kleinberg, Authoritative Sources in a Hyperlinked Environment, Google, May 1997, pp. 668-677.
Broder Andrei Z.
Dean Jeffrey
Henzinger Monika R.
Brown & Raysman Millstein Felder & Steiner LLP
Hong Stephen S.
Huynh Cong-Lac
Overture Services Inc.
LandOfFree
Method for identifying related pages in a hyperlinked database does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method for identifying related pages in a hyperlinked database, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method for identifying related pages in a hyperlinked database will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3160455