Reexamination Certificate
1999-09-29
2001-09-04
Homere, Jean R. (Department: 2777)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C707S793000, C707S960000
active
06286018
ABSTRACT:
FIELD OF THE INVENTION
The present invention is related to the field of analysis of linked collections of documents, and in particular to predicting the relevance of documents in the linked collection based on citation structures.
BACKGROUND OF THE INVENTION
The ever-increasing universe of electronic information, for example as found on the World Wide Web (hereinafter referred to as the Web), competes for the effectively fixed and limited attention of people. Both consumers and producers of information want to understand what kinds of information are available, how desirable it is, and how its content and use change through time.
Making sense of very large collections of linked documents and foraging for information in such environments is difficult without specialized aids. Collections of linked documents are often connected together using hypertext links. The basic structure of linked hypertext is designed to promote the process of browsing from one document to another along hypertext links, which is unfortunately very slow and inefficient when hypertext collections become very large and heterogeneous. Two sorts of aids have evolved in such situations. The first are structures or tools that abstract and cluster information in some form of classification system. Examples of such would be library card catalogs and the Yahoo! Web site (URL http://www.yahoo.com). The second are systems that attempt to predict the information relevant to a user's needs and to order the presentation of information accordingly. Examples would include search engines such as Lycos (URL: http://www.lycos.com), which take a user's specifications of an information need, in the form of words and phrases, and return ranked lists of documents that are predicted to be relevant to the user's need.
Another system which provides aids in searching for information on the Web is the “Recommend” feature provided on the Alexa Internet Web site (URL: http://www.alexa.com). The “Recommend” feature provides a list of related Web pages that a user may want to retrieve and view based on the Web page that they are currently viewing.
It has been determined that one way to facilitate information seeking is through automatic categorization of Web pages. One technique for categorization of Web pages is described by P. Pirolli, J. Pitkow and R. Rao in the publication entitled “Silk from a Sow's Ear: Extracting Usable Structures from the Web”, Conference on Human Factors in Computing Systems (CHI 96), Vancouver, British Columbia, Canada, April 1996. Described therein is a categorization technique wherein each Web page is represented as a feature vector, with features extracted from information about text-content similarity, hypertext connections, and usage patterns. Web pages belonging to the same category may then be clustered together. Categorization is computed based on inter-document similarities among these feature vectors.
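The idea of comparing pages through their feature vectors can be illustrated with a minimal sketch. It assumes cosine similarity as the inter-document similarity measure, which the passage above does not specify, and the three feature values per page (text-content, hyperlink, and usage features scaled to the range 0 to 1) are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors: [text feature, link feature, usage feature].
index_page = [0.9, 0.8, 0.1]
toc_page = [0.8, 0.9, 0.2]
paper_page = [0.1, 0.2, 0.9]

# Pages with similar feature profiles score close to 1 and would be
# candidates for the same category; dissimilar pages score much lower.
sim_index_toc = cosine_similarity(index_page, toc_page)
sim_index_paper = cosine_similarity(index_page, paper_page)
```

In this sketch the two navigation-style pages score far higher against each other than either does against the content page, which is the kind of signal a clustering step could then group on.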
Another aid for making sense of such collections is clustering. One way to approach the automatic clustering of linked documents is to adapt the existing approaches of clustering standard text documents. Such an approach is described by Cutting et al., in the publication entitled “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections”, The 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318-329, August 1992. However, there are several impracticalities with such existing text-clustering techniques. Text-based clustering typically involves computing inter-document similarities based on content-word frequency statistics. Not only is this often expensive, but, more importantly, the technique was developed and tuned on human-readable texts. It appears, though, that the proportion of human-readable source files for Web pages is decreasing with the infusion of dynamic and programmed pages.
Another option for performing clustering of document collections is to look at usage patterns. Unfortunately, any clustering based on usage patterns requires access to data that is not usually recorded in any easily accessible format. In the case of the Web, while a moderate amount of usage information is recorded for each requested document at a particular Web site, the log files for other sites are not publicly accessible. Thus while the usage for a particular site can be ascertained, this information is not available for the other 500,000 Web sites that currently exist.
Other attempts at clustering hypertext typically utilize the hypertext link topology of the collection. Such techniques are described by R. A. Botafogo, E. Rivlin, and B. Shneiderman, “Structural Analysis of Hypertexts: Identifying Hierarchies and Useful Metrics”, ACM Transactions on Information Systems, 10(2):142-180, 1992. Such a basis for clustering makes intuitive sense since the links of a particular document represent what the author felt was of interest to the reader of the document. These known clustering methods have been applied to collections with several hundred elements, and do not seem suited to scale gracefully to large heterogeneous collections like the Web, which has been estimated to contain over 70 million text-based documents.
Other publications relevant to the invention of the present application:
Larson, Ray R., “Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace”, Proceedings of the 59th ASIS Annual Meeting held in Baltimore, Maryland, edited by Steve Hardin, Vol. 33:71-78, Information Today Inc., 1996.
SUMMARY OF THE INVENTION
A method and apparatus combining spreading activation and citation analysis techniques to find related documents in collections of linked documents is disclosed. Spreading activation is an analysis technique that may be used to find documents relevant to a set of focus documents. Citation analysis is an analysis technique used to indicate a reference or link relationship amongst the documents in a collection of linked documents. The results of citation analysis are used in spreading activation as an indicator of the strength of association amongst the documents in the document collection. When spreading activation is performed, an indication of documents relevant to the set of focus documents, based on how documents are referenced or linked, is obtained.
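As an illustration of how citation analysis can yield a strength of association, the following minimal sketch counts co-citations: two documents are treated as more strongly associated the more often a third document links to both. Co-citation counting is one common form of citation analysis and is an assumption here, since the summary does not name a specific measure. The function name `cocitation_counts` and the sample link data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cocitation_counts(outlinks):
    """Count, for each pair of documents, how many documents link to both.

    outlinks maps a citing document to the list of documents it links to.
    Keys of the result are (doc_a, doc_b) tuples with doc_a < doc_b.
    """
    counts = defaultdict(int)
    for cited in outlinks.values():
        # Deduplicate and sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical link structure: three pages and the documents they cite.
links = {
    "p1": ["a", "b", "c"],
    "p2": ["a", "b"],
    "p3": ["b", "c"],
}
strength = cocitation_counts(links)
```

Here documents `a` and `b` are cited together by two pages, so they receive a higher association strength than `a` and `c`, which are cited together only once.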
The method of the present invention is generally comprised of the steps of: generating initial activation information, said initial activation information indicating a set of focus documents in said collection of linked documents; generating citation information from the documents in said collection of linked documents, said citation information indicating a strength of association between documents in said collection of linked documents; generating link probability information from said usage data, said link probability information indicating a distribution of the number of documents a user will access in said collection of linked documents; performing a spreading activation operation based on said initial activation information, citation information and said probability information based on a network representation of said collection of linked documents; and extracting said document relevance information resulting from said spreading activation step when a stable pattern of activation across all nodes of said network representation of said collection of linked documents is reached.
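The steps above can be sketched as a simple iteration over a network of documents. This is a minimal illustration under stated assumptions, not the claimed method: the update rule, the normalization of each document's outgoing weights, the single damping parameter `alpha` (a hypothetical stand-in for the link probability information), and the convergence tolerance are all assumptions, as are the function name and the sample weights.

```python
def spread_activation(strength, focus, alpha=0.5, tol=1e-9, max_iter=1000):
    """Iterate activation over a document network until it stabilizes.

    strength: maps (doc_a, doc_b) pairs of distinct documents to a
              positive, symmetric association weight (assumed input).
    focus:    maps focus documents to the activation injected each step.
    alpha:    hypothetical damping factor in (0, 1).
    """
    docs = sorted({d for pair in strength for d in pair} | set(focus))
    neighbors = {d: {} for d in docs}
    for (a, b), w in strength.items():
        if w <= 0:
            continue  # ignore non-positive weights
        neighbors[a][b] = neighbors[a].get(b, 0.0) + w
        neighbors[b][a] = neighbors[b].get(a, 0.0) + w
    total = {d: sum(neighbors[d].values()) for d in docs}

    activation = {d: focus.get(d, 0.0) for d in docs}
    for _ in range(max_iter):
        updated = {}
        for d in docs:
            # Each neighbor passes on a share of its activation in
            # proportion to the weight of its association with d.
            inflow = sum(activation[n] * w / total[n]
                         for n, w in neighbors[d].items())
            updated[d] = focus.get(d, 0.0) + alpha * inflow
        delta = max(abs(updated[d] - activation[d]) for d in docs)
        activation = updated
        if delta < tol:  # stable pattern of activation reached
            break
    return activation


# Hypothetical association strengths and a single focus document "a".
weights = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 2, ("c", "d"): 1}
result = spread_activation(weights, {"a": 1.0})
ranked = sorted(result, key=result.get, reverse=True)
```

Documents that end with higher activation are the ones this sketch would report as more relevant to the focus set. Because `alpha` is below 1 and each document's outgoing shares sum to 1, the iteration settles rather than growing without bound.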
REFERENCES:
patent: 5418948 (1995-05-01), Turtle
patent: 5568640 (1996-10-01), Nishiyama et al.
patent: 5594897 (1997-01-01), Goffman
patent: 5668988 (1997-09-01), Chen et al.
patent: 5675819 (1997-10-01), Schuetze
patent: 5717922 (1998-02-01), Hohensee et al.
patent: 5754939 (1998-05-01), Herz et al.
patent: 5819258 (1998-10-01), Vaithyanathan et al.
patent: 5870552 (1999-02-01), Dozier et al.
patent: 5895470 (1999-04-01), Pirolli et al.
patent: 5920859 (1999-07-01), Li
Botafogo et al., “Structural Analysis of Hypertexts: Identifying Hierarchies and Useful Metrics”, ACM T
Pirolli Peter L.
Pitkow James E.
Homere Jean R.
Xerox Corporation