Clustering hypertext with applications to web searching

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C715S252000

Reexamination Certificate

active

06684205

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to search programs and more particularly to an improved search method and system which clusters hypertext documents.
2. Description of the Related Art
The World Wide Web has attained a gargantuan size (Lawrence, S., and Giles, C. L. Searching the World Wide Web. Science 280, 5360 (1998), 98, incorporated herein by reference) and a central place in the information economy of today. Hypertext is the lingua franca of the web. Moreover, scientific literature, patents, and law cases may be thought of as logically hyperlinked. Consequently, searching and organizing unstructured collections of hypertext documents is a major contemporary scientific and technological challenge.
Given a “broad-topic query” (Kleinberg, J. Authoritative sources in a hyperlinked environment, in ACM-SIAM SODA (1998), incorporated herein by reference), a typical search engine may return a large number of relevant documents. Without effective summarization, it is a hopeless and enervating task to sort through all the returned documents in search of high-quality, representative information resources. Therefore, there is a need for an automated system that summarizes the large volume of hypertext documents returned during Internet searches.
SUMMARY OF THE INVENTION
It is, therefore, an object of the present invention to provide a structure and method for searching a database of documents, comprising:
performing a search of the database using a query to produce query result documents;
constructing a word dictionary of words within the query result documents;
pruning function words from the word dictionary;
forming first vectors for words remaining in the word dictionary;
constructing an out-link dictionary of documents within the database that are pointed to by the query result documents;
adding the query result documents to the out-link dictionary;
pruning documents from the out-link dictionary that are pointed to by fewer than a first predetermined number of the query result documents;
forming second vectors for documents remaining in the out-link dictionary;
constructing an in-link dictionary of documents within the database that point to the query result documents;
adding the query result documents to the in-link dictionary;
pruning documents from the in-link dictionary that point to fewer than a second predetermined number of the query result documents;
forming third vectors for documents remaining in the in-link dictionary;
normalizing the first vectors, the second vectors, and the third vectors to create vector triplets for documents remaining in the in-link dictionary and the out-link dictionary; and
clustering the vector triplets using the following four-step toric k-means process:
(a) arbitrarily segregating the vector triplets into clusters,
(b) for each cluster, computing a set of concept triplets describing the cluster,
(c) re-segregating the vector triplets into a more coherent set of clusters, obtained by putting each vector triplet into the cluster corresponding to the concept triplet that is closest to, that is, most similar to, the given vector triplet,
(d) repeating steps (b)-(c) until the coherence of the obtained clusters no longer significantly increases.
The process concludes by annotating the clusters using nuggets of information, the nuggets including summary, breakthrough, review, keyword, citation, and reference.
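Steps (a)-(d) above can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions that the text here does not specify: each concept triplet is taken to be the normalized component-wise centroid of a cluster, similarity is taken to be a weighted sum of the three per-component inner products, coherence is the total similarity of all triplets to their assigned concepts, and the "arbitrary" initial segregation is a simple round-robin. The names `toric_kmeans`, `weights`, `tol`, and `max_iter` are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are copied unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def similarity(triplet, concept, weights):
    """Assumed similarity: weighted sum of word, out-link, and in-link inner products."""
    return sum(w * dot(t, c) for w, t, c in zip(weights, triplet, concept))


def concept_triplet(members):
    """Step (b): normalized centroid of each of the three components of a cluster."""
    concept = []
    for component in range(3):
        columns = zip(*(member[component] for member in members))
        concept.append(normalize([sum(column) for column in columns]))
    return concept


def toric_kmeans(triplets, k, weights=(1 / 3, 1 / 3, 1 / 3), tol=1e-6, max_iter=100):
    """Cluster (word, out-link, in-link) vector triplets into k clusters."""
    if len(triplets) < k:
        raise ValueError("need at least k triplets")
    # Step (a): arbitrary initial segregation (round-robin, an assumption).
    labels = [i % k for i in range(len(triplets))]
    concepts = [None] * k
    previous = -math.inf
    for _ in range(max_iter):
        # Step (b): recompute concept triplets; an emptied cluster keeps its old concept.
        for j in range(k):
            members = [t for t, label in zip(triplets, labels) if label == j]
            if members:
                concepts[j] = concept_triplet(members)
        # Step (c): move each triplet to the cluster with the most similar concept.
        labels = [
            max(range(k), key=lambda j: similarity(t, concepts[j], weights))
            for t in triplets
        ]
        # Step (d): stop once coherence no longer increases by more than tol.
        coherence = sum(
            similarity(t, concepts[label], weights)
            for t, label in zip(triplets, labels)
        )
        if coherence - previous <= tol:
            break
        previous = coherence
    return labels, concepts
```

Calling `toric_kmeans(triplets, k)` returns one cluster label per triplet together with the concept triplets. The input triplets are expected to be normalized already, for example with `normalize` applied to each component; changing `weights` shifts the relative influence of words, out-links, and in-links on the clustering.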
The summary comprises a document in a cluster having a most typical word feature vector amongst all the documents in the cluster. The breakthrough comprises a document in a cluster having a most typical in-link feature vector amongst all the documents in the cluster. The review comprises a document in a cluster having a most typical out-link feature vector amongst all the documents in the cluster. The keyword comprises a word in a word dictionary for the cluster that has the largest weight. The citation comprises a document in a cluster representing a most typical in-link into a cluster. The reference comprises a document in a cluster representing a most typical out-link out of a cluster.
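The selection of the six nuggets might be sketched as follows. This is an illustration under stated assumptions rather than the patented method: "most typical" is read as having the largest inner product with the corresponding component of the cluster's concept triplet, the summary is assumed to be chosen by the word feature vector, and the keyword, reference, and citation are taken to be the dictionary entries with the largest weight in the word, out-link, and in-link concept vectors respectively. The function `annotate_cluster` and all of its parameters are hypothetical names.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def annotate_cluster(doc_ids, triplets, concept, words, out_links, in_links):
    """Pick the six nuggets for one cluster.

    doc_ids   -- identifiers of the documents in the cluster
    triplets  -- triplets[i] = (word_vec, out_link_vec, in_link_vec) for doc_ids[i]
    concept   -- (word_concept, out_link_concept, in_link_concept) of the cluster
    words, out_links, in_links -- dictionary entries indexing each vector's dimensions
    """

    def most_typical(component):
        # Document whose component vector is most similar to the cluster concept.
        scores = [dot(t[component], concept[component]) for t in triplets]
        return doc_ids[argmax(scores)]

    return {
        "summary": most_typical(0),        # most typical word feature vector (assumed)
        "review": most_typical(1),         # most typical out-link feature vector
        "breakthrough": most_typical(2),   # most typical in-link feature vector
        "keyword": words[argmax(concept[0])],        # heaviest word in the concept
        "reference": out_links[argmax(concept[1])],  # heaviest out-link target
        "citation": in_links[argmax(concept[2])],    # heaviest in-link source
    }
```

The function returns a dictionary keyed by nugget name, so one call per cluster yields the annotations for a whole result set.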


REFERENCES:
patent: 5787420 (1998-07-01), Tukey et al.
patent: 5787421 (1998-07-01), Nomiyama
patent: 5819258 (1998-10-01), Vaithyanathan et al.
patent: 5835905 (1998-11-01), Pirolli et al.
patent: 5857179 (1999-01-01), Vaithyanathan et al.
patent: 5864845 (1999-01-01), Voorhees et al.
patent: 5895470 (1999-04-01), Pirolli et al.
patent: 5920859 (1999-07-01), Li
patent: 6012058 (2000-01-01), Fayyad et al.
patent: 6038574 (2000-03-01), Pitkow et al.
patent: 6115708 (2000-09-01), Fayyad et al.
patent: 6122647 (2000-09-01), Horowitz et al.
patent: 6256648 (2001-07-01), Hill et al.
patent: 6298174 (2001-10-01), Lantrip et al.
patent: 6363379 (2002-03-01), Jacobson et al.
patent: 6389436 (2002-05-01), Chakrabarti et al.
patent: 6460036 (2002-10-01), Herz
patent: 6556983 (2003-04-01), Altschuler et al.
Weiss et al., HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering, in Proceedings of Hypertext 1996, Washington, DC, USA, pp. 180-193.*
“Structuring and Visualizing the WWW by Generalised Similarity Analysis”, Chaomei Chen, In proceedings of Hypertext 1997 (Southampton, England, Apr. 1997), pp. 177-186.
“Interactive Clustering for Navigating in Hypermedia Systems”, Sougata Mukherjea, James D. Foley, Scott E. Hudson, ACM Press, 1994.
“From Latent Semantics to Spatial Hypertext: An Integrated Approach”, Chaomei Chen, Mary Czerwinski, In Proceedings of Hypertext 1998, Pittsburgh, PA, USA, 1998.
“HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering”, Ron Weiss, Bienvenido Velez, Mark A. Sheldon, Chanathip Namprempre, Peter Szilagyi, Andrzej Duda, David K. Gifford, In Proceedings of Hypertext 1996, Washington, DC, USA, pp. 180-193.
“Information Retrieval Data Structures & Algorithms”, William B. Frakes, Ricardo Baeza-Yates, Prentice Hall PTR, Upper Saddle River, New Jersey, 1992.
Dhillon, I.S., Modha, D.S., “Concept Decompositions For Large Sparse Text Data Using Clustering”, Jul. 8, 1999, pp. 1-32.
Silverstein, C., Henzinger, M., Marais, H., Moricz, M., “Analysis of a Very Large Alta Vista Query Log”, SRC Technical Note 1998-014, Oct. 26, 1998, pp. 1-17.
Chakrabarti, S., Dom, B., Indyk, P., “Enhanced Hypertext Categorization Using Hyperlinks”, ACM SIGMOD 1998, Seattle, Washington, pp. 1-12.
Kleinberg, Jon M., “Authoritative Sources in a Hyperlinked Environment”, Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998, IBM Research Report RJ 10076, May 1997, pp. 1-33.
Lawrence, Steve and Giles, C. Lee, “Searching the World Wide Web”, Science, vol. 280, Apr. 3, 1998, pp. 98-100.
Larson, Ray R., “Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace”, Proceedings of the 1996 American Society for Information Science Annual Meeting, pp. 1-13.
Chakrabarti, S., Dom, B., Raghavan, P., Rajagopalan, S., Gibson, D., Kleinberg, J., “Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text”, WWW7, 1998, pp. 1-14.
Bradley, P.S. and Fayyad, Usama M., “Refining Initial Points for K-Means Clustering”, ICML, 1998, pp. 91-99.
Chakrabarti, S., Dom, B.E., Kumar, S.R., Raghavan, P., Rajagopalan, S., Tomkins, A., Kleinberg, J.M., and Gibson, D., “Hypersearching the Web”, Scientific American, Jun. 1999, pp. 1-8.
Weiss, R., Velez, B., Sheldon, M.A., Namprempre, C., Szilagyi, P., Duda, A., Gifford, D.K., “HyPursuit: A Hierarchical Network Search Engine That Exploits Content-Link Hypertext Clustering”, ACM Hypertext, 1996, pp. 180-193.
Mukherjea, S., Foley, J.D., Hudson, S.E., “Interactive Clustering for Navigating in Hypermedia Systems”, ACM Hypertext, Sep. 1994, pp. 136-145.
Chen, C., “Structuring and Visualising the Web by Generalised Similarity Analysis”, ACM Hypertext, 1997.
Pirolli, P., Pitkow, J., Rao, R., “Silk From A Sow's Ear: Extracting Usable Structures From the Web”, ACM, SIGCHI Human Factors Comput., 1996.
Chen, C., Czerwinski, M., “From Latent Semant
