Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1999-11-15
2003-01-07
Channavajjala, Srirama (Department: 2177)
active
06505197
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to the field of data mining, and particularly to a software system and associated method for identifying a set of related information on the World Wide Web. More specifically, the present invention relates to the automatic and iterative mining and refinement of patterns of occurrences and relations using a duality concept.
2. Description of Related Art
The World Wide Web (WWW) is a vast and open communications network where computer users can access available data, digitally encoded documents, books, pictures, and sounds. With the explosive growth and diversity of WWW authors, published information is often unstructured and widely scattered. Although search engines play an important role in furnishing desired information to end users, the organization of the information lacks structure and consistency. Web spiders crawl web pages and index them to serve the search engines. As the web spiders visit web pages, they could look for, and learn, pieces of information that would otherwise remain undetected.
Current search engines are designed to identify pages with specific phrases and offer limited search capabilities. For example, search engines cannot search for phrases that relate in a particular way, such as books and authors. Bibliometrics involves the study of the world of authorship and citations. It measures the co-citation strength, which is a measure of the similarity between two technical papers on the basis of their common citations. Statistical techniques are used to compute this measure. In typical bibliometric situations the citations and authorship are explicit and do not need to be mined. One limitation of bibliometrics is that it cannot be used to extract information buried in the text.
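The co-citation measure mentioned above can be illustrated with a minimal sketch. It assumes that co-citation strength is taken as the number of documents that cite both papers, which is one common reading and is not spelled out in the text. The function name `cocitation_counts` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count, for every pair of cited papers, how many documents cite both.

    `citations` maps a citing document to the papers it cites (an assumed
    input shape; the citations are explicit and need no mining).
    """
    counts = Counter()
    for cited in citations.values():
        # Sort so each unordered pair is always stored as (smaller, larger).
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical example: three citing documents, three cited papers.
counts = cocitation_counts({
    "p1": ["A", "B"],
    "p2": ["A", "B", "C"],
    "p3": ["B", "C"],
})
```

Because the input is an explicit citation list, this kind of measure cannot recover relations that appear only in the running text of a page.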
Exemplary bibliometric studies are reported in: R. Larson, “Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace,” Technical report, School of Information Management and Systems, University of California, Berkeley, 1996. http://sherlock.sims.berkeley.edu/docs/asis96/asis96.html; K. McCain, “Mapping Authors in Intellectual Space: A technical Overview,” Journal of the American Society for Information Science, 41(6):433-443, 1990. A Dual Iterative Pattern Relation Expansion (DIPRE) method that addresses the problem of extracting (author, book) relationships from the web is described in S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998.
Another approach to identifying a set of related information on the World Wide Web is the Hyperlink-Induced Topic Search (HITS). HITS is a system that identifies authoritative web pages on the basis of the link structure of web pages. It iteratively identifies good hubs, that is, pages that point to good authorities, and good authorities, that is, pages pointed to by good hub pages. This technique has been extended to identify communities on the web, and to target a web crawler. One of HITS' limitations resides in the link topology of the pattern space, where the hubs and the authorities are of the same kind, i.e., they are all web pages. HITS is not defined over the text of web pages, that is, over phrases that express relations in specific patterns. Exemplary HITS studies are reported in: D. Gibson et al., “Inferring Web Communities from Link Topology,” HyperText, pages 225-234, Pittsburgh, Pa., 1998; J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” Proc. of 9th ACM-SIAM Symposium on Discrete Algorithms, May 1997; R. Kumar, “Trawling the Web for Emerging Cyber-Communities,” published on the WWW at URL: http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html as of Nov. 13, 1999; and S. Chakrabarti et al., “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” Proc. of The 8th International World Wide Web Conference, Toronto, Canada, May 1999.
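The mutual reinforcement between hubs and authorities described above can be sketched as follows. This is a simplified illustration of the general hub/authority iteration, not the cited authors' implementation; the function name `hits`, the fixed iteration count, and the sample link graph are assumptions.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) score dicts for a small link graph.

    `links` maps each page to the pages it points to (hypothetical input).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # A good hub points to good authorities.
        hub = {p: sum(auth[dst] for dst in links.get(p, ())) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: h1 links to a and b, h2 links only to a.
hub, auth = hits({"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []})
```

Note that both score tables are keyed by pages: hubs and authorities are the same kind of object, which is the limitation the paragraph above points out.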
There is therefore a great and still unsatisfied need for a software system and associated method for automatically identifying and mining sets of related information on the World Wide Web, using the duality concept for quality enhancement.
SUMMARY OF THE INVENTION
In accordance with the present invention, a computer program product is provided as an automatic mining system to identify a set of related information on the WWW, with a high degree of confidence, using a duality concept. Duality problems arise, for example, when a user attempts to identify a pair of related phrases such as (book, author); (name, email); (acronym, expansion); or other similar relations. The mining system addresses the duality problems by iteratively refining mutually dependent approximations to their identifications. Specifically, the mining system iteratively refines (i) pairs of terms that are related in a specific way, and (ii) the patterns of their occurrences in web pages, i.e., the ways in which the related phrases are marked in the web pages. The automatic mining system runs in an iterative fashion, continuously and incrementally refining the relations and patterns.
The automatic mining system includes a computer program product such as a software package, which is generally comprised of a database and two identifiers: a relation identifier and a pattern identifier. The database contains the previously identified pairs or sets of relations R_{i-1} that have been identified by the relation identifier, and the set of patterns P_{i-1} that have already been identified by the pattern identifier. Initially, the database begins with small seed sets of relations R_0 and patterns P_0 that are continuously and iteratively broadened by the automatic mining system.
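The interplay between the relation identifier, the pattern identifier, and the seed sets can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: a pattern is assumed to be a literal (prefix, middle, suffix) context around a pair, the database is reduced to two in-memory sets, no confidence filtering is applied, and the names `identify_patterns`, `identify_relations`, `mine`, and the sample pages are all hypothetical.

```python
import re


def identify_patterns(pages, relations, width=4, max_middle=30):
    """Pattern identifier: derive contexts in which known pairs occur."""
    patterns = set()
    middle_re = "(.{1,%d}?)" % max_middle
    for text in pages:
        for a, b in relations:
            finder = re.compile(re.escape(a) + middle_re + re.escape(b))
            for m in finder.finditer(text):
                prefix = text[max(0, m.start() - width):m.start()]
                suffix = text[m.end():m.end() + width]
                if prefix and suffix:  # need context on both sides
                    patterns.add((prefix, m.group(1), suffix))
    return patterns


def identify_relations(pages, patterns):
    """Relation identifier: find new pairs that occur in known patterns."""
    relations = set()
    for prefix, middle, suffix in patterns:
        finder = re.compile(re.escape(prefix) + "(.+?)" + re.escape(middle)
                            + "(.+?)" + re.escape(suffix))
        for text in pages:
            for m in finder.finditer(text):
                relations.add((m.group(1), m.group(2)))
    return relations


def mine(pages, seed_relations, seed_patterns=(), max_iterations=10):
    """Grow R_i and P_i from the seeds R_0 and P_0 until nothing changes."""
    relations, patterns = set(seed_relations), set(seed_patterns)
    for _ in range(max_iterations):
        new_patterns = patterns | identify_patterns(pages, relations)
        new_relations = relations | identify_relations(pages, new_patterns)
        if new_patterns == patterns and new_relations == relations:
            break  # fixed point reached
        patterns, relations = new_patterns, new_relations
    return relations, patterns


# Hypothetical pages and a single (book, author) seed pair.
pages = [
    "<li><b>Foundation</b> by Isaac Asimov</li><li><b>Dune</b> by Frank Herbert</li>",
    "<p>Dune, written by Frank Herbert.</p><p>Neuromancer, written by William Gibson.</p>",
]
relations, patterns = mine(pages, {("Foundation", "Isaac Asimov")})
```

In this toy run the seed pair yields a list-item pattern on the first page, that pattern yields the (Dune, Frank Herbert) pair, and that pair in turn exposes a second pattern on the other page, which yields (Neuromancer, William Gibson). A practical system would also need to score or filter patterns, since an overly general pattern would admit false pairs.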
REFERENCES:
patent: 5745360 (1998-04-01), Leone et al.
patent: 5809499 (1998-09-01), Wong et al.
patent: 5819260 (1998-10-01), Lu et al.
patent: 5832182 (1998-11-01), Zhang et al.
patent: 5857179 (1999-01-01), Vaithyanathan et al.
patent: 5987446 (1999-11-01), Corey et al.
patent: 6044366 (2000-03-01), Graffe et al.
patent: 6101515 (2000-08-01), Wical et al.
patent: 6122647 (2000-09-01), Horowitz et al.
patent: 6278997 (2001-08-01), Agrawal et al.
patent: 0304191 (1988-08-01), None
Krishnapuram, R et al., A fuzzy relative of the k-medoids algorithm with application to web document clustering, Fuzzy system conference proceedings, Aug. 1999, pp. 22-25.*
Arimura, H et al., Text data mining: discovery of important keywords in the cyberspace, Digital Libraries: Research and Practice, 2000 Kyoto conference, Nov. 2000, pp. 220-226.*
Chakrabarti, S. et al., Mining the Web's link structure, Computer, Aug. 1999, pp. 60-67.*
Ullman, J.D., The MIDAS data-mining project at Stanford, database engineering and applications, Aug. 1999 IDEAS international symposium proceedings, pp. 460-464.*
Ahonen, H, et al., Applying data mining techniques for descriptive phrase extraction in digital collections, Research and technology advances in digital library, proceedings, Apr. 1998, pp. 2-11.*
Sergey Brin, Extracting Patterns and relations from the world wide web, The world wide web and databases, International workshop WebDB Mar. 1998, 12 pages.*
R. Larson, “Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace,” Proceedings of the 1996 American Society for Information Science Annual Meeting, also published as a technical report, School of Information Management and Systems, University of California, Berkeley, 1996, which is published on the World Wide Web at URL: http://sherlock.sims.berkeley.edu/docs/asis96/asis96.html.
D. Gibson et al., “Inferring Web Communities from Link Topology,” Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, 1998.
D. Turnbull. “Bibliometrics and the World Wide Web,” Technical Report University of Toronto, 1996.
K. McCain, “Mapping Authors in Intellectual Space: A technical Overview,” Journal of the American Society for Information Science, 41(6):433-443, 1990.
S. Brin, “Extracting Patterns and Relations from the World Wide Web
Sundaresan Neelakantan
Yi Jeonghee
Channavajjala Srirama
Kassatly Samuel A.