System and method for the automatic mining of...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

C707S793000

active

06385629

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to the field of data mining, and particularly to a software system and associated method for identifying a set of related information on the World Wide Web. More specifically, the present invention relates to the automatic and iterative mining of acronyms and their expansions through patterns of occurrences and formation rules using a duality concept.
BACKGROUND OF THE INVENTION
The World Wide Web (WWW) is a vast and open communications network where computer users can access available data, digitally encoded documents, books, pictures, and sounds. With the explosive growth and diversity of WWW authors, published information is oftentimes unstructured and widely scattered. Although search engines play an important role in furnishing desired information to the end users, the organization of the information lacks structure and consistency. Web spiders crawl web pages and index them to serve the search engines. As the web spiders visit web pages, they could look for and learn pieces of information that would otherwise remain undetected.
Current search engines are designed to identify pages with specific phrases and offer limited search capabilities. For example, search engines cannot search for phrases that relate in a particular way, such as books and authors. Bibliometrics involves the study of the world of authorship and citations. It measures the co-citation strength, which is a measure of the similarity between two technical papers on the basis of their common citations. Statistical techniques are used to compute these measures. In typical bibliometric situations the citations and authorship are explicit and do not need to be mined. One limitation of bibliometrics is that it cannot be used to extract information buried in the text.
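The co-citation measure mentioned above can be illustrated with a minimal sketch. It assumes one common reading of the term, in which the strength of a pair of papers is the number of citing documents that cite both; the function name `cocitation_counts` and the input layout are hypothetical and are not taken from the cited studies.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count, for every pair of cited papers, how many citing papers cite both.

    citations: dict mapping a citing paper to the set of papers it cites
    (an assumed input layout chosen for illustration).
    """
    counts = Counter()
    for cited in citations.values():
        # Sort so each unordered pair is always stored under the same key.
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] += 1
    return counts
```

Note that the input here is an explicit citation table, which matches the point made above: bibliometric data is already structured and does not need to be mined from running text.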
Exemplary bibliometric studies are reported in: R. Larson, “Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace,” Technical report, School of Information Management and Systems, University of California, Berkeley, 1996. http://sherlock.sims.berkeley.edu/docs/asis96/asis96.html; K. McCain, “Mapping Authors in Intellectual Space: A technical Overview,” Journal of the American Society for Information Science, 41(6):433-443, 1990. A Dual Iterative Pattern Relation Expansion (DIPRE) method that addresses the problem of extracting (author, book) relationships from the web is described in S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998.
Another approach to identifying a set of related information on the World Wide Web is the Hyperlink-Induced Topic Search (HITS). HITS is a system that identifies authoritative web pages on the basis of the link structure of web pages. It iteratively identifies good hubs, that is, pages that point to good authorities, and good authorities, that is, pages pointed to by good hub pages. This technique has been extended to identify communities on the web, and to target a web crawler. One of HITS' limitations resides in the link topology of the pattern space, where the hubs and the authorities are of the same kind, i.e., they are all web pages. HITS is not defined in the text of web pages in the form of phrases containing relations in specific patterns.
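The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. This is a minimal illustration only, assuming a link graph given as a dictionary, a fixed iteration count, and Euclidean normalization; the function name `hits` and its parameters are choices made for this sketch.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a small link graph.

    links: dict mapping a page to the list of pages it links to
    (an assumed input layout chosen for illustration).
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

Both score vectors range over the same set of pages, which is the limitation noted above: hubs and authorities are of the same kind, and nothing in the iteration looks at the text of a page.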
Exemplary HITS studies are reported in: D. Gibson et al., “Inferring Web Communities from Link Topology,” HyperText, pages 225-234, Pittsburgh, Pa., 1998; J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” Proc. of 9th ACM-SIAM Symposium on Discrete Algorithms, May 1997; R. Kumar, “Trawling the Web for Emerging Cyber-Communities,” published on the WWW at URL: http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html as of Nov. 13, 1999; and S. Chakrabarti et al. “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” Proc. of The 8th International World Wide Web Conference, Toronto, Canada, May 1999.
The problem of information organization and lack of structure and consistency is further exacerbated in technical and other fields that are acronym driven. The diversity and non-uniformity in the use of acronyms often obscure the understanding of the subject matter being described, unless clear expansions are provided to the readers.
There is therefore a great and still unsatisfied need for a software system and associated method for automatically identifying and mining acronym-expansion pairs on the World Wide Web, using the duality concept and strict formation rules for quality enhancement.
SUMMARY OF THE INVENTION
In accordance with the present invention, a computer program product is provided as an automatic mining system to identify a set of related information on the WWW using a duality concept. Duality problems arise, for example, when a user attempts to identify a pair of related phrases such as (book, author); (name, email); (acronym, expansion); or other similar relations. The mining system addresses the duality problems by iteratively refining mutually dependent approximations to their identifications. Specifically, the mining system iteratively refines (i) pairs of phrases related in a specific way; (ii) the patterns of their occurrences in web pages, i.e., the ways in which the related phrases are marked in the web pages; and (iii) the formation rules.
In one embodiment, the automatic mining system addresses a particular paradigmatic duality problem, namely identifying (acronym, expansion) pairs in terms of the patterns of their occurrences in the web pages. The solution to this problem involves two mutually dependent duality problems: the first is the duality between the related pairs and their patterns, and the second is the duality between the related pairs and the acronym formation rules. The automatic mining system runs in an iterative fashion, continuously and incrementally refining the sets of (acronym, expansion) pairs, patterns, and formation rules.
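The duality between pairs and patterns can be sketched as a loop in which known pairs yield patterns and known patterns yield new pairs. The following is a minimal illustration, not the patented implementation: all function names are hypothetical, a pattern is reduced to the punctuation between and after the two phrases, and a single fixed formation rule (first letters of the expansion words) stands in for the iteratively refined rule set.

```python
import re


def matches_formation_rule(acronym, expansion):
    # Assumed rule: the acronym is the initial letters of the expansion words.
    initials = "".join(word[0] for word in expansion.split())
    return initials.upper() == acronym.upper()


def find_patterns(pairs, pages):
    """Learn (middle, suffix) punctuation contexts around known pairs."""
    patterns = set()
    for acronym, expansion in pairs:
        regex = re.compile(
            re.escape(expansion) + r"(\W{1,3})" + re.escape(acronym) + r"(\W)"
        )
        for page in pages:
            for m in regex.finditer(page):
                patterns.add((m.group(1), m.group(2)))
    return patterns


def find_pairs(patterns, pages):
    """Apply known patterns to propose pairs, filtered by the formation rule."""
    pairs = set()
    for middle, suffix in patterns:
        regex = re.compile(
            r"((?:[A-Z][a-z]+ ?)+)" + re.escape(middle)
            + r"([A-Z]{2,})" + re.escape(suffix)
        )
        for page in pages:
            for m in regex.finditer(page):
                acronym = m.group(2)
                # Keep only as many trailing words as the acronym has letters.
                words = m.group(1).split()[-len(acronym):]
                candidate = " ".join(words)
                if matches_formation_rule(acronym, candidate):
                    pairs.add((acronym, candidate))
    return pairs


def mine(seed_pairs, pages, rounds=5):
    """Alternate pattern discovery and pair discovery until nothing new appears."""
    pairs, patterns = set(seed_pairs), set()
    for _ in range(rounds):
        patterns |= find_patterns(pairs, pages)
        new_pairs = find_pairs(patterns, pages) - pairs
        if not new_pairs:
            break
        pairs |= new_pairs
    return pairs, patterns
```

For example, starting from the single seed pair ("WWW", "World Wide Web") on pages that use both a parenthesized style and a comma-delimited style, the loop first learns the parenthesized pattern, uses it to find a second pair, and then learns the comma-delimited pattern from that second pair. The formation rule check is what keeps loosely matched candidates out of the result.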
The automatic mining system is generally comprised of a database and three identifiers: a formation rule identifier, an acronym-expansion pair identifier, and a pattern identifier. The database contains the (acronym, expansion) pairs R_{i-1} that have already been identified by the acronym-expansion pair identifier; the patterns P_{i-1} that have already been identified by the pattern identifier; and the sets of formation rules that have already been identified by the formation rule identifier. Initially, the database begins with small seed sets of (acronym, expansion) pairs R_0, patterns P_0, and formation rules E_0, that are continuously and iteratively broadened by the automatic mining system.
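The database state described above can be pictured as three growing sets plus an iteration counter. This is a minimal sketch under assumed names: `MiningDatabase`, its fields, and the `broaden` method are hypothetical, and the string encodings of patterns and rules are placeholders rather than the representation used by the invention.

```python
from dataclasses import dataclass, field


@dataclass
class MiningDatabase:
    """Holds the sets identified so far (hypothetical layout)."""
    pairs: set = field(default_factory=set)     # (acronym, expansion) pairs, R_i
    patterns: set = field(default_factory=set)  # occurrence patterns, P_i
    rules: set = field(default_factory=set)     # formation rules, E_i
    iteration: int = 0

    def broaden(self, new_pairs=(), new_patterns=(), new_rules=()):
        """Merge one iteration's findings and return how many items were new."""
        before = len(self.pairs) + len(self.patterns) + len(self.rules)
        self.pairs |= set(new_pairs)
        self.patterns |= set(new_patterns)
        self.rules |= set(new_rules)
        self.iteration += 1
        return len(self.pairs) + len(self.patterns) + len(self.rules) - before
```

A database seeded with small sets corresponds to iteration 0; each call to `broaden` moves from the sets at iteration i-1 to those at iteration i, and a return value of zero is a natural signal that the mining has stopped finding anything new.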


REFERENCES:
patent: 5745360 (1998-04-01), Leone et al.
patent: 5819260 (1998-10-01), Lu et al.
patent: 5857179 (1999-01-01), Vaithyanathan et al.
R. Larson, “Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace,” Proceedings of the 1996 American Society for Information Science Annual Meeting, also published as a technical report, School of Information Management and Systems, University of California, Berkeley, 1996, which is published on the World Wide Web at URL: http://sherlock.sims.berkeley.edu/docs/asis96/asis96.html.
D. Gibson et al., “Inferring Web Communities from Link Topology,” Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, 1998.
D. Turnbull, “Bibliometrics and the World Wide Web,” Technical Report University of Toronto, 1996.
K. McCain, “Mapping Authors in Intellectual Space: A technical Overview,” Journal of the American Society for Information Science, 41(6):433-443, 1990.
S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998.
R. Agrawal et al., “Fast Algorithms for Mining Association Rules,” Proc. of the 20th Int'l Conference on VLDB, Santiago, Chile, Sep. 1994.
R. Agrawal et al., Mining Association Rules Between Sets of Items in Large Databases, Proceedings of ACM SIGMOD Conference on Management of Data, pp. 207-216, Washi
