Term-level text with mining with taxonomies

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06442545

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to extraction of information from databases, and specifically to text mining in unstructured databases.
BACKGROUND OF THE INVENTION
Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large quantities of data, and on the discovery of interesting patterns within them. While most work on KDD has been concerned with analyzing structured databases, there has been relatively little development of methods for analyzing the large quantity of information that is currently available only in unstructured, text-based form. An example of work in this latter category is described in “Mining Text Using Keyword Distributions,” by Ronen Feldman, Ido Dagan, and Haym Hirsh, Proceedings of the 1995 Workshop on Knowledge Discovery in Databases, which is incorporated herein by reference. Other work is described in “Finding Associations in Collections of Text,” by Ronen Feldman and Haym Hirsh, Machine Learning and Data Mining: Methods and Applications, edited by R. S. Michalski, I. Bratko, and M. Kubat, John Wiley & Sons, Ltd., 1997, which is also incorporated herein by reference.
A paper entitled “Technology Text Mining, Turning Information Into Knowledge: A White Paper from IBM,” edited by Daniel Tkach, Feb. 17, 1998, which is incorporated herein by reference, describes a program called IBM Intelligent Miner for Text, which extracts terms from unstructured text. “Terms,” in the context of the present application, are single words, or short strings of highly-related, linked words, such as “Biotechnology,” “New York Stock Exchange,” “Free market,” or “Health programs.” “Term extraction,” in the context of the present application, refers to the process of finding terms in a document that have relevance to the content of the document.
InQuery 5.0, produced by Sovereign Hill Software, uses term extraction to identify names of companies and people in one or more documents. The extracted terms are used to enable a search engine to find desired documents responsive to a user's query.
A paper entitled “Text Mining at the Term Level,” by Feldman et al., Proceedings of the 1998 Workshop on Knowledge Discovery in Databases, August, 1998, which is incorporated herein by reference, the authors of which are the inventors of the present invention, describes a method for extracting terms from a document in a database, filtering out unimportant terms, and subsequently performing text mining in the database. “Text mining,” in the context of the present application, refers to a substantially automated process of extracting useful information from a collection of textual data.
Standard text mining systems typically process documents which have been “categorized,” i.e., manually or automatically assigned keywords (“tags”) that identify their content. Automatic tagging is generally performed by matching words in a document with words from a predetermined list.
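By way of illustration only, automatic tagging against a predetermined list might be sketched as follows. The keyword list, the function name auto_tag, and the normalization rule (lowercasing and dropping punctuation) are assumptions made for this sketch and are not taken from any system cited above.

```python
import re

# Hypothetical predetermined keyword list; the cited systems do not publish one.
KEYWORDS = {"biotechnology", "free market", "health programs"}


def auto_tag(text, keywords=KEYWORDS):
    """Return, in sorted order, the listed keywords that occur in the text."""
    # Normalize to lowercase words separated by single spaces.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    # Pad with spaces so that only whole-word matches count.
    padded = f" {normalized} "
    return sorted(k for k in keywords if f" {k} " in padded)


tags = auto_tag("Free market reforms reshaped biotechnology funding.")
```

In this sketch a multi-word keyword such as “free market” matches only when its words appear adjacent and in order.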
SUMMARY OF THE INVENTION
It is an object of some aspects of the present invention to provide improved methods for text mining.
It is a further object of some aspects of the present invention to provide improved methods for comparing multiple documents in a database.
It is yet a further object of some aspects of the present invention to provide improved methods for extracting information from multiple documents in a database.
In preferred embodiments of the present invention, a system for mining text in a database comprises a memory, which stores a hierarchical taxonomy of terms, and a processor, which uses the taxonomy to perform effective mining of the database. Preferably, the system enables quantitative, content-based, textual analysis of a large number of documents in the database, in order to present relationships between two or more entries in the taxonomy.
Preferably, a user provides an input indicating terms of interest (some or all of which may be in the taxonomy), and the processor subsequently discovers relationships between terms in the user's input and terms in the taxonomy. Typically, relationships discovered during text mining comprise co-occurrences of two terms in a single document. Preferably, if the user “selects” one of the relationships generated by the text analysis, the system displays relevant portions of original documents in the database which are associated with the discovered relationship.
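By way of illustration only, the discovery of co-occurrence relationships might be sketched as follows. The sketch assumes that each document has already been reduced to a set of extracted terms; the function name cooccurrences and the sample documents are hypothetical. Document identifiers are retained for each relationship so that the associated documents can be displayed when the user selects it.

```python
from collections import defaultdict


def cooccurrences(docs, query_terms, taxonomy_terms):
    """Map (query_term, taxonomy_term) to the ids of documents containing both."""
    hits = defaultdict(list)
    for doc_id, terms in docs.items():
        for q in query_terms:
            if q not in terms:
                continue
            for t in taxonomy_terms:
                if t != q and t in terms:
                    hits[(q, t)].append(doc_id)
    return dict(hits)


# Hypothetical documents, each already reduced to its extracted terms.
docs = {
    "d1": {"Clinton", "France", "trade"},
    "d2": {"Clinton", "Germany"},
    "d3": {"France", "Germany"},
    "d4": {"Clinton", "France"},
}
pairs = cooccurrences(docs, ["Clinton"], ["France", "Germany"])
```

The length of each list gives a simple count of the strength of the relationship, and the list itself identifies the documents to show the user.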
In some preferred embodiments of the present invention, terms in the term taxonomy (“taxonomy terms”) can be edited by the user prior to text mining, and the taxonomy can be modified automatically by the processor and/or interactively with the user, responsive to results of the text mining. Typically, interactive editing of the term taxonomy responsive to results of the text mining yields improved results from a subsequent iteration of text mining, and these improved results may themselves be used to modify the taxonomy again. In this manner, the user may derive information of increased value from each iteration of text mining and term taxonomy modification.
In some preferred embodiments of the present invention, the taxonomy generally has a Directed Acyclic Graph (DAG) structure or a tree structure, and comprises groups of related terms (siblings) stored in the hierarchy one level below respective parent entries. For example, under a parent entry, “Countries,” the taxonomy may contain as daughter entries the list of member nations of the United Nations. (The parent entry “Countries” may itself also be a member of a set of siblings in the taxonomy, under a “grandparent” entry, “Political entities.”) Prior to text mining, in this example, the user may add the name of a new member nation, or delete the name of a country whose name has changed. Following text mining of the database, and utilizing results derived therefrom, the user may choose to further edit the term taxonomy (for instance, by adding a new country name or variation thereof).
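By way of illustration only, a taxonomy having a DAG structure that supports editing might be sketched as follows. The class name Taxonomy, its methods, and the placeholder entries “Oldname” and “Newname” are assumptions made for this sketch. The cycle check is one possible way to keep the structure acyclic.

```python
class Taxonomy:
    """A minimal DAG of entries; each entry maps to its set of daughter entries."""

    def __init__(self):
        self.children = {}

    def descendants(self, entry):
        """Return every entry reachable below the given entry."""
        seen, stack = set(), [entry]
        while stack:
            for child in self.children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    def add(self, parent, child):
        """Add child under parent, refusing edges that would create a cycle."""
        if parent == child or parent in self.descendants(child):
            raise ValueError("edge would create a cycle")
        self.children.setdefault(parent, set()).add(child)
        self.children.setdefault(child, set())

    def remove(self, parent, child):
        """Detach child from parent, if present."""
        self.children.get(parent, set()).discard(child)


tax = Taxonomy()
tax.add("Political entities", "Countries")
for name in ("France", "Germany", "Oldname"):
    tax.add("Countries", name)

# Editing prior to text mining: a country whose name has changed.
tax.remove("Countries", "Oldname")
tax.add("Countries", "Newname")
```

Because an entry may be added under more than one parent, the same class also covers the simpler tree case.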
In some preferred embodiments, the taxonomy has multiple levels, and a broad range of terms in each level, so that the user can narrow or broaden a query prior to an iteration of text mining, in order to optimize the results generated by the processor. For example, if the user would like to investigate President Clinton's foreign policy, she might enter an initial query specifying “Clinton” and all daughter entries of the node “Countries.” To broaden the query, “Countries” could be replaced by “Political entities,” so that a news article, containing the words “Berlin” and “Paris,” but not “Germany” and “France,” would also generate a positive response to the query. Alternatively, to narrow the query, the user could specify a taxonomy node “G7 countries,” instead of “Countries.” In general, a rich, multilevel taxonomy enables the user to enter queries with a desired level of specificity, and to thereby obtain information most relevant to her needs.
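By way of illustration only, broadening and narrowing a query by choosing a higher or lower taxonomy node might be sketched as follows. The taxonomy contents, including the “Cities” node and the placement of “G7 countries” under “Countries,” are assumptions made for this sketch, as are the function names expand and matches.

```python
# Hypothetical taxonomy fragment: each node maps to its daughter entries.
TAXONOMY = {
    "Political entities": ["Countries", "Cities"],
    "Countries": ["G7 countries", "Colombia"],
    "G7 countries": ["France", "Germany"],
    "Cities": ["Berlin", "Paris"],
}


def expand(node, taxonomy=TAXONOMY):
    """Return the node together with every term below it."""
    terms, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in terms:
            terms.add(current)
            stack.extend(taxonomy.get(current, []))
    return terms


def matches(doc_terms, anchor, node):
    """True if the document mentions the anchor term and any term under node."""
    return anchor in doc_terms and bool(expand(node) & set(doc_terms))


article = {"Clinton", "Berlin", "Paris"}
narrow = matches(article, "Clinton", "Countries")            # False
broad = matches(article, "Clinton", "Political entities")    # True
```

Under these assumptions the article mentioning only “Berlin” and “Paris” is missed by the “Countries” query but found once the query is broadened to “Political entities.”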
In a preferred embodiment, the processor prompts the user to refine the query prior to mining of the database's text, in order to optimize the results generated by the processor. For example, if the user enters a query including the words “Colombia” and “Venezuela,” the processor preferably examines the taxonomy, determines that the two terms are daughter entries of a parent entry, “South American countries,” and asks the user whether the two specified terms should be replaced by the names of all of the countries in South America listed in the taxonomy. Alternatively or additionally, the processor examines daughter entries of “Colombia” and “Venezuela,” and asks the user whether some or all of the daughter entries (for instance, names of cities or politicians) should be added to the query.
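By way of illustration only, the two kinds of refinement prompt described above might be derived from the taxonomy as sketched below. The taxonomy contents and the function names suggest_generalization and suggest_additions are assumptions made for this sketch; how the suggestions are presented to the user is left open.

```python
# Hypothetical taxonomy fragment: each entry maps to its daughter entries.
TAXONOMY = {
    "South American countries": ["Colombia", "Venezuela", "Brazil"],
    "Colombia": ["Bogota"],
    "Venezuela": ["Caracas"],
}


def suggest_generalization(query, taxonomy=TAXONOMY):
    """Return (parent, daughters) for each parent whose daughters include all query terms."""
    terms = set(query)
    if len(terms) < 2:
        return []  # a shared parent is only meaningful for two or more terms
    return [(p, list(kids)) for p, kids in taxonomy.items() if terms <= set(kids)]


def suggest_additions(query, taxonomy=TAXONOMY):
    """Return daughter entries of the query terms that could be added to the query."""
    return [kid for term in query for kid in taxonomy.get(term, [])]


query = ["Colombia", "Venezuela"]
broader = suggest_generalization(query)
extras = suggest_additions(query)
```

Each returned suggestion would be offered to the user as a question, and the query changed only if the user accepts it.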
In preferred embodiments of the present invention, text mining typically includes determining relationships among terms found in the database which relate to the user's query. Preferably, according to some preferred embodiments of the present invention, the processor subsequently uses these discovered relationships in order to suggest modifications to th
