Determining trends using text mining

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

active

06532469

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to knowledge discovery in collections of data, and specifically to text mining.
BACKGROUND OF THE INVENTION
In recent years, the volume of text documents available on computers and computer networks has been growing rapidly. It is virtually impossible to read all the available documents containing information of importance on a given subject. In order to find desired information, search engines have been developed that provide a user with documents mentioning selected words or terms. The user may use Boolean patterns with “and,” “or” and “not” terms to define the scope of the desired documents more precisely. However, the user cannot always define precisely which documents or keyword combinations are desired. In addition, search engines do not provide an integrated picture of the distribution and impact of given terms in an entire corpus of documents.
Text mining is used to find hidden patterns in large textual collections. Text mining tools provide a human-tangible description of the information included in the textual collection. Because the amount of information is so large, a crucial feature of text mining tools is the way the information is organized and/or displayed. To limit the amount of information that a user must digest, it is common to define a context group which defines the information of interest to the particular user. Normally, the context group includes those documents which include one or more terms from a user-defined set.
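As an illustration of the context group described above, the following is a minimal sketch, assuming whitespace tokenization and single-word terms. The function name `context_group` and the sample documents are hypothetical and are not taken from the patent.

```python
def context_group(documents, terms):
    """Return the documents that mention at least one term of interest.

    Assumption: terms are single words and documents are split on whitespace.
    """
    wanted = {t.lower() for t in terms}
    return [doc for doc in documents if wanted & set(doc.lower().split())]


# Hypothetical sample corpus and user-defined term set.
docs = [
    "Merger talks between the two banks continue",
    "New vaccine trial results published",
    "Banks report record profits",
]
group = context_group(docs, {"banks", "merger"})
```

Here `group` holds the first and third documents, since only they mention a term from the user-defined set.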
A central tool in text mining is visualization of the complex patterns that are discovered. One such visualization approach is described, for example, in an article by Feldman R., Klosgen W., and Zilberstien A., entitled “Visualization Techniques to Explore Data Mining Results for Document Collections,” in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (1997), pp. 16-23, which is incorporated herein by reference. This article describes a concept relationship analysis in which a set of concepts (or terms) is searched for in a corpus of textual data formed of a plurality of documents. The concept relationship analysis searches for groups of concepts which appear together in relatively large numbers of documents, and these concepts are displayed together.
One method of representing concept relationships is by displaying context graphs. In context graphs, the concepts (or terms) which appear together in large numbers of documents are designated by nodes. Each two nodes are connected by an edge which has a weight which is equal to the number of documents in which the terms of both nodes appear together. In order to limit the amount of data displayed, only edges which have a weight above a predetermined threshold are displayed. In some context graphs, the concepts which appear in nodes are chosen from a list of interesting terms defined by the user.
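The context graph described above can be sketched as follows. This is an illustrative sketch only: it assumes each document has already been reduced to a set of terms, and the name `context_graph`, the `threshold` parameter, and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def context_graph(doc_terms, threshold):
    """Build a context graph as a mapping {(term_a, term_b): weight}.

    doc_terms: iterable of sets of terms, one set per document.
    The weight of an edge is the number of documents containing both terms.
    Only edges whose weight is above `threshold` are kept.
    """
    weights = Counter()
    for terms in doc_terms:
        # Sorting gives each unordered pair a single canonical key.
        for pair in combinations(sorted(terms), 2):
            weights[pair] += 1
    return {pair: w for pair, w in weights.items() if w > threshold}


# Hypothetical term sets, one per document.
doc_terms = [
    {"ibm", "microsoft"},
    {"ibm", "microsoft", "oracle"},
    {"ibm", "oracle"},
    {"microsoft", "ibm"},
]
graph = context_graph(doc_terms, threshold=1)
```

With a threshold of 1, the edge between "microsoft" and "oracle" (weight 1) is dropped, leaving the two edges with weights 3 and 2.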
In many cases, the corpus of documents is formed of several groups of documents, for example, documents from different dates, and it is desired to apprehend concept relationships as they develop in time. An article by Lent B., Agrawal R., and Srikant R., entitled “Discovering Trends in Text Databases,” in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (1997), pp. 227-230, which is incorporated herein by reference, describes a method of detecting trends in textual collections formed of documents with timestamps, which are partitioned into time groups according to a selected granularity. The textual collection is mined for a group of combinations of words (referred to as phrases) which appear in the documents of the collection. Each combination is given frequency-of-occurrence values for each time group. A user requests to view the frequencies of occurrence of those combinations for which the occurrences follow a desired pattern. However, this method does not give the user any feel for the development of trends in the textual documents as a whole.
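The general idea of per-time-group phrase frequencies can be sketched as below. This is not the algorithm of the cited article: it assumes the phrases are supplied in advance rather than mined, uses a fixed granularity of one year, matches phrases by simple substring search, and takes "strictly increasing" as the desired pattern. All names and sample data are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def phrase_frequencies(documents, phrases):
    """Return {year: Counter(phrase -> number of documents containing it)}.

    documents: iterable of (date, text) pairs.
    Assumption: time groups are calendar years; matching is by substring.
    """
    freq = defaultdict(Counter)
    for stamp, text in documents:
        lowered = text.lower()
        for phrase in phrases:
            if phrase in lowered:
                freq[stamp.year][phrase] += 1
    return freq


def rising_phrases(freq, phrases):
    """Return the phrases whose frequency strictly increases from year to year."""
    years = sorted(freq)
    rising = []
    for phrase in phrases:
        series = [freq[y][phrase] for y in years]
        if all(a < b for a, b in zip(series, series[1:])):
            rising.append(phrase)
    return rising


# Hypothetical timestamped documents.
documents = [
    (date(1996, 3, 1), "Early work on data mining"),
    (date(1997, 5, 9), "Data mining meets text mining"),
    (date(1997, 8, 2), "Advances in data mining"),
    (date(1997, 9, 1), "Expert systems revisited"),
    (date(1996, 1, 5), "Expert systems in practice"),
    (date(1996, 7, 7), "Expert systems survey"),
]
phrases = ["data mining", "expert systems"]
freq = phrase_frequencies(documents, phrases)
rising = rising_phrases(freq, phrases)
```

In this sample, "data mining" goes from one document in 1996 to two in 1997 and is reported as rising, while "expert systems" falls from two to one and is not.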
In an article entitled “Trend graphs: Visualizing the evolution of concept relationships in large document collections,” by Feldman R., Aumann Y., Zilberstien A., and Ben-Yehuda Y., in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (1998), which is incorporated herein by reference, a graphical tool is described for analyzing and visualizing dynamic changes in concept relationships over time.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide methods and apparatus for displaying trends that are discovered in large collections of information.
In some aspects of the present invention, the trends relate to appearances of terms found by text mining in groups of documents.
It is another object of some aspects of the present invention to provide methods and apparatus for displaying the evolution of concept relationships in groups of documents.
It is another object of some aspects of the present invention to provide methods and apparatus for displaying differences between patterns of term appearances in different groups of documents.
It is still another object of some aspects of the present invention to provide methods and apparatus for determining major changes in patterns of term appearances in groups of documents.
In preferred embodiments of the present invention, a corpus of documents is divided into sub-groups defined by a differentiating parameter, such as the dates of the documents, or their origin. For each sub-group of documents, a separate context graph is prepared, and the relationship between the graphs is calculated.
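A minimal sketch of dividing a corpus by a differentiating parameter and preparing one context graph per sub-group might look like the following. It assumes each document is a dictionary carrying a set of terms plus metadata; the field names (`year`, `origin`, `terms`) and the function `graphs_by_subgroup` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def graphs_by_subgroup(documents, key):
    """Return {sub-group label: Counter((term_a, term_b) -> co-occurrence count)}.

    documents: iterable of dicts with a 'terms' set plus arbitrary metadata.
    key: function mapping a document to its sub-group label, i.e. the
         differentiating parameter (date, origin, and so on).
    """
    graphs = defaultdict(Counter)
    for doc in documents:
        for pair in combinations(sorted(doc["terms"]), 2):
            graphs[key(doc)][pair] += 1
    return graphs


# Hypothetical documents with two possible differentiating parameters.
documents = [
    {"year": 1997, "origin": "US", "terms": {"ibm", "oracle"}},
    {"year": 1997, "origin": "UK", "terms": {"ibm", "oracle", "sun"}},
    {"year": 1998, "origin": "US", "terms": {"ibm", "sun"}},
]
by_year = graphs_by_subgroup(documents, key=lambda d: d["year"])
by_origin = graphs_by_subgroup(documents, key=lambda d: d["origin"])
```

Passing the parameter as a `key` function keeps the graph construction independent of which parameter is chosen, so the same corpus can be split by date or by origin without other changes.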
In some preferred embodiments of the present invention, the differentiating parameter defines an order of the context graphs. The context graphs are preferably displayed sequentially, either one after another or one above the other. Each graph is preferably displayed with indications which show the differences between the present graph and the previous graph. Preferably, each edge in the graph is marked to indicate a difference between its weight in the present graph and its weight in the previous graph. Alternatively or additionally, each edge is marked to indicate the difference between its weight in the present graph and its average weight in a predetermined number of previous graphs.
Preferably, the edges are marked graphically, for example, using different colors, widths, and/or lengths to indicate the weight differences. In a preferred embodiment of the present invention, four indications are used for the following groups of edges: new edges, edges with increased weights, edges with decreased weights, and edges with substantially stable weights.
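The four edge indications can be sketched as a simple classification against a baseline. This is an illustrative sketch under stated assumptions: graphs are mappings from edges to weights, the baseline is the average weight over the supplied previous graphs (a one-element list gives a comparison with the previous graph only), and "substantially stable" is modelled by a hypothetical `tolerance` parameter that the patent text does not specify.

```python
def classify_edges(current, previous_graphs, tolerance=0.0):
    """Label each edge of `current` as new, increased, decreased, or stable.

    current: {edge: weight} for the present graph.
    previous_graphs: list of {edge: weight} for earlier graphs.
    tolerance: assumed margin within which a weight counts as stable.
    """
    labels = {}
    for edge, weight in current.items():
        history = [g.get(edge, 0) for g in previous_graphs]
        baseline = sum(history) / len(history) if history else 0
        if baseline == 0:
            # The edge did not appear in any of the earlier graphs.
            labels[edge] = "new"
        elif weight - baseline > tolerance:
            labels[edge] = "increased"
        elif baseline - weight > tolerance:
            labels[edge] = "decreased"
        else:
            labels[edge] = "stable"
    return labels


# Hypothetical previous and present graphs.
previous = {("ibm", "oracle"): 5, ("ibm", "sun"): 3, ("oracle", "sun"): 4}
current = {
    ("ibm", "oracle"): 8,
    ("ibm", "sun"): 3,
    ("oracle", "sun"): 2,
    ("ibm", "sap"): 6,
}
labels = classify_edges(current, [previous])

# Comparison against the average of two earlier graphs: (2 + 6) / 2 = 4.
averaged = classify_edges({("a", "b"): 4}, [{("a", "b"): 2}, {("a", "b"): 6}])
```

A display layer could then map each label to a color or line width. Edges that vanished from the present graph are not labelled here, since only the four groups named above are covered.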
In some preferred embodiments of the present invention, the differentiating parameter is the date of the documents. Preferably, all the documents from a single period are considered to belong to a single sub-group. The periods may be of substantially any length, e.g., from minutes to years, according to a user selection. Alternatively or additionally, the differentiating parameter comprises the origins of the documents, such as the authors, editors, countries of origin or the original languages of the documents. Further alternatively or additionally, substantially any other parameter may be used, such as the length of a document, or the average salary or number of employees of the company mentioned most frequently in a document.
In a preferred embodiment of the present invention, the context graphs are displayed such that all nodes that are common to two or more of the graphs appear in substantially the same relative locations in the graphs. Therefore, the layout of the displayed form of the context graphs is prepared after all the nodes of all the graphs are known. Alternatively, the locations of the nodes and/or the distances between the nodes are used to indicate the importance of the terms of the nodes. In such cases, animation techniques are preferably used to aid the user to follow the changes in the positions of the nodes.
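One way to keep common nodes in the same place is to compute a single layout over the union of all nodes once every graph is known. The sketch below places nodes evenly on a unit circle; the circular arrangement and the name `shared_layout` are assumptions for illustration, as the text above does not prescribe a particular layout algorithm.

```python
import math


def shared_layout(graphs):
    """Return {term: (x, y)} with one fixed position per term across all graphs.

    graphs: list of {(term_a, term_b): weight}.
    Assumption: nodes are spread evenly on a unit circle in alphabetical order.
    """
    nodes = sorted({term for g in graphs for edge in g for term in edge})
    n = len(nodes)
    return {
        term: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, term in enumerate(nodes)
    }


# Hypothetical sequence of two context graphs.
graphs = [
    {("ibm", "oracle"): 5, ("ibm", "sun"): 3},
    {("ibm", "oracle"): 8, ("oracle", "sap"): 2},
]
positions = shared_layout(graphs)
```

Each graph is then drawn using `positions`, so a term such as "oracle" occupies the same coordinates in every graph in which it appears.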
In some preferred embodiments of the pres
