Reexamination Certificate
2001-04-27
2004-10-12
Corrielus, Jean M. (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000
active
06804688
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to detecting and tracking evolution of new events and/or classes of documents in a large database, and more particularly relates to a method, a system, and a program product for detecting and tracking the evolution of the new events and/or classes of the documents in a very large database by simultaneously taking into account a temporal parameter such as time, a date, or a year and any combinations thereof in a vector modeled document.
BACKGROUND OF THE ART
Recent database systems must handle increasingly large amounts of data, such as news data, client information, and stock data. Users of such databases find it difficult to search for desired information quickly, effectively, and with sufficient accuracy. Therefore, timely, accurate, and inexpensive detection of new topics and/or events in large databases may provide very valuable information for many types of businesses including, for example, stock control, futures and options trading, news agencies which need to dispatch a reporter quickly but cannot afford to post reporters worldwide, and businesses based on the Internet or other fast-paced activities which need to know major new information about competitors in order to succeed.
Conventionally, detection and tracking of new events in enormous databases is expensive, elaborate, and time-consuming work, because a searcher of the database usually needs to hire extra persons to monitor the database.
Recent detection and tracking methods used for search engines mostly use a vector model for data in the database in order to cluster the data. These conventional methods generally construct a vector q = (kwd_1, kwd_2, ..., kwd_N) corresponding to the data in the database. The vector q is defined as a vector whose dimension equals the number of attributes, such as kwd_1, kwd_2, ..., kwd_N, which are attributed to the data. The most commonly used attributes are keywords, i.e., single keywords, phrases, and names of person(s) or place(s). Usually, a binary model is used to create the vector q mathematically, in which kwd_1 is set to 0 when the data do not include kwd_1, and kwd_1 is set to 1 when the data include kwd_1. Sometimes, a weight factor is combined with the binary model to improve the accuracy of the search. Such a weight factor includes, for example, the number of times the keywords appear in the data.
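The binary model and the appearance-count weight factor described above may be illustrated by the following minimal sketch. The function names, the whitespace tokenization, and the sample keywords are assumptions made for illustration only and are not taken from any of the conventional systems.

```python
from collections import Counter

def binary_vector(text, keywords):
    """Binary model: component is 1 if the keyword occurs in the text, else 0."""
    tokens = set(text.lower().split())  # naive whitespace tokenization (assumption)
    return [1 if kw in tokens else 0 for kw in keywords]

def weighted_vector(text, keywords):
    """Weighted variant: component is the number of times the keyword appears."""
    counts = Counter(text.lower().split())
    return [counts[kw] for kw in keywords]

# Hypothetical attribute list kwd_1 .. kwd_N and a sample document.
keywords = ["stock", "merger", "election"]
doc = "Stock prices rose after the merger as stock traders reacted"

print(binary_vector(doc, keywords))    # [1, 1, 0]
print(weighted_vector(doc, keywords))  # [2, 1, 0]
```

In this sketch the weighted vector reduces to the binary vector when every count is clipped to 1, which is one simple way of combining the two models.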
In such a vector model of the database, the data are conventionally first clustered based on the keywords. The clustering procedure mostly uses the scalar product of the vectors q. Each vector corresponding to data in the database is assigned to a cluster defined by a predetermined range of the scalar product. The clusters are then further clustered using a date/time stamp attributed to the data in order to detect and track new events. The conventional search method thus uses a two-step clustering process for detecting and tracking new events, and the search procedure therefore becomes elaborate and expensive.
Therefore, there is a need for a system implementing a novel method for detecting new events and/or classes and tracking evolution of the new events in an inexpensive and automatic manner.
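The two-step process described above may be sketched as follows. This is only an illustration under stated assumptions: the greedy assignment rule, the threshold min_dot, the gap-based date split, and the sample documents are all hypothetical and do not reproduce any particular conventional system.

```python
from datetime import date

def dot(u, v):
    """Scalar product of two keyword vectors."""
    return sum(a * b for a, b in zip(u, v))

def cluster_by_keywords(docs, min_dot):
    """Step 1: a document joins the first cluster whose first member has a
    scalar product of at least min_dot with it; otherwise it starts a cluster."""
    clusters = []
    for doc in docs:
        for cluster in clusters:
            if dot(cluster[0]["vec"], doc["vec"]) >= min_dot:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters

def split_by_date(cluster, max_gap_days):
    """Step 2: split a keyword cluster wherever consecutive stamps are far apart."""
    ordered = sorted(cluster, key=lambda d: d["stamp"])
    groups = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if (cur["stamp"] - prev["stamp"]).days > max_gap_days:
            groups.append([])
        groups[-1].append(cur)
    return groups

# Hypothetical documents with binary keyword vectors and date stamps.
docs = [
    {"id": "a", "vec": [1, 1, 0], "stamp": date(2000, 1, 3)},
    {"id": "b", "vec": [1, 1, 0], "stamp": date(2000, 1, 4)},
    {"id": "c", "vec": [0, 0, 1], "stamp": date(2000, 1, 4)},
    {"id": "d", "vec": [1, 1, 0], "stamp": date(2000, 6, 1)},
]
result = [
    [[d["id"] for d in group] for group in split_by_date(c, max_gap_days=30)]
    for c in cluster_by_keywords(docs, min_dot=2)
]
print(result)  # [[['a', 'b'], ['d']], [['c']]]
```

The sketch makes the cost of the conventional approach visible: every document is visited once for the keyword clustering and again for the date/time clustering.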
DESCRIPTION OF RELATED ART
In “Maximizing text-mining performance”, IEEE Intelligent Systems, July/August, 1999, pp. 1307-1313 by S. Weiss et al. at IBM T. J. Watson Laboratory, a method for detecting and tracking new events, which uses a combination of decision tree algorithms and adaptive sampling, is disclosed. The method disclosed by Weiss et al. may provide a method for detecting and tracking new events, but has the disadvantage of requiring training sets of sample documents to compile a dictionary.
In “Topic detection and tracking pilot study final report”, Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, February, 1998, Morgan Kaufmann, San Francisco, pp. 194-218, by J. Allan et al., approaches from the University of Massachusetts, Amherst, Carnegie-Mellon University (CMU), and Dragon Systems are described. In the Dragon Systems approach, a probabilistic (Hidden Markov Model) method is used to cluster documents based on words and sentences in articles; it also has the disadvantage of requiring a training set to start the system. UMass (University of Massachusetts) uses a content based LCA (local content analysis) method, which is unacceptably slow for searching. The Carnegie-Mellon University system is directed to searching multimedia data such as audio news and video data, and is based on probabilistic methods.
In “Intelligent Information Retrieval”, IEEE Intelligent Systems, July/August, 1999, pp. 30-31 by Y. Young et al., a method which uses group average clustering and an independent time stamp weighting factor is disclosed. The weighting factor is also disclosed in E. Rasmussen, “Clustering algorithms”, pp. 419-442 in W. Frakes and R. Baeza-Yates (Editors), “Information Retrieval: data structures and algorithms”, Prentice-Hall, Englewood Cliffs, N.J., 1992, and in “Recent trends in hierarchic clustering: a critical review”, Information Processing and Management, Vol. 24, No. 5, pp. 577-597, 1988.
In “CMU Infomedia-KNN-based Topic Detection”:
http://www.informedia.cs.cmu.edu./HDWBerk/tsld001.htm, a training index with pre-labeled topics is provided.
The details are:
45000 broadcast news stories from 1995 to 1996,
3178 different news topics, each occurring more than 10 times
Search for top 10 related stories in training index
Lookup topics for related stories
Re-weight topics by story relevance (select top 5)
At 5 topics, Recall is reported to be 0.491 and Relevance is reported to be 0.482
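The lookup steps listed above may be sketched as a nearest-neighbour search over a pre-labeled index. The sketch is hypothetical: the scalar product as relevance score, the summing of scores per topic, the function name detect_topics, and the tiny training index are assumptions, since the cited page does not specify these details here.

```python
from collections import defaultdict

def dot(u, v):
    """Scalar product, used here as an assumed relevance score."""
    return sum(a * b for a, b in zip(u, v))

def detect_topics(query_vec, training_index, k=10, top_topics=5):
    """Find the k most related stories, look up their topics, and
    re-weight each topic by the relevance of the stories carrying it."""
    scored = sorted(
        ((dot(query_vec, story["vec"]), story) for story in training_index),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    topic_scores = defaultdict(float)
    for score, story in scored:
        for topic in story["topics"]:
            topic_scores[topic] += score
    # Highest weight first; ties broken alphabetically for determinism.
    ranked = sorted(topic_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [topic for topic, _ in ranked[:top_topics]]

# Hypothetical pre-labeled training index.
training_index = [
    {"vec": [1, 1, 0], "topics": ["markets"]},
    {"vec": [1, 0, 0], "topics": ["markets", "mergers"]},
    {"vec": [0, 0, 1], "topics": ["elections"]},
]
print(detect_topics([1, 1, 0], training_index, k=2))  # ['markets', 'mergers']
```

The dependence on training_index illustrates why such an approach requires pre-labeled topics before it can be used.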
In “NIST Topic Detection and Tracking Evaluation Project”:
http://www.itl.nist.gov/iaui/894.01/proc/darpa98/index.htm, the U.S. National Institute of Standards and Technology (NIST) discloses the results of an evaluation conducted in 1997, as listed in Table I.
TABLE I*
RUN       % Miss    % f/a    % Recall    % Prec
CMU1      38        0.09     62          67
CMU2      17        0.32     83          43
Dragon    39        0.08     61          69
UMass1    66        0.09     34          53
UMass2    67        0.5      33          16
*% Miss denotes miss rate, % f/a denotes false alarm rate, % Recall denotes recall rate, and % Prec denotes precision rate.
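The four rates in Table I may be related to raw detection counts as in the following sketch. The formulas are the usual definitions of these rates and are assumed here rather than taken from the NIST report; the counts in the example are invented for illustration and are not NIST data. Under these definitions % Recall equals 100 minus % Miss, which is consistent with every row of Table I.

```python
def detection_rates(hits, misses, false_alarms, correct_rejections):
    """Return miss, false alarm, recall, and precision rates in percent."""
    targets = hits + misses                        # stories that are on topic
    non_targets = false_alarms + correct_rejections  # stories that are off topic
    return {
        "miss": 100.0 * misses / targets,
        "f/a": 100.0 * false_alarms / non_targets,
        "recall": 100.0 * hits / targets,
        "prec": 100.0 * hits / (hits + false_alarms),
    }

# Invented counts, for illustration only.
rates = detection_rates(hits=80, misses=20, false_alarms=10, correct_rejections=990)
print(rates["miss"], rates["f/a"], rates["recall"])  # 20.0 1.0 80.0
print(round(rates["prec"], 1))                       # 88.9
```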
In “DARPA Broadcast News Workshop”:
http://www.itl.nist.gov/iaui/894.01/proc/darpa99/index.htm, reports from a dozen or so US institutions which received funding for event tracking and detection are described (TDT2: Topic Detection and Tracking 1998).
SUMMARY OF THE INVENTION
An object of the present invention is to provide a novel method for detecting new events and/or classes of the documents and tracking evolution thereof in a database.
Another object of the present invention is to provide a novel system for detecting new events and/or classes of the documents and tracking evolution thereof in a database.
Further, another object of the present invention is to provide a novel program product for detecting new events and/or classes of the documents and tracking evolution thereof in a database.
The present invention essentially utilizes a novel method for detecting and tracking new events and/or classes of documents in a very large database by simultaneously taking into account a time stamp parameter, such as date and time, in a vector modeled document.
In a first aspect of the present invention, a method for detecting new events and/or classes of documents and tracking evolution thereof in a database, said new event and/or classes of said documents being added to said database, said documents including attribute data related to a temporal parameter, said method comprises steps of:
providing vectors of said documents based on attribute data simultaneously including said temporal parameter included in said document, and
detecting said new events and/or classes of said documents and tracking evolution thereof simultaneously using said vectors.
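The two steps above may be illustrated by the following minimal sketch, in which the temporal parameter is carried as one additional component of the document vector so that a single pass handles keywords and time together. The sketch is not the claimed method itself: the scaling factor time_weight, the Euclidean distance, the greedy seed-based grouping, the threshold max_distance, and all function names are assumptions chosen for illustration.

```python
import math
from datetime import date

def document_vector(keyword_flags, stamp, origin, time_weight):
    """Keyword components followed by one temporal component
    (days since origin, scaled by an assumed weight)."""
    return list(keyword_flags) + [time_weight * (stamp - origin).days]

def detect_events(vectors, max_distance):
    """Single pass: a vector farther than max_distance from every existing
    seed is treated as a new event; otherwise it joins the nearest seed."""
    seeds, labels = [], []
    for vec in vectors:
        distances = [math.dist(vec, seed) for seed in seeds]
        if distances and min(distances) <= max_distance:
            labels.append(distances.index(min(distances)))
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels

# Hypothetical documents: binary keyword flags plus a date stamp.
origin = date(2000, 1, 1)
docs = [
    ([1, 1, 0], date(2000, 1, 3)),
    ([1, 1, 0], date(2000, 1, 4)),
    ([0, 0, 1], date(2000, 1, 4)),
    ([1, 1, 0], date(2000, 6, 1)),
]
vectors = [document_vector(k, s, origin, time_weight=0.1) for k, s in docs]
print(detect_events(vectors, max_distance=1.0))  # [0, 0, 1, 2]
```

In this example the fourth document shares its keywords with the first two but is separated from them by its temporal component alone, so it is labeled as a new event without a second clustering step.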
In the first aspect of the present invention, said attributed data may include at least one keyword, and said keyword is w
Kobayashi Mei
Malassis Loic
Piperakis Romanos
Corrielus Jean M.
Dang Thu Ann
Dougherty Anne V.
Hwang Joon Hwan
International Business Machines - Corporation