Retrieving and ranking of documents from database description

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C715S252000

Reexamination Certificate

active

06678690

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to retrieving and/or ranking of documents in a large database, and more particularly relates to a method, a computer system, and a program product for retrieving and ranking the documents in a very large database by dimension reduction of a document matrix using a covariance matrix.
BACKGROUND ART
Recently as database systems handle increasingly large amounts of data, such as, for example, news data, client information, stock data, etc, it becomes increasingly difficult for users of such databases to search for desired information quickly and effectively, with sufficient accuracy. Timely, accurate, and inexpensive detection of new topics and/or events from large databases may provide very valuable information for many types of businesses including, for example, stock control, futures and options trading, news agencies which may desire to quickly dispatch a reporter without needing to maintain a number of reporters posted worldwide, and businesses based on the internet or other fast paced actions which need to know of major and new information about competitors in order to succeed.
Conventionally, detection and tracking of new events in an enormous database is expensive, elaborate, and time consuming work, because generally a searcher of the database needs to hire extra persons for monitoring tasks.
Recent detection and tracking methods used for search engines usually use a vector model for data in the database in order to cluster the data. These conventional methods generally construct a vector q (kwd1, kwd2, . . . , kwdN) corresponding to the data in the database. The vector q is defined as the vector having the dimension equal to numbers of attributes, such as kwd1, kwd2, . . . kwdN which are attributed to the data. The most commonly used attributes are keywords, i.e., single keywords, phrases, names of person(s), place(s). Usually, a binary model is used to create the vector q mathematically in which the kwd1 is replaced by 0 when the data do not include the kwd1, and the kwd1 is replaced by 1 when the data include the kwd1. Sometimes, a weight factor is combined with the binary model to improve the accuracy of the search. Such weight factor includes, for example, the number of times the keywords occur in the data.
FIG.
1
(
a
) and FIG.
1
(
b
) show typical methods for diagonalization of a document matrix D which is comprised of the above described vectors where the matrix D is assumed to be an n-by-n symmetric definite positive matrix. As shown, the n-by-n matrix D may be diagonalized by two representative methods depending on the size of the matrix D. When n is relatively small in the n-by-n matrix D represented at
20
, the method used may typically be Householder bidiagonalization and the matrix D is transformed to the bidaiagonalized form as shown at
22
in FIG.
1
(
a
) followed by zero chasing of the bidiagonalized elements at
24
to construct the matrix Vr consisting of the eigenvectors of the matrix D at
26
.
In FIG.
1
(
b
) another method for the diagonalization is described, and the diagonalization method shown in FIG.
1
(
b
) as represented at
30
may be effective when the number n of the n-by-n matrix D is large or medium. The diagonalization process first executes Lanczos tridiagonalization as shown in FIG.
1
(
b
) at
32
followed by Sturm Sequencing at
34
to determine the eigenvalues wherein “r” denotes the rank of the reduced document matrix. The process next executes Inverse Iteration at
36
so as to determine the i-th eigenvectors associated to the eigenvalues previously found as shown in FIG.
1
(
b
) as shown at
38
.
In so far as the size of the database is still acceptable for application of precise and elaborate methods to complete computation of the eigenvectors of the document matrix D, the conventional methods are quite effective to retrieve and to rank the documents in the database. However, in a very large database, the computation time for retrieving and ranking of the documents is sometimes too long for a user of a search engine. There are also limitations for the resources of computer systems, such as CPU performance and memory capacities needed for completing the computation.
Therefore, there are needs for a system implemented with a novel method for stably retrieving and ranking the documents in very large databases in an inexpensive, automatic manner within acceptable computation time.
DISCLOSURE OF THE PRIOR ART
Some statistical approaches have been proposed using algorithms for information retrieval based on vector space models (see, for example, Baeza-Yates, R., Riberio-Neto, B., “Modern Information Retrieval”, Addition-Wesley, NY, 1999, and Manning, C. Schutze, H., “Foundations of Statistical Natural Language Processing”, MIT Press, Cambridge, Mass., 1999).
Salton, G. et al., “The SMART Retrieval System—Experiments in Automatic Document Processing”, Prentice-Hall, Englewood Cliffs, N.J., 1971, have reviewed the vector space model. They modeled the documents using vectors in which each coordinate of the vectors represents an attribute of the vectors, e.g., a keyword. In binary models of the vector, a coordinate takes on the value unity when the corresponding attribute is present in the documents and zero when the attribute is absent from the document. More sophisticated document vector models take into account weighting of the keyword such as frequency and location of appearance, e.g., in the title, section header, or abstract.
Queries are also modeled as vectors in the same manner as described for the documents. For a given user input query, the relevancy of a particular document is computed by determining the “distance” between the query and each of the document vectors. Although a number of different kinds of norms may be used to determine the “distance” between the query vector and the document vector, the angle between the query and the document vector derived from a scalar product is used as the most common procedure to determine the distance therebetween.
U.S. Pat. No. 4,839,853 issued to Deerwester et al., entitled “Computer information retrieval using latent semantic structure”, and Deerwester et. al., “Indexing by latent semantic analysis”, Journal of the American Society for Information Science, Vol. 41, No. 6, 1990, pp. 391-407 disclose a unique method for retrieving the document from the database. The disclosed procedure is roughly reviewed as follows;
Step 1: Vector space modeling of documents and their attributes.
In latent semantic indexing, or LSI, the documents are modeled by vectors in the same way as in Salton's vector space model. In the LSI method, the relationship between the query and the documents in the database are represented by an m-by-n matrix MN, the entries are represented by mn (i, j), i.e.,
MN=[mn
(
i,j
)].
In other words, the rows of the matrix MN are vectors which represent each document in the database.
Step 2: Reducing the Dimension of the Ranking Problem via Singular Value Decomposition.
The next step of the LSI method executes singular value decomposition, or SVD of the matrix MN. Noises in the matrix MN are reduced by constructing a modified matrix A
k
from the k-th largest singular values
i
, wherein i=1, 2, 3, . . . , k, . . . and their corresponding eigenvectors are derived from the following relation;
MN
k
=U
k
&Sgr;
k
V
k
T
,
wherein &Sgr; is a diagonal matrix with monotonically decreasing diagonal elements of
i
. The matrices U
k
and V
k
are the matrices whose columns are left and right singular vectors of the k-th largest singular values of the matrix MN.
Step 3: Query Processing.
Processing of the query in LSI-based Information Retrieval comprises two further steps: (1) query projection followed by (2) matching. In the query projection step, input queries are mapped to pseudo-documents in the reduced query-document space by the matrix U
k
, and then are weighted by the corresponding singular values
i
from the reduced rank and singular matrix &Sgr;. This process may be described m

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Retrieving and ranking of documents from database description does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Retrieving and ranking of documents from database description, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Retrieving and ranking of documents from database description will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3189554

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.