Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2001-06-28
2003-12-30
Amsbury, Wayne (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
Reexamination Certificate
active
06671683
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to a similar document retrieving apparatus which designates one or plural document data from a document database (i.e., set or assembly of document data) which is electronically stored as strings of character codes and machine treatable or processible, or designates an arbitrary sentence not involved in this database, as a typical example. The similar document retrieving apparatus retrieves one or more documents similar to the designated typical example from the document database. Furthermore, the present invention relates to a relevant keyword extracting apparatus which extracts one or more keywords relating to the “typical example” from the document database. The relevant keyword extracting apparatus presents the extracted keywords to the users of this document database as an aid for comprehension of the retrieved document contents, or as a hint for preferable retrieval conditions (i.e., queries). Especially, the present invention makes it possible to perform highly accurate document retrieval and keyword extraction.
Due to recent spread of wordprocessors and personal computers as well as large-scale and low-cost storage media, such as CD-ROM and DVD-ROM, and development of network, such as Ethernet, all of the documents or most of character information can be practically stored as strings of character codes in a full text database. Such database is now widely used.
According to a conventional full text database, in retrieving the documents, a Boolean expression of keywords is generally designated as queries. It is checked whether or not a designated keyword appears in the documents. And, a document set satisfying the Boolean expression is obtained as a retrieval result.
Recently, a so-called document ranking technique is introduced and practically used. According to this ranking technique, the relevancy between each document in the obtained document set and the retrieval conditions (i.e., queries) is obtained according to a so-called “tf·idf” method or the like. Then, the documents are ranked in order of relevancy and are presented to the users.
However, this conventional full text database system is disadvantageous in the following points.
(1) When no appropriate keywords come up in mind or are found, it is difficult to designate appropriate retrieval conditions (i.e., queries).
(2) Describing a complicated Boolean expression requires a high skill and enough time.
(3) For the synonymy problem, there will be a possibility that an intended document cannot be retrieved.
In view of these problems, research and development for a similar document retrieving system or a relevant keyword extracting system has recently become vigorous so as to effectively retrieve documents similar to a designated typical example or to extract and display relevant keywords relating to the designated documents or word set.
U.S. Pat. No. 4,839,853 discloses a conventional method for retrieving similar documents, which is called as LSI (latent semantic indexing) method.
To make clear the difference between the present invention and the LSI method, the gist of the LSI method will be explained.
When applied to a document database D containing N document data, the LSI method mechanically extracts a keyword, i.e., a characteristic word representing each document, to record the frequency of occurrence (i.e., the number of times) of each keyword appearing in each document. It is now assumed that a total of M kinds of keywords are extracted from the document database D.
Extracted keywords are aligned according to a dictionary order or an appropriate order. Then, a frequence-of-appearance f
dt
of a t-th keyword is expressed as an element of d-th line and t-th row of a matrix F. Then, trough a matrix operation called as incomplete singular value decomposition, this matrix F is approximately decomposed into a product of a matrix U of N lines and K rows having document-side singular vector in each row, a diagonal matrix &Lgr; of K lines and L rows having singular values aligned as diagonal elements, and a matrix V of K lines and M rows having a keyword-side singular vector in each line. In this case, K is sufficiently small compared with N and M. As a result, the original frequency-of-occurrence matrix F can be approximately expressed by a lower-rank matrix.
A total of K document-side singular vectors are obtained through the above decomposition. Thus, a feature vector U
d
of the document d is obtained as a K-dimensional vector containing respective d-th components of the obtained K document-side singular vectors. Similarly, a total of K keyword-side singular vectors are obtained through the above decomposition. Thus, a feature vector V
t
of the keyword t is obtained as a K-dimensional vector containing respective t-th components of the obtained K keyword-side singular vectors.
Subsequently, calculation of similarity and relevancy is performed according to the following three procedures so as to obtain documents and keywords having higher similarities and relevancies, thereby realizing the similar document retrieval and the relevant keyword extraction.
(1) The similarity between two documents a and b is obtained by calculating an inner product U
a
·U
b
between the document feature vectors U
a
and U
b
of these documents a and b.
(2) The relevancy between two keywords Ka and Kb is obtained by calculating an inner product V
a
·V
b
between two keyword feature vectors V
a
and V
b
of these keywords Ka and Kb.
(3) Keyword extraction result from an arbitrary (external) document is represented by a M-dimensional vector E having components representing frequency-of-occurrence values of M keywords appearing in this document. A retrieval condition document feature vector P
e
corresponding to this external document is represented by an expression U
e
=&Lgr;
−1
VE. Then, the similarity between this external document and the document d in the document database is obtained as a product U
d
·U
e
. The above-described procedures are a fundamental framework of the LSI method.
However, if the keyword frequency-of-appearance f
dt
is directly used in the application of the LSI method to an actual document database, the feature vector obtained will be somewhat deviated due to presence of longer documents or frequently appearing keywords. This will significantly worsen the accuracy of similar document retrieval.
Hence, the LTC method conventionally used in the relevant ranking of a document retrieving system or a comparative method is introduced to convert or normalize the keyword frequency-of-occurrence f
dt
. Then, a frequency-of-occurrence matrix F is created so as to contain the normalized frequency-of-occurrence values. Then, the incomplete singular value decomposition is performed to obtain a feature vector.
For example, according the LTC conversion, the following equation is used to calculate a frequency-of-occurrence LTC (f
dt
) based on the actual frequency-of-occurrence f
dt
the number n
t
of documents containing the keyword t. A matrix containing this value is subjected to the incomplete singular value decomposition.
LTC
⁡
(
f
dt
)
=
(
1
+
log
2
⁢
f
dt
)
⁢
log
2
⁡
(
1
+
N
n
i
)
∑
j
⁢
{
(
1
+
log
2
⁢
f
dj
)
⁢
log
2
⁡
(
1
+
N
nj
)
}
2
(
1
)
However, the conversion of keyword frequency-of-occurrence by the conventional LSI method causes the following problems.
Analysis according to the LSI method is performed on the assumption that a d-th line of the matrix F represents the feature of document d and a t-th row of the matrix F represents the feature of keyword t. In a first conversion, a square-sum of line elements can be normalized to 1. However, a square-sum of row elements cannot be normalized to 1. Accordingly, the performed conversion becomes asymmetric between the document side and the keyword side. Thus, the simple conversion using the above equation 1 cannot normalize both of the document side and the keyword side to 1. Such asymmetry can be found in a conversion using other equation.
Furthermore, when a loga
Al-Hashemi Sana
Amsbury Wayne
Matsushita Electric - Industrial Co., Ltd.
Parkhurst & Wendel L.L.P.
LandOfFree
Apparatus for retrieving similar documents and apparatus for... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Apparatus for retrieving similar documents and apparatus for..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Apparatus for retrieving similar documents and apparatus for... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3174396