Reexamination Certificate
2000-08-24
2003-12-16
Rones, Charles (Department: 2175)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000
active
06665668
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to document retrieval techniques for retrieving a registered document in accordance with an input query expression and displaying information of the retrieved document.
2. Description of the Related Art
In recent years, the number of electronic documents created with word processors and the like has been increasing, and it is expected to keep increasing in the future. Databases used for document retrieval are also becoming large in scale. Therefore, the set of documents obtained as a search result is also becoming large, and it is difficult for a user to find the truly desired document among them.
To address this problem, the related art includes a ranking technique, which is described in detail in “Ranking Algorithms”, by Donna Harman, Information Retrieval, pp. 363-392. This technique is hereinafter called “Related Art 1”. Related Art 1 provides a technique for calculating a factor that indicates how likely a document is to be similar to the contents of a query expression (a sentence, a document, or a sequence of words) designated by the user. An example will be described with reference to FIG. 2.
A retrieval (or search) is realized by a simple vector operation. Each element of the vector corresponds to one of the distinct words appearing in the database, with duplicates removed and with stop words and the like excluded. In the example shown in FIG. 2, the elements are (factors, information, help, human, operation, retrieval, systems). “1” is set at the corresponding position if the query expression contains the element, and “0” is set if it does not. In this manner, vector Q0 of the query expression is formed. That is, vector Q0 (1, 1, 0, 1, 0, 1, 1) is formed for the query expression “human factors in information retrieval systems”.
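The formation of the query vector can be sketched as follows. This is only a minimal illustration of the FIG. 2 example, not the implementation of Related Art 1: the function name `to_binary_vector` and the lowercase whitespace tokenization are assumptions made for the sketch.

```python
# Vector elements in the order used in the FIG. 2 example.
VOCABULARY = ["factors", "information", "help", "human",
              "operation", "retrieval", "systems"]

def to_binary_vector(text, vocabulary=VOCABULARY):
    """Return 1 for each vocabulary word that appears in text, else 0."""
    # Assumption: splitting on whitespace is sufficient for this example.
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

q0 = to_binary_vector("human factors in information retrieval systems")
print(q0)  # [1, 1, 0, 1, 0, 1, 1]
```

The stop word “in” has no element in the vocabulary, so it does not contribute to the vector.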
A vector is similarly formed for each document in the database. Vector V1 (1, 1, 0, 1, 0, 1, 0) is formed for Document 1 containing “factors”, “information”, “human” and “retrieval”. Vector V2 (1, 0, 1, 1, 0, 0, 1) is formed for Document 2 containing “factors”, “help”, “human” and “systems”. Vector V3 (1, 0, 0, 0, 1, 0, 1) is formed for Document 3 containing “factors”, “operation” and “systems”.
A score used for ranking is calculated from the vector operation Vi·Q0 between vector Q0 of the query expression and vector Vi (i=1, 2, 3) of each document. The calculation results are score “4” for Document 1, score “3” for Document 2, and score “2” for Document 3. Each score represents the similarity to the query expression as judged by the system. A document with a higher score is more likely to be similar to the contents of the query expression.
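The score calculation above is an inner product of two vectors. The following sketch reproduces the three scores from the vectors given in the text; the helper name `dot` and the dictionary layout are assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Query vector and document vectors as given in the FIG. 2 example.
q0 = [1, 1, 0, 1, 0, 1, 1]
documents = {
    "Document 1": [1, 1, 0, 1, 0, 1, 0],
    "Document 2": [1, 0, 1, 1, 0, 0, 1],
    "Document 3": [1, 0, 0, 0, 1, 0, 1],
}

scores = {name: dot(vec, q0) for name, vec in documents.items()}
# Rank documents from highest to lowest score.
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)   # {'Document 1': 4, 'Document 2': 3, 'Document 3': 2}
print(ranking)  # ['Document 1', 'Document 2', 'Document 3']
```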
Instead of expressing each element of the vector as “1” or “0”, the element may be expressed by the weight of the word (calculated from the location frequency of the word, the location deviation of the word in the document database, or the like). For example, if the weight of “factors” is “2”, the weight of “information” is “3”, the weight of “human” is “5” and the weight of “retrieval” is “3”, then vector V′1 (2, 3, 0, 5, 0, 3, 0) can be formed for Document 1. Similarly, if the weight of “factors” is “2”, the weight of “help” is “4”, the weight of “human” is “5” and the weight of “systems” is “1”, then vector V′2 (2, 0, 4, 5, 0, 0, 1) can be formed for Document 2. Furthermore, if the weight of “factors” is “2”, the weight of “operation” is “2” and the weight of “systems” is “1”, then vector V′3 (2, 0, 0, 0, 2, 0, 1) can be formed for Document 3.
The score of each document can be calculated from the vector operation V′i·Q0 between each weighted vector V′i (i=1, 2, 3) and the query expression vector Q0. The calculation results are score “13” for Document 1, score “8” for Document 2 and score “3” for Document 3. Each score represents the similarity to the query expression as judged by the system in consideration of the weight of each word, i.e., its degree of importance. A document with a higher score is more likely to be similar to the contents of the query expression. That is, the search result shows that Document 1 is the most likely to be similar to the contents of the query expression.
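The weighted scores can be checked with the same inner product. In this sketch the per-document weights are the ones stated in the text, while the function name `to_weighted_vector` and the data layout are assumptions; how the weights themselves are derived is not shown.

```python
VOCABULARY = ["factors", "information", "help", "human",
              "operation", "retrieval", "systems"]

def to_weighted_vector(weights, vocabulary=VOCABULARY):
    """Place each word's weight at its vocabulary position; absent words get 0."""
    return [weights.get(term, 0) for term in vocabulary]

# Word weights per document, as stated in the example.
doc_weights = {
    "Document 1": {"factors": 2, "information": 3, "human": 5, "retrieval": 3},
    "Document 2": {"factors": 2, "help": 4, "human": 5, "systems": 1},
    "Document 3": {"factors": 2, "operation": 2, "systems": 1},
}

q0 = [1, 1, 0, 1, 0, 1, 1]  # binary query vector from the example

weighted_scores = {}
for name, weights in doc_weights.items():
    vec = to_weighted_vector(weights)
    weighted_scores[name] = sum(a * b for a, b in zip(vec, q0))

print(weighted_scores)  # {'Document 1': 13, 'Document 2': 8, 'Document 3': 3}
```

Note that the weight “4” of “help” in Document 2 and the weight “2” of “operation” in Document 3 do not contribute, because the query vector is 0 at those positions.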
In Related Art 1, the factor which shows the possibility of being similar to the contents of the query expression is calculated. By browsing the documents in the order given by this factor, the desired document can be found quickly in a large-scale document database. However, whether or not a document in the search result is really the desired document must still be judged by the user by actually reading its contents. As a technique for supporting a quick judgement of whether or not a document obtained as the search result is really the desired document, there is the document highlighting technology, which is hereinafter called “Related Art 2”.
In Related Art 2, when the contents of a document obtained as the search result are displayed, a portion containing a character string of the query expression designated by the user is displayed in a display format (hereinafter called “a highlight”) different from that of other character string portions. The display format includes color, size, font, style (bold or roman) and the like. Because the portion containing the character string of the query expression is displayed in a format different from that of the other portions, the position containing the word can be recognized at once. As a result, whether or not the document is the desired document can be judged faster than by reading the document from its beginning.
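The highlighting described above can be sketched as a simple text substitution. This is an illustrative assumption, not the display method of Related Art 2: the function name `highlight` and the `[[` / `]]` markers stand in for whatever display format (color, size, font, style) a real system would apply.

```python
import re

def highlight(text, terms, start="[[", end="]]"):
    """Wrap every occurrence of a query term in start/end markers."""
    if not terms:
        return text
    # Longest terms first, so a longer term wins over one that is its prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = "|".join(re.escape(term) for term in ordered)
    return re.sub(pattern,
                  lambda m: f"{start}{m.group(0)}{end}",
                  text,
                  flags=re.IGNORECASE)

print(highlight("Human factors in retrieval systems", ["human", "retrieval"]))
# [[Human]] factors in [[retrieval]] systems
```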
A word is often used as the element of the vector in the ranking technique of Related Art 1. In a language such as English, in which words are written with delimiters between them, all words except stop words (such as “in” and “the”) are used as the vector elements. In a language such as Japanese, in which words are not written with delimiters, the vector elements are character strings obtained by dividing the text at changes of character type, sequences of n consecutive characters (“n” is a predetermined integer of “1” or larger), words derived with reference to a dictionary or the like, and so forth. As a result, if a document or a long sentence is designated as the query expression to execute the retrieval, and the document obtained as the search result is displayed in accordance with the highlighting technology of Related Art 2, the number of character strings to be highlighted becomes large. This causes the problem that the important portion becomes difficult to find.
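The extraction of n consecutive characters mentioned above can be sketched as follows. The function name `char_ngrams` and the sample string are assumptions chosen for illustration; the other extraction methods (character-type boundaries, dictionary lookup) are not shown.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    if n < 1:
        raise ValueError("n must be an integer of 1 or larger")
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A four-character string yields three overlapping 2-character elements.
print(char_ngrams("文書検索", 2))  # ['文書', '書検', '検索']
```

Because the elements overlap, even a short query produces many character strings, which is consistent with the growth in highlighted strings described above.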
This problem will be described with reference to FIG. 3 by taking a newspaper article database as an example. In this example, a newspaper article document regarding the stadium invitation for the football World Cup is designated as the query expression to execute the retrieval.
First, character strings used for the retrieval are extracted from document “Football match stadiums for W-Cup will be determined next month, selection right attributed to Association. The organizing arrangement committee for the 2002 football world cup under the joint auspices of Japan and Korea opened on 29th, a governor/mayor meeting is held by calling special directors from fifteen local self-governing bodies which are candidates for organizing the stadium. For the number of stadiums in Japan, Federation International de Football Association (FIFA) . . . ” which is designated as the query expression. In the example shown in FIG. 3, nouns, katakana characters and gerunds, which are identified by referring to a dictionary and the like, are extracted as the character strings used for the retrieval. As a result, “football, W-Cup, match, stadium, next, month, determined, selection, right, Association,
Inaba Yasuhiko
Matsubayashi Tadataka
Sugaya Natsuko
Tada Katsumi
Ushiroji Yousuke
Antonelli Terry Stout & Kraus LLP
Hitachi , Ltd.
Rones Charles
Wu Yicun