METHOD AND SYSTEM FOR EXTRACTING CHARACTERISTIC STRING,...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06473754

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates to a method and system for extracting a character string indicative of a feature of the contents described in a document; to a method and system that use the first-mentioned method and system to search a document database for a document or documents whose contents are similar to those of a document specified by a user; and to a storage medium storing a search program.
As the use of personal computers and the Internet spreads, the number of electronic documents has increased explosively in recent years, and this growth is expected to accelerate. In such circumstances, there is strong demand for a way to search quickly and efficiently for a document or documents containing the information a user wants.
One technique for satisfying this demand is full-text search. In full-text search, the documents to be searched are registered as text in a computer system to create a database, and the system searches the database for a document or documents containing a search character string (hereinafter referred to as a query term) specified by a user. Because the search is carried out on the character strings in the documents themselves, full-text search can find any word, unlike a prior art keyword search system that relies on previously set keywords.
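As a minimal sketch of the full-text search described above, the following Python fragment scans the registered text of every document for a query term. The function name `full_text_search`, the document identifiers, and the sample texts are hypothetical and not taken from the cited systems; a practical system would likely use an index rather than a linear scan.

```python
def full_text_search(database, query_term):
    """Return the IDs of documents whose registered text contains query_term."""
    return [doc_id for doc_id, text in database.items() if query_term in text]


# Hypothetical database: document ID -> registered text.
database = {
    "doc1": "Manners matter when a portable phone is in use.",
    "doc2": "Use of portable phones in trains is banned.",
    "doc3": "The meeting was postponed until next week.",
}

# Any character string can serve as the query term; no keyword list is needed.
matches = full_text_search(database, "portable phone")
print(matches)
```

Because the match is made against the text itself, a term never registered as a keyword can still be found.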
However, to reliably find a document or documents containing the desired information, the user must compose a complex search condition expression that accurately reflects the user's search intention and enter it into the system. This is a difficult task for ordinary users who are not experts in information search.
To eliminate this burden, much attention is now focused on relevant document search, in which the user presents as an example a document containing the desired contents (hereinafter referred to as a ‘seed’ document) and the system searches for a document or documents similar to the seed document.
One disclosed relevant document search method is, for example, the technique of JP-A-8-335222 (hereinafter referred to as the prior art 1), which extracts the words contained in a seed document through morphological analysis and searches for a relevant document or documents based on the extracted words.
In the prior art 1, words contained in a seed document are extracted through morphological analysis to search for a relevant document or documents containing the words. For example, when the seed document is a document 1 of “ . . .
User's manner when the portable phone is in use becomes important.) . . . ”, words such as
(portable phone)”,
(manner)” and
(important)” are extracted by looking up a word dictionary during morphological analysis. As a result, the system can retrieve a document 2 of “ . . .

(Use of portable phones in trains is banned) . . . ” containing
as a relevant document.
However, the prior art 1, which uses the word dictionary for word extraction, has the two problems described below.
The first problem is that, when a word not listed in the word dictionary expresses the essential contents of the seed document (hereinafter referred to as the central concept), the central concept cannot be searched for accurately even if a similarity search is carried out with the other words, because the essential word cannot be extracted from the seed document as a search word. In other words, when the information desired by the user is a new word that is not listed in the word dictionary, the search undesirably returns a document or documents whose concepts deviate from the target central concept.
The second problem is that, even when the word desired by the user is listed in the word dictionary, a document or documents whose concepts deviate from the central concept may be retrieved, depending on how the words are extracted. For example, words such as
,
, and
are extracted from the above document 1 of “ . . .
. . . ”. However, a document 3 of “ . . .

(I got advice about how to talk on the phone) . . . ” is then likely to be assigned a low similarity, because the word
cannot be extracted.
This is because all search words are extracted by way of the word dictionary.
The problems in the prior art 1 have been explained above.
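The two problems can be illustrated with a small sketch. It assumes a greedy longest-match lookup as a simplified stand-in for morphological analysis, which in practice is considerably more elaborate. The function name, the dictionary contents, and the Japanese sample strings (携帯電話 "portable phone", 電話 "phone", 使用 "use", 検索 "search", エンジン "engine") are illustrative choices, not the examples of the source, whose original strings did not survive extraction.

```python
def extract_dictionary_words(text, dictionary):
    """Greedy longest-match extraction of dictionary words from unsegmented text."""
    max_len = max(map(len, dictionary), default=0)
    found = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                found.append(candidate)
                i += length
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return found


# Hypothetical word dictionary.
dictionary = {"携帯電話", "電話", "使用", "検索"}

# First problem: "エンジン" is not listed, so it is never extracted.
print(extract_dictionary_words("検索エンジン", dictionary))

# Second problem: "電話" is listed, but it is swallowed by the longer "携帯電話".
print(extract_dictionary_words("携帯電話の使用", dictionary))
```

In the second call the listed word 電話 never becomes a search word, so a document that mentions only 電話 would share no term with the seed document.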
To solve the above problems, Japanese Patent Application No. 9-309078 proposes a technique (hereinafter referred to as the prior art 2) that searches for a relevant document or documents without using any word dictionary: character strings of n consecutive characters (hereinafter referred to as n-grams) are extracted mechanically, in a manner that depends on the character type, such as ‘Kanji’ or ‘Katakana’.
In the prior art 2, the way n-grams are extracted is changed according to the character type so that meaningful n-grams (hereinafter referred to as characteristic strings) are obtained. For example, 2-grams are mechanically extracted from a string of Kanji characters (hereinafter referred to as a Kanji character string), whereas from a string of Katakana characters (hereinafter referred to as a Katakana character string) the longest run of Katakana characters (hereinafter referred to as a Katakana longest character string) is extracted as it is. In this case, characteristic strings such as
,
,
,
,
, and
are extracted from the above document 1 of “ . . .

. . . ” as a seed document. That is, since the character string
is also extracted without being missed, even the document 3 of “ . . .

. . . ” can be retrieved with a correctly calculated similarity.
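The character-type-dependent extraction described for the prior art 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the Unicode ranges used to recognize Kanji and Katakana, the function name, and the sample string 文書検索システム ("document search system") are hypothetical choices, and Kanji runs shorter than n are simply dropped because the source does not say how such runs are handled.

```python
import re

# Assumed Unicode ranges: CJK Unified Ideographs and the Katakana block.
KANJI_RUN = re.compile(r"[\u4E00-\u9FFF]+")
KATAKANA_RUN = re.compile(r"[\u30A0-\u30FF]+")


def extract_characteristic_strings(text, n=2):
    """Extract n-grams from Kanji runs and whole (longest) Katakana runs."""
    strings = []
    for run in KANJI_RUN.findall(text):
        # Slide a window of n characters over each Kanji run.
        strings.extend(run[i:i + n] for i in range(len(run) - n + 1))
    # Each maximal Katakana run is taken as a single characteristic string.
    strings.extend(KATAKANA_RUN.findall(text))
    return strings


print(extract_characteristic_strings("文書検索システム"))
```

No word dictionary is consulted, so unlisted words are still covered. The window also produces 書検, a 2-gram that straddles the two words 文書 and 検索.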
In the prior art 2, however, a Kanji character string that forms a compound word may yield an n-gram that straddles the boundary between its component words. As a result, a similarity is calculated even for a document whose contents are not similar to the seed document, so that a document unrelated to the seed document is undesirably retrieved. For example, for the characteristic string
extracted from the document 1 of “ . . .

. . . ” as a seed document, a similarity is calculated, which results in the erroneous retrieval of a document 4 of “ . . .

(In order to prevent charging, it must be grounded.) . . . ” as a relevant document.
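The erroneous match can be reproduced with a short sketch. The Japanese strings are illustrative assumptions rather than text recovered from the source: 携帯電話 ("portable phone") stands in for the seed document and 帯電を防ぐ ("to prevent charging") for the unrelated document. The Unicode range for Kanji and the function name are likewise hypothetical.

```python
import re

KANJI_RUN = re.compile(r"[\u4E00-\u9FFF]+")  # assumed range for Kanji


def kanji_bigrams(text):
    """Return the set of 2-grams taken from every Kanji run in text."""
    grams = set()
    for run in KANJI_RUN.findall(text):
        grams.update(run[i:i + 2] for i in range(len(run) - 1))
    return grams


seed = "携帯電話"        # compound of 携帯 (portable) and 電話 (phone)
unrelated = "帯電を防ぐ"  # about electrostatic charging, not phones

# The 2-gram spanning the word boundary in the seed also occurs in the other text.
shared = kanji_bigrams(seed) & kanji_bigrams(unrelated)
print(shared)
```

Any similarity measure built on shared characteristic strings would give the unrelated text a non-zero score through this single boundary-spanning 2-gram.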
To solve the above problem, a technique for extracting a characteristic string using statistical information on 1-grams (hereinafter referred to as the prior art 3) has been proposed in the Journal of the Information Processing Society of Japan, Vol. 38, No. 11, pp. 2286 to 2297, November 1997.
In the prior art 3, for each 1-gram appearing in a document to be registered, the probability that the 1-gram forms the head of a word (hereinafter referred to as a head-position probability) and the probability that the 1-gram forms the tail of a word (hereinafter referred to as a tail-position probability) are calculated in advance, at the time the document is registered. Here it is assumed that a word consists of a string of a single type of characters such as Kanji or Katakana (hereinafter referred to as a single character type string) and is delimited at a character type boundary, such as the boundary between Kanji and Katakana; the 1-gram located directly after a character type boundary is regarded as the head 1-gram of a word, and the 1-gram located directly before a character type boundary is regarded as a tail 1-
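The head-position and tail-position statistics described above can be sketched as follows. Several details are assumptions not confirmed by the source: each probability is taken as the number of times a 1-gram appears at the head (or tail) of a single character type string divided by its total number of occurrences, the character types are recognized by simple Unicode ranges, and the names and sample texts are hypothetical.

```python
from collections import Counter
from itertools import groupby


def char_type(ch):
    """Classify a character by assumed Unicode ranges."""
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    return "other"


def position_probabilities(documents):
    """Estimate head- and tail-position probabilities for every 1-gram."""
    total, head, tail = Counter(), Counter(), Counter()
    for text in documents:
        # groupby splits the text at every character type boundary.
        for _, group in groupby(text, key=char_type):
            run = "".join(group)
            total.update(run)    # every occurrence of each 1-gram
            head[run[0]] += 1    # 1-gram directly after a boundary
            tail[run[-1]] += 1   # 1-gram directly before a boundary
    head_prob = {ch: head[ch] / total[ch] for ch in total}
    tail_prob = {ch: tail[ch] / total[ch] for ch in total}
    return head_prob, tail_prob


# Hypothetical registered documents.
head_prob, tail_prob = position_probabilities(["検索システム", "文書検索"])
print(head_prob["検"], tail_prob["索"])
```

In this toy data 検 starts one of the two runs it appears in, while 索 ends both of its runs, so 索 is a stronger word-end signal than 検 is a word-start signal.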
