Search method search apparatus, and recording medium...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

Other Related Categories

C707S793000, C707S793000

Type

Reexamination Certificate

Status

active

Patent number

06338061

Description

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a method and apparatus for searching a text for a specific character string and a recording medium recording a search program and, more particularly, to a symbol string search method and apparatus useful for an approximate matching (approximate search) technique of searching for similar symbol strings and a recording medium recording a symbol string search program.
2. Description of the Prior Art
Techniques of searching for a specific target symbol string from a large amount of one-dimensional nonstructural symbol string to be searched are used in various fields such as document search and gene sequence search. In practical fields, it is being required not only to increase the speed of search but also to develop an approximate matching technique which searches for a symbol string having certain mismatch as well as a perfectly matched symbol string. For example, on the basis of the principle of finite automaton, Japanese Unexamined Patent Publication No. 2-76072 entitled “Variable Length Character String Detection Apparatus”, Japanese Unexamined Patent Publication No. 3-131969 entitled “Symbol String Search Method and Apparatus”, and Japanese Unexamined Patent Publication No. 8-241335 entitled “Fuzzy Character String Search Method and System using Fuzzy Nondeterministic Automaton” are proposed. These techniques will be collectively referred to as the prior art 1 hereinafter.
Separately, a technique which performs the approximate matching at a high speed by using sequential operations combining a simple bit shift operation and bit logic operation is proposed (Wu et. al., Fast String Matching Allowing Errors, Communications of the ACM, Vol. 35, No. 10, pp. 88-91). This technique will be referred to as the prior art 2 hereinafter. This prior art 2 will be briefly described below.
Assume that an m-bit target symbol string in an n-bit symbol string to be searched is to be searched for by allowing symbol insertion/deletion/replacement of k times or less. Insertion is mixing of an unnecessary symbol; e.g., ‘abccde’ has one symbol insertion with respect to ‘abcde’. Deletion is missing of a symbol; e.g., ‘abde’ has one symbol deletion with respect to ‘abcde’. Replacement is replacement of a certain symbol with another; e.g., ‘abade’ has one symbol replacement with respect to ‘abcde’.
Let the symbol string to be searched be T=t[
1
]t[
2
]t[
3
]. . . , t[n] and the target symbol string be P=p[
1
]p[
2
]. . . , p[m]. Search is executed by sequentially calculating a two-dimensional array of matching state bit strings R[i,j] (0≦i≦n, 0≦j≦k) each composed of m bits in increasing orders of i and J. R[i,j] indicates “whether matching is successful in the position of the ith symbol in the symbol string to be searched by allowing insertion/deletion/replacement of j times or less”. If the xth bit from the start bit of this R[i,j] is 1, this indicates that matching is successful in the position of the ith symbol in the symbol string to be searched by allowing mismatch of j times or less from the start symbol to the xth symbol in the target symbol string. Note that the left bit is the start bit in R[i,j]. For example, if R[
10
,
1
]=‘11001’ during the processing, this indicates that matching is successful in the position of the 10th symbol in the symbol string to be searched by allowing mismatch of one time or less to the first, second, and fifth symbols in the target symbol string. The length of the target symbol string is m. Therefore, if the mth bit of R[i,j] is finally 1, this means that the target symbol string is detected in the position of the ith symbol in the symbol string to be searched by allowing mismatch of j times or less. Recurrence formulas of R[i,j] are
R[i,j]=B
(
j
) (for
i=
0)  (1-1)
R[i,j]=Sft
(
R[i−
1,0]) AND
Msk
(
t[i
]):(for
i>
0,
j=
0)  (1-2)
R[i,j]=
(
Sft
(
R[i−
1,
j
]) AND
Msk
(
t[i]
)) OR
R[i−j,j−
1] OR
Sft
(
R[i−
1,
j
−1) OR
Sft
(
R[i,j−
1]) (for
i>
0,
j>
0)  (1-3)
B(j) is an m-bit string in which the first j bits are 1s and other bits are 0s, respectively. Sft(R) is an operation of shifting a bit string R to the right by one bit (1 is set in an empty bit). For example, Sft(‘10010’)=‘11001’ for a five-bit string ‘10010’.
Msk(c) is an m-bit string in which a position where a symbol c exists in a target symbol string is 1 and other bits are 0s, respectively. Msk(c) and a target symbol string are matched from the start bit (leftmost bit). If no symbol c exists in a target symbol string, every bit in Msk(c) is 0. If the symbol c appears a plurality of times in a target symbol string, two or more bits in Msk(c) are 1s, respectively. For example, Msk(‘a’)=‘10100’, Msk(‘b’)=‘01010’, Msk(‘c’)=‘00001’, and Msk(‘d’)=‘00000’ for a target symbol string ‘ababc’. “AND” is a logical product of bits, and “OR” is a logical sum of bits.
The prior art 2 can perform search allowing k or less errors (insertion/deletion/replacement) by scanning a symbol string to be searched once in accordance with the above recurrence formulas. The total calculation amount is the sum of a calculation amount o(nk) of R[i,j] and the initialization time for forming Msk(c). The superior characteristic feature of this prior art 2 is that the processing time is reduced in proportion to the length of symbol string to be searched and to the number of allowable mismatched symbols while fuzzy collation is allowed. Also, high-speed processing is possible because the operation amount of recurrence formulas is very small.
One technique of increasing the speed of search of a large amount of symbol strings to be searched is a method using transposition information of a symbol string. This will be referred to as the prior art 3. In this method, a symbol string to be searched itself is not an object of search, and information (transposition information) indicating the position of each symbol in a symbol string to be searched is used in search. During search, only pieces of transposition information corresponding to symbols contained in a target symbol string are acquired from all pieces of transposition information. Symbol string search is performed by checking the position consistency between these acquired pieces of transposition information.
Generally, when the number of types of symbols forming a symbol string to be searched is large and the number of types of symbols contained in a target symbol string is small, the amount of acquired pieces of transposition information is very small compared to that of the original symbol string to be searched. For example, when a very long symbol string composed of several thousands of different symbols such as a Japanese text is searched for a word of a few characters, the amount of acquired pieces of transposition information is far smaller than that of the original document. This yields the following advantages in search using transposition information.
First, compared to search using a whole symbol string to be searched, the amount of data to be acquired for search is small, so the data transfer amount reduces. This increases the processing speed. Especially when a symbol string to be searched has an enormous amount and hence must be placed in a low-speed secondary storage device, a large increase in the processing speed can be expected. Second, search based upon perfect match between symbol strings can be performed by looking up each acquired transposition information at most once. Therefore, the search can be executed within a shorter time period than when a whole symbol string to be searched is scanned.
The problem of prior arts 1 and 2 described above is that th

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Search method search apparatus, and recording medium... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Search method search apparatus, and recording medium..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Search method search apparatus, and recording medium... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2873828

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.