Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-04-27
2002-12-17
Alam, Hosain T. (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06496817
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to subsequence matching method in time-series databases, and particularly to such a method which improves performance by using duality in constructing windows, in time-series databases.
BACKGROUND OF THE INVENTION
First, we define some terminology needed in further description of the present invention.
A “sequence” of length n is an array of n entries. “Time-series data” are sequences of real numbers, representing values at specific time points. A “time-series database” is the database that stores time-series data.
The time-series data stored in a time-series database are called “data sequences.” The sequences given by a user are called “query sequences.” Finding data sequences similar to the query sequence from the database is called “similar sequence matching.”
In the above definition, two sequences are said to be “similar” if the distance between them is less than or equal to the user specified “tolerance” &egr;. We define that two sequences X and Y are in “&egr;-match” if the distance between X and Y is less than or equal to &egr;. We define “n-dimensional distance computation” as the operation that computes the distance between two sequences of length n.
In the above distance computation, the present invention is independent of the specific distance computation method. For easy understanding the present invention, however, we describe it based on the Euclidean distance computation method. Given two sequences X={x
0
, x
1
, . . . , X
n−1
} and Y={y
0
, Y
1
, . . . , y
n−1
} of the same length n, the Euclidean distance between X and Y is defined as
∑
i
=
0
n
-
1
⁢
⁢
(
x
i
-
y
i
)
2
.
If a sequence S includes a sequence A(i.e., A is a part of S), A is called a “subsequence” of S. Similar sequence matching can be classified into the following two categories:
Whole matching: Given N data sequences S
1
, S
2
, . . . S
N
, a query sequence Q, and the tolerance &egr;, we find those data sequences that are in &egr;-match with Q. Here, the data and query sequences must have the same length.
Subsequence matching: Given N data sequences S
1
, S
2
, . . . , S
N
of varying lengths, a query sequence Q, and the tolerance &egr;, we find all the sequences S
i
, one or more subsequences of which are in &egr;-match with Q, and the offsets in Si of those subsequences.
A “Window” is a unit of dividing sequences. According to the dividing method, windows are classified into a sliding window and a disjoint window. The windows starting from every possible offset in a sequence are called “sliding windows.”
FIG. 1
a
is an example drawing of a method that divides a sequence into sliding windows of size
4
. In
FIG. 1
a,
reference no.
201
is a sequence, and reference no.
202
are sliding windows of size
4
. The windows starting from multiple offsets of window size are called “disjoint windows.”
FIG. 1
b
is an example drawing of a method that divides a sequence into disjoint windows of size
4
. In
FIG. 1
b,
reference no.
203
is a sequence, and reference no.
204
are disjoint windows of size
4
.
In subsequence matching, “false dismissals” are the subsequences that are in &egr;-match with the given query sequence but missed by errors, and “false alarms” are the subsequences that are not in &egr;-match with the query sequence but selected as similar subsequences. False dismissals and false alarms should not occur in the above similar sequence matching.
The function used to extract f, which is less than n, features from a sequence of length n is called the “feature extraction function.” To use a feature extraction function in similar sequence matching, the function should guarantee no false dismissals. To guarantee no false dismissals, the feature extraction function is satisfied some conditions that are presented in Agrawal, R., Faloutsos, C., and Swami, A., “Efficient Similarity Search in Sequence Databases,” In Proc. the 4th Int'l Conf. on Foundations of Data Organization and Algorithms, Chicago, Ill., pp. 69-84, October 1993.[Reference 1] and Faloutsos, C., Ranganathan, M., and Manolopoulos, Y., “Fast Subseqeunce Matching in Time-Series Databases,” In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Minneapolis, Minn., pp. 419-429, May 1994.[Reference 2]
We also define some notation to be needed in further description of the present invention.
Len(S) is the length of sequence S. S[k] is the k-th entry of the sequence S, S[i:j] is the subsequence that is including entries from the i-th one to j-th, and S[i:j] can be represented as S[i:k]S[k+1:j]. Next, when S is divided into disjoint windows, si represents the i-th disjoint window of sequence S. Lastly, &ohgr; is the length of the sliding or disjoint window.
Recently, the large amount of time-series data are occurred in various areas such as stock prices, growth rates of companies, exchange rates, biomedical measurements, and weather data. And, owing to faster computing speed and larger storage devices, there have been a number of efforts to utilize the large amount of time-series data. Especially, similar sequence matching in time-series data has become an importance research topic in data mining that is one of new database applications.
In the below description, we explain the previous similar sequence matching methods in time-series databases.
In the previous method of [Reference 1], authors have introduced a solution for the whole matching problem. The outline of the solution is as follows.
First, each data sequence of length n is transformed into an f-dimensional point by using the feature extraction function, and this point is indexed using the f-dimensional index. Only a small number of features are extracted because of the difficulty in storing high-dimensional sequences in the multidimensional index due to dimensionality problem in multidimensional indexes (called “dimensionality curse”). Next, a query sequence is similarly transformed to an f-dimensional point, and a range query constructed using the point and the given tolerance &egr;. Then, the multidimensional index is searched to evaluate the query, a candidate set constructed consisting of the feature points that are in &egr;-match with the query sequence. This method guarantees no false dismissal, but may cause false alarms because it uses only f features instead of n.
Thus, for each candidate sequence, the actual data sequence is accessed from the disk; the distance from the query sequence computed; and the candidate is discarded if it is a false alarm. This last step, which eliminates false alarms, is called the “post-processing step.”
And, in the previous method of [Reference 2], authors have proposed the subsequence matching method as a generalization of the whole matching method of [Reference 1]. In the present invention, we simply call this method “FRM” by taking authors' initials. The outline of the method is as follows.
In subsequence matching, subsequences similar to the query sequence can be found anywhere in a data sequence. In FRM, to find all possible subsequences, they use a sliding window of size &ohgr; starting from every possible offset in the data sequence. Then, they divide a query sequence into disjoint windows of size &ohgr; and retrieve similar subsequences by using those disjoint windows. They transform each sliding window to a point in a lower dimensional space. Since too many points are generated to be stored individually in an index, they construct minimum bounding rectangles(MBRs) that contain hundreds or thousands of points, using a heuristic method, and then, store those MBRs into a multidimensional index. Lastly, they try to do the subsequence matching on query sequences of various lengths.
For subsequence matching on query sequences of various lengths, FRM presents and uses the following two theorems.
Theorem 1
When two sequences S and Q of the same length are divided into p disjoint windows si and q
Moon Yang-Sae
Whang Kyu-Young
Alam Hosain T.
Alam Shahid
Bachman & LaPointe P.C.
Korea Advanced Institute of Science & Technology
LandOfFree
Subsequence matching method using duality in constructing... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Subsequence matching method using duality in constructing..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Subsequence matching method using duality in constructing... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2997212