Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-11-22
2003-10-21
Robinson, Greta (Department: 2177)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06636849
ABSTRACT:
FIELD OF THE INVENTION
This invention generally relates to information processing and management of data. Particularly, the invention relates to efficient searches of data sets using metric spaces, multigrid indexes, and B-grid trees, and, more particularly, to find “hits” for a query of large data sets in a metric space.
BACKGROUND OF THE INVENTION
Advances in computing technologies significantly impact our lives. From simple scheduling to advanced whole genomic analysis, computing and computing applications are at the center of these activities. Among the many facets of computing, data search and management are the most employed and, what some may consider, to be the most critical functions performed. Data search provides quick access to desired data that may be buried in large data sets. As such, search features are extensively employed in data intensive processing applications such as DNA or protein sequence comparisons. A key factor of data searches is efficiency, both time efficiency and computational efficiency.
The search to find an entry in a data set of size N entries, having no specific order, may be extremely time intensive depending on the search system or method that is employed. For example, a simple search on a data set of a million entries without any specific order will generally require one million comparisons. That is, each data element of the data set is processed during the search to provide hits that contain the desired entry. A hit to a query is generally defined as those points that meet the specified neighborhood criteria in the metric space. In the context of efficiency, the search complexity required for a search may be quantified by the order function O( ). Accordingly, the search complexity for this example can be characterized as O(N). Advances in computer/information science, however, have reduced the search complexity to O(log N). Using these advances, the one million comparisons of the above example can be reduced to 20 comparisons in a binary search.
The reduction in search complexity is attributed to the performance of pre-processing. Data pre-processing involves the creation of data indexes prior to search such that the search is performed on the created indexes instead of each data element of the data set. The data can be pre-processed according to a “sort” process, or by developing balanced multiway trees (B-trees). Although the pre-processing of data contributes to an one-time increase in computational complexity of O(NlogN) (i.e. the number of steps required in preparing the data before performing the search), the search complexity is significantly decreased. It is reduced from O(N) to O(log N).
Algorithms to perform the aforementioned pre-processing are known in the art and are described by M. A. Weiss, Data Strucutre and Algorithm Analysis In C++ (Addison-Wesely Publishing Co., 2 edition, 1998); R Sedgewick, Algorithms in C++, (Addison-Wesley Publishing Co., 1992); R. F. Gilberg, B. A. Forouzan, Data Structures, A Pseudocode Approach with C (PWS Publishing Co. 1998); and R. Bayer and E. McCreight, Organization and Maintenance of Large Ordered Indexes, Acta Informatica, Vol. 1, No. 3, pp. 173-189, 1972 which are herein incorporated by reference. Current pre-processing algorithms (e.g. sorting, B-Tree) may be performed in one dimensional data space or in multidimensional space.
As an alternative to sorting in a metric space is to perform k-clustering of the data. A k-clustering of a set N is a partition of the set into k subsets which optimizes some clustering criterion. K-clustering aims to solve the problem that given a weighted graph G=(X, d) on N vertices, where d( . , . ) is the weight function, partition X into k sets S
1
, . . . , S
k
, such that the value of:
&Sgr;
i
&Sgr;
{u, v in Si}
d
(
u, v
)
is minimized. Some of the algorithms developed are described by S. Sahni, T. Gonzalez, P-Complete Approximation Problems, JACM (23), 1976, pp555-566; N. Garg, V. V. Vazirani, M. Yannakakis, Approximate Max-Flow Min-(Multi)-Cut Theorems and Their Applications. SICOMP (25), 1995, pp235-251; W. F. de la Vega, C. Kenyon, A Randomized Approximation Scheme for Metric MAX_CUT. 1998. FOCS'98, pp468-471; and Piotr Indyk, A Sublinear Time Approximation Scheme For Clustering In Metric Spaces. 1999. FOCS'99, pp154-159 which are herein incorporated by reference. However, k-clustering has not been used to build searchable trees in metric space. As such, a particularly useful data preprocessing method is not being exploited to its fullest potential.
Current sorting algorithms, such as, binary sort, merge sort, or quick sort, arranges the data according to a particular structure (i.e. tree, linked list, or array). By themselves, these algorithms generally provide a fast way of locating an exact item in a data set, however, they do not provide a method of locating “approximate” or “similar” elements to a search query. Several solutions have been developed to accommodate for this drawback. These solutions include hardware solutions to increase the computational efficiency and thereby reduce the search time, and software solutions that employ additional algorithms to the sorting algorithms to quickly “screen” out dissimilar elements. These solutions are described in S. Rosenthan. The PF474 Chip, A Coprocessor for String Comparison. Byte, 1984; U.S. Pat. No. 4,490,811, entitled “String Comparator Device System Circuit and Method”; and Patrick A. V. Hall and Geoff R. Dowling, Approximate String Matching, ACM Computing Surveys, Vol. 12, No. 4, pp381-402, December, 1980; U.S. Pat. No. 5,978,797, entitled “Multistage Intelligent String Comparison Method” which are herein incorporated by reference. However, these solutions can be error prone as certain relevant data elements may be overlooked.
From the foregoing, it can be appreciated that there exists a need for systems and methods to search data sets using metric space that provide exact and more relevant “approximate” search results such that search complexity is significantly reduced. By having these systems and methods the drawbacks of the prior art may be overcome.
SUMMARY OF THE INVENTION
The present invention provides systems and methods to search data sets using a metric-space theoretical approach to solve “approximate” or “inexact” searching problems. In an illustrative implementation, traditional “sorting” processes performed in 1-dimensional space are extended to general metric spaces. In this implementation, a multigrid search tree is generated by calculating distance functions of the data in metric space for the data set desired to be searched. The multigrid tree provides an index for each data element of the data set, such that a submitted search query is compared against the multigrid tree to find exact or “hit” matches, where a “hit” may be defined as an inexact, but close enough match based on some criteria. In an application of the method to genetic sequence search problem, outputs from BLAST and FASTA may be used to calculate local distances between data elements, which can be used to create local multigrid trees to provide very efficient search algorithms to process submitted search queries. In this implementation, the multigrid tree may comprise a balanced multigrid tree (B-grid) such that the data elements of the data set being searched are partitioned in equal size grids ensuring more homogeneous search time thereby reducing computational complexity.
Other aspects of the present invention are described below.
REFERENCES:
patent: 3881921 (1975-05-01), Frank
patent: 4136395 (1979-01-01), Kolpek et al.
patent: 4490811 (1984-12-01), Yilanilos et al.
patent: 4689768 (1987-08-01), Heard et al.
patent: 5276741 (1994-01-01), Aragon
patent: 5701256 (1997-12-01), Marr et al.
patent: 5765180 (1998-06-01), Travis
patent: 5978797 (1999-11-01), Yilanilos
patent: 6017282 (2000-01-01), Stefonsky
patent: 6446068 (2002-09-01), Kortge
patent: 6470287 (2002-10-01), Smartt
Daryl Lawrence Bonhaus, “An Upwind Multigrid Method for Solving Viscous Flows on
Tang Yuanhua Tom
Yang Yonghong Grace
Black Linh
GenMetrics, Inc.
Woodcock & Washburn LLP
LandOfFree
Data search employing metric spaces, multigrid indexes, and... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Data search employing metric spaces, multigrid indexes, and..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Data search employing metric spaces, multigrid indexes, and... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3156274