Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2002-01-15
2003-03-04
Corrielus, Jean M. (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06529916
ABSTRACT:
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
This invention was made with U.S. Government support under contract no. NCC5-305 awarded by the National Aeronautic and Space Administration (NASA). The U.S. Government may have certain rights in this invention.
1. Field of the Invention
The present invention relates to database systems and, more particularly, to constructing, maintaining, and utilizing a multidimensional indexing structure to answer linear optimization queries to a database containing records with numerical attributes.
2. Background of the Invention
A linear optimization query is a special type of database query which returns database records whose weighted linear combination of numerical attributes are ranked as the top N records among the entire database, either maximally or minimally. Equivalently, a linear optimization query may be posed as the problem of finding data records whose weighted values are above or below a threshold. Out of the returned results, the top N records are then selected. While such a query may request the maximal or minimal N records based on a specific linear optimization criteria, the query processing algorithm does not require separate procedures for the two optimization conditions. This is because by simply reversing signs of weights in the linear equation, a maximization problem is translated into a minimization one, and vice versa. The present invention processes optimization queries in a similar way by translating them into maximization queries first.
Depending on application scenarios, weights (coefficients) of the linear criterion may or may not be known at the data ingestion time. Were they known during data ingestion time and remain constant, the weighted linear combination could be pre-computed and stored to answer future queries. In many cases, the coefficients are dynamic and the same set of data records are shared by different applications. Pre-computing for all applications thus may not be feasible. An emphasis of the present invention is on the dynamic cases where the coefficients are unknown and determined at the query moment. A goal of this invention is to index the records in an efficient way such that when a new query is issued, only a fraction of records in the database need to be evaluated to satisfy the query. Although its query response time may not be as fast as the response time of a static query, our invention narrows the performance gap between the two.
The linear optimization query is a general case of linearly weighted ranking, which is vastly applied in information retrieval and summarization. Instead of presenting a long table with all surveyed parameters of every record, useful information is often summarized by taking a linearly weighted combination of those parameters. The top N records are then listed and discussed. Examples of such information summarization can be found in many places. For example, every year, the news magazine US News and World Report conducts studies of college education and ranks the school performance by a linear weighting of numerical factors such as academic reputation (25%), faculty resources (20%), retention rate (20%), student selectivity (15%), financial resources (10%), alumni giving (5%), and graduation rate performance (5%). Top-ranking national and regional colleges are listed. One can find many similar examples such as cities with the highest cost of living, towns with the highest crime rate, the five hundred largest global companies, and so on. While all these examples are based on linearly weighted ranking, the coefficients assigned to the linear criterion are mostly static. The allocation of linear weighting may reflect the opinion of information collectors such as news agencies or consumer opinion groups. However, information subscribers like magazine readers do not actively participate in the information summarization process. We argue information subscribers should be active participants of the information retrieval and summarization process. In the above examples, linear weighting and record ranking can be performed at the request of readers and subscribers, perhaps through a personalized web page. College applicants should be able to choose a set of coefficients that reflect to their own valuation of a school. City residents should decide what cost of living index appears in the ranking criterion by their own life styles. One formula does not apply to all people.
Dynamic information summarization in the form of adjusting weights of the linear criterion has been practiced in many business and scientific applications. For example, mortgage companies and banks develop linear models to estimate consumers' credit scores, probabilities of mortgage repayment, default risk, etc. These models are often built on a common set of parameters such as loan-to-value ratio, length of credit history, revolving credit, credit utilization, debt burden and credit behavior. From this set of parameters, models for financial products may be developed. In the area of public health and environmental science, scientists extract parameters from satellite images, digital elevation maps, and weather data to model disease outbreak probabilities, rodent population, air pollution, etc. As an example, a group of researchers from Johns Hopkins University, A. Das, S. R. Lele, G. E. Glass, T. Shields, and J. A. Patz, developed a model of the distribution of the population of Lyme disease vectors in Maryland from Geographical Information System (GIS) digital images (See “Spatial modeling of vector abundance using generalized linear mixed models: application to Lyme disease,” submitted to Biometrics for publication). Their models are frequently revised by applying different statistical analysis techniques and training data sets. In addition, scientists like to adjust their models to ask ‘what’ if questions. A speedy and accurate response from the database would greatly assist model development and verification.
The study of multidimensional indexing structures has been a major subject in database research. Indexing structures have been developed to answer different types of queries, including:
1. find record(s) with specified values of the indexed columns (exact query);
2. find record(s) that are within [a
1
. . . a
2
], [b
1
. . . b
2
], . . . ,[z
1
. . . z
2
] where a, b, and z represent different dimensions (range query);
3. find the K most similar records to a user-specified template or example (K-nearest neighbor query); and
4. find the top N records to a user-specified linear optimization criterion (linear optimization query).
Substantial work can be found to address the previous three types of queries, while much less is available in prior art about the fourth one. In prior art, linear optimization queries are often referred to the problem of finding a single data entry which maximizes or minimizes the given linear criterion, with the assumption that the constraints are given in the form of linear inequalities. In such cases, the feasible solution space is the intersection of half-spaces defined by those linear inequalities. When both the query and constraints are given at query time, the query processing problem is a linear programming problem Solutions such as the simplex method and the ellipsoid method were well studied and references can be found in most linear programming textbooks. In addition, recent discovery in randomized algorithms suggested possible ways to reduce expected query response time. Seidel reported the expected time is proportional to the number of constraints (R. Seidel, “Linear programming and convex hulls made easy,” Proceedings of the 6th ACM Symposium on Computational Geometry, pp. 211-215, 1990). When the constraints are given ahead of time to enable the preprocessing of records, query response can be made faster by trading off storage space. Matousek reported a data structure that is based on a simplicial partition tree, while parametric search is applied to prune the partition tree (J. Matousek and O. Schwa
Bergman Lawrence David
Castelli Vittorio
Chang Yuan-Chi
Li Chung-Sheng
Smith John Richard
Corrielus Jean M.
Ryan & Mason & Lewis, LLP
Taisa Tanekazu
LandOfFree
Multidimensional indexing structure for use with linear... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Multidimensional indexing structure for use with linear..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Multidimensional indexing structure for use with linear... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3021141