Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1999-03-24
2002-04-30
Alam, Hosain T. (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06381616
ABSTRACT:
TECHNICAL FIELD
This invention relates in general to database management systems performed by computers. More particularly, this invention relates to efficient processing of database queries involving selection against heterogeneous data.
BACKGROUND OF THE INVENTION
Heterogeneous data access is becoming an important feature in database systems because it facilitates the integration of information from a variety of data sources. The performance of queries involving external data sources is greatly affected by the cost of transporting data over the network. This cost can be reduced by pushing down subqueries, particularly selections and projections, to the external data source. However, if the external data source uses a different collating sequence, predicates consisting of comparison of a column (also known as a field) with a string constant cannot be pushed down unchanged. A “predicate” is a qualifier or condition for a search.
A collating sequence is essentially the computer equivalent of alphabetical order. One typical collating sequence is the order of characters based on their binary values in ASCII (American Standard Code for Information Interchange) and another typical collating sequence is the order of characters based on their binary values in EBCDIC (Extended Binary-Coded Decimal (BCD) Interchange Code). Although these two sequences have similar characters or collating elements, their collating sequences (sort orders) are significantly different, as shown in FIG.
1
. For example, both collating sequences contain all the capital and lower case letters of the (English) alphabet, but in the ASCII capital letters all sort before lower case letters but in EBCDIC this is reversed. Also, in ASCII digits sort before letters but not in EBCDIC.
A collating sequence is separate from a character (code) set: two countries or languages may use the same character set but have different collating sequences. A list of different language identifiers having similar character sets but different collating sequences is shown in Table 1. The purpose of a collating sequence is to define a total order on character strings. This is done by means of weights assigned to collating elements, which most often correspond to a single character but may consist of a sequence of characters. For example, in Spanish, the sequence “11” is treated as a single collating element. In the simplest case, each character is assigned a unique weight. To determine the ordering of two strings, the weights of the characters are compared proceeding from left to right. If the two strings compare equal on the first level weights, they are compared on second level weights (if any), then on third level weights, and so on.
TABLE 1
Language Name
Language String
Czech
“czech”
Danish
“danish”
Dutch (Belgian)
“dutch-belgian”
Dutch (default)
“dutch”
English (Australian)
“australian”
English (Canadian)
“canadian”
English (default)
“english”
English (New Zealand)
“english-nz”
English (UK)
“english-uk”
English (USA)
“american-english”
Finnish
“finnish”
French (Belgian)
“french-belgian”
French (Canadian)
“french-canadian”
French (default)
“french”
French (Swiss)
“french-swiss”
German (Austrian)
“german-austrian”
German (default)
“german”
German (Swiss)
“swiss”
Hungarian
“hungarian”
Icelandic
“icelandic”
Italian (default)
“italian”
Italian (Swiss)
“italian-swiss”
Norwegian (Bokmal)
“norwegian-bokmal”
Norwegian (default)
“norwegian”
Norwegian (Nynorsk)
“norwegian-nynorsk”
Polish
“polish”
Portuguese (Brazilian)
“portuguese-brazilian”
Portuguese (default)
“portuguese”
Spanish (default)
“spanish”
Spanish (Mexican)
“spanish-mexican”
Spanish (Modern)
“spanish-modern”
Swedish
“swedish”
Many database systems, including SQL Server, support queries over heterogeneous data sources. If the system detects that some tables referenced in a query are managed by external data sources, the system decomposes the query into a set of single-source queries, submits them, and performs any additional processing needed to integrate the data returned from the local queries. External data sources differ greatly in their query capabilities and interfaces, ranging from simple one-table sources with no query capability to full-fledged SQL database systems. Many aspects of heterogeneity in multidatabase systems, such as data model differences, naming differences, format differences, structure differences, and conflicting data, have been studied and are described in Bright, M. W. et al., “A taxonomy and current issues in multidatabase systems”, IEEE Computer, March 1992, pages 50-60.
Many data sources have selection (i.e., filtering) and projection capability. A query optimizer decomposes the query and generates an overall, efficient plan. Often this means delegating as much as possible of the processing to the external data sources, taking into account their processing capabilities. In particular, pushing down selections and projections to an external data source is usually desirable because it reduces the amount of data shipped over the network. However, if the external data source uses a different collating sequence, predicates- involving string comparisons cannot be pushed down unchanged. In practice, systems do not push down such predicates, thereby resulting in a loss of efficiency.
For example, consider a SQL Server running on a Windows NT system that uses the (binary) ASCII collating sequence (WINDOWS NT® is a registered trademark of Microsoft Corp.). Among other things, it provides access to a course table that is stored in a DB2 database running on a mainframe system that uses the (binary) EBCDIC collating sequence. (DB2 is a trademark of International Business Machines Corp.) Suppose the SQL Server receives the query “select CourseNo from course where CourseNo is between ‘CH020’ and ‘CS499’”, and the table has 1000 records and the query will select 100 of the records. If the selection predicate can be pushed down and evaluated at the source, the data communication cost would be reduced by 90% because only the data satisfying the selection predicate would need to be transferred to SQL server.
However, the two systems use different collating sequences, and so the query cannot just be pushed unchanged. Doing so would produce the wrong result. As shown in
FIG. 1
, ‘E’ collates after ‘0’ in the (binary) ASCII collating sequence but before ‘0’ in the (binary) EBCDIC collating sequence. Thus, for example, courses with prefix ‘CHE’ will be selected if the query is evaluated by SQL Server, but not if it is evaluated by DB2. Typical systems handle queries of this type by retrieving all course rows and evaluating the predicate at the query site. There is a need for a mechanism to convert comparisons between a string valued column (field) and a constant to correctly account for differences in collating sequences.
Many atomic SQL predicates are independent of the collating sequence; this is true for operators such as IS NULL and IS NOT NULL. Similarly, comparisons of numeric data types (integer, float, decimal) and dates are independent of the collating sequence. BETWEEN and NOT BETWEEN are simply shorthand notation for two comparisons. IN and NOT IN followed by a set of string constants can be transformed into a set of predicates with equals (=) and not equal (<>) operators. Therefore, the predicates requiring conversion are mainly string comparisons using one of the operators <, ≦, >, and ≧. Predicates with operators=and <>may require conversion when some characters have the same weight in the collating sequence.
Although the art of database query processing is well developed, there remain some problems inherent in this technology, particularly sending a query between databases having different collating sequences. Therefore, a need exists for compensating for differences in collating sequences using predicate conversion that overcomes the drawbacks of the prior art.
SUMMARY OF THE INVENTION
The present invention is directed to a method for use in a heterogeneous database environment including a first data source using a fi
Larson Gustav Per-Åke
Zhang Weiye
Alam Hosain T.
Hwang Joon Hwan
Microsoft Corporation
Woodcock & Washburn LLP
LandOfFree
System and method for speeding up heterogeneous data access... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with System and method for speeding up heterogeneous data access..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for speeding up heterogeneous data access... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2833703