Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-01-21
2004-02-17
Vu, Kim (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C707S793000, C707S793000, C717S136000, C717S137000, C717S155000, C717S156000
Reexamination Certificate
active
06694310
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates generally to transformation of data read from a data source so that it is in the proper form to be stored in a data sink and more particularly to transformation of data that is organized in one manner in the data source so that it can be stored in a data sink where it is organized in another manner.
2. Description of Related Art
One of the most valuable assets that any organization has is the data that it accumulates as a result of its operations. Nowadays, an organization stores its data in databases maintained by database systems. An organization of any size at all will have a number of different databases. The reason for this is that each database has an organization peculiar to itself and different uses of data require different organizations of the data. For example, a database which a company uses when dealing with its customers is termed an operational data base. Operational databases are designed to perform transactions: that is, they record when an order is received from a customer, when the order is shipped, when the company is paid, and so on. The operational database is thus organized in a fashion which permits such transactions to be performed as rapidly and reliably as possible.
Another use of databases in an organization is to store data for analysis. The ultimate source of much of this data is the operational databases, but if the data is to be useful for analysis, it must be organized in the database in a fashion which permits the analyst to easily find out what he or she wants to know. One class of such analytical databases is termed data warehouses; the data warehouse contains data from all of the organization's operations and is organized in a fashion which permits global analysis. Another class of such analytical databases is termed data marts; a data mart contains information that is useful for a particular functional area of the organization. The data in a data mart may come from the data warehouse or the operational data or other sources, but it is organized in the way that is most useful for the functional area that uses the data.
As is apparent from the foregoing, many organizations now have a number of differently-organized databases in which much of the information is the same. It has thus become necessary for these organizations to move data from a source database in which it is organized in one way into one or more sink databases in which it is organized in different ways. Because the data in the source database is organized differently from the data in the sink database, it is heterogeneous to the sink database. In the course of the move, the data must be transformed so that it is in the proper form for the sink database and is therefore no longer heterogeneous to it.
Data may be heterogeneous to a given database in two ways: it may be schematically heterogeneous and it may be semantically heterogeneous. Data is schematically heterogeneous when it is organized according to a different model from the one used in the given database, for example, when it is organized to be quickly accessible by customer account number rather than by an interval of time. Data is semantically heterogeneous when the data values have different semantics and structural representations from the ones used in the given database. For example, one database may represent monetary amounts as integer numbers of pennies, while another may represent them as fixed decimal numbers.
Schematic heterogeneity results in two kinds of conflicts between the source and sink databases: naming conflicts and structural conflicts. Naming conflicts include synonym and homonym problems with names of both tables and columns within tables; structural conflicts result when the structure in which a given piece of information appears in the source database is different from the way in which the given piece of information appears in the sink database. For example, the information which may need to be copied from the source to the sink may be a list of date-stock price pairs for the stock of Oracle Corporation. In the source database, the list may be in a table called “Oracle” that has two columns: one for the dates and the other for the stock prices. In the sink database, the list of oracle stock prices may be part of a stock prices table that includes stock prices of other companies, with columns for the date, the name of the stock, and the stock price. In another sink database, the stock prices table may have columns for the date and for each company, with the price for a given company on a given date being in the company's column in the row in which the given date appears in the date column.
Semantic heterogeneity results when the data values in the source have different representations in the sink. The different representations may simply be different names, for example, “Oracle” represents Oracle Corporation in the source, while “Oracle Corp. represents it in the sink, they may be different measurement scales, for instance prices in U.S. $ in the source and prices in Japanese ¥ in the sink, or decimal stock prices in the source and fractional stock prices in the sink, or different measurements, for example, latest stock closing price in one database versus latest trade price in the other, or different granularity of measurement (for example, monthly salary versus yearly salary), and classic data type conflicts. For example, a serial number may be represented as an integer in one database and as a character string in the other. Even where the representations are the same, there may be size differences. One database may use a 10-character string for the serial number, while another may use a 16-character string.
As will be apparent from the foregoing, data transformation is a necessary and complex part of copying information from one database to another. One way to do data transformation is to write a special program to perform the transformation. The problem with this approach is that the person writing the program must be an expert in the source and sink database systems, in the structures of the of the source and sink databases, in the semantics of the data being transformed, and in the programming language used for the special program. Writing transformation programs is thus strictly a matter for data processing experts, with all of the concomitant expense and delay.
The need for data processing experts was reduced by the development of a data transformation system by Sagent Technology, Inc. This system is described in the paper, Colin J. White, Building a corporate information system: the role of the datamart, which is dated February, 1997 and was available in December, 1999 at the Sagent Technology Web site, www.sagent com.
FIG. 1
gives an overview of this system
101
: System
101
permits a user to copy data stored in data source
111
to data sink
113
while doing the necessary transformation. The user defines the transformation using a set of pre-defined data transforms that are supplied with system
101
. Included in these transforms are a number of source read transforms
115
for reading data from different kinds of data sources
111
, a number of intermediate transforms
117
, which define different transformation operations for dealing with inhomogeneities between the data in source database
111
and the data as it is stored at
120
in data sink
113
, and a number of write transforms
119
for writing data to different kinds of data sinks
113
.
To make the transformation required to copy data from source
111
to sink
113
, the user specifies a data flow plan
107
which is a sequence of the transforms. The first is the source read transform
115
required for the data source. For example, if the data source is a relational data base, source read transform
115
is a simple SELECT query (no WHERE clauses) running on a real or virtual table defined in source database
111
. Then come as many of the intermediate transforms as it appears to the user are required to perform the transformatio
D'silva Anil
Davison Jay W.
Yu Tsae-Feng
Nelson Gordon E.
Oracle International Corporation
Pham Hung
Vu Kim
LandOfFree
Data flow plan optimizer does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Data flow plan optimizer, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Data flow plan optimizer will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3301658