Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-12-16
2002-03-12
Mizrahi, Diane (Department: 2171)
C707S793000
active
06356901
TECHNICAL FIELD
The present invention relates generally to databases and data warehouses, and more particularly to an elegant system and method for integrating and constructing a data warehouse or other data storage area.
BACKGROUND OF THE INVENTION
In recent decades, as the computational and storage capacity of computer systems has improved, business record keeping has for the most part become automated. Thus, businesses have come to rely heavily on large-scale databases of sales, billing, payroll and other data, typically stored on magnetic storage media.
From the perspective of the information technology professional, such databases can conventionally be organized and maintained using a variety of Database Management Systems (“DBMS”). Among such database systems, those adhering to a “relational” model, Relational Database Management Systems (“RDBMS”), are likely the most popular.
A relational database is a collection of data that is organized in related two-dimensional tables of columns and rows. Data in a table can be accessed and manipulated by performing set operations on the tables, such as join, sort, merge, and so on. These operations are typically initiated by way of a user-defined query that is constructed in a query language such as Structured Query Language (SQL). SQL queries consist of high level commands which typically describe the data set to be selected, retrieved or processed.
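The declarative character described above can be seen in a minimal sketch using Python's built-in sqlite3 module; the table and column names here are illustrative only, not drawn from the patent.

```python
import sqlite3

# In-memory relational database with two related two-dimensional tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex')")
conn.execute("INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 75.0), (12, 2, 40.0)")

# A high-level SQL query: it describes the desired result set (a join of
# the two tables) without specifying how the engine should compute it.
rows = conn.execute(
    "SELECT c.name, SUM(o.amount) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
# rows == [('Acme', 325.0), ('Globex', 40.0)]
```

The query names only the data set to be selected; the join and aggregation strategy is left entirely to the database engine.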
While RDBMS databases clearly represent an advance over earlier manual record keeping, from an enterprise-wide perspective, such databases are often not organized or designed with a view toward optimal efficiency. In particular, an enterprise will typically maintain and support several databases and record file stores, each respectively serving the particular needs of sales, billing and marketing departments, for example, or even of particular individuals. The information in these databases will generally overlap somewhat, resulting in duplication of effort to one degree or another. In addition, this decentralized, piecemeal collection of databases may render many applications, particularly applications involving enterprise-wide data relationships, difficult and cost-ineffective to implement.
In an effort to address this problem, more and more businesses have turned to the creation of “data warehouses.” These typically large-scale databases provide a storehouse for data, generally accumulated over years, and generated from a variety of database locations and applications. In the best case, such data warehouses are organized and maintained using a DBMS flexible enough to implement all applications of current interest, as well as to anticipate future applications.
Unfortunately, while data warehouses should result in efficiency savings over the long run, the design and creation of such databases is in general a time-consuming, costly project. For example, because data must be integrated from a variety of database locations and applications, multiple data conversions must take place (e.g., of text, spreadsheet or tabular data). Each such conversion might require its own application software.
In addition, even where all source data is in a relational tabular (RDBMS) format, the selection of whole tables of data, perhaps consisting of millions of rows, for export to a data warehouse can be very time consuming and taxing of computational resources. In particular, while query languages allow the specification of data selection using constructs similar to natural English, they do not allow developers to specify the actual procedure used to accomplish data selection. Thus, the procedure which actually implements the data selection may not be suitable for all applications (e.g., data warehouse creation). For example, an SQL selection of a stored table for export to a data warehouse would first result in the creation of the entire table in memory, possibly stretching the system's computational resources to the limit.
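The memory cost of materializing an entire result set, versus consuming it a subset at a time, can be sketched as follows; the chunk size and table contents are arbitrary choices for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(i, float(i)) for i in range(10_000)])

# fetchall() would build the entire result set in memory at once; drawing
# from the cursor in fixed-size chunks keeps only one chunk live at a time.
cur = conn.execute("SELECT id, amount FROM sales")
chunks = 0
total = 0.0
while True:
    batch = cur.fetchmany(1000)   # chunk size is an arbitrary choice
    if not batch:
        break
    chunks += 1
    total += sum(amount for _, amount in batch)
```

Here each iteration processes at most 1,000 rows, rather than all 10,000 rows being created in memory before any processing begins.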
In addition, depending on the desired characteristics of the data warehouse, source data bound for the data warehouse may need to first undergo transformation (e.g. division, combination, creation of new and complex mathematical relationships). Such transformation adds yet another step in the design process, and depending on its complexity, may require the application of programming languages requiring specialized knowledge (e.g., COBOL, C++).
Moreover, once data bound for a data warehouse is selected and optionally transformed, it must be integrated and stored with other data, all conformed to new access constructs consistent across the data warehouse. For example, if two tables, each with a customer Social Security number field, are bound for a data warehouse, one may use an “SSNum” naming convention, while another might use a “Social Security” naming convention to refer to the same data type. Ensuring such consistency may be time consuming, as it may involve: 1) selecting the first table using SQL; 2) changing field names; 3) storing the first table; 4) selecting the second table using SQL, and so on.
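The naming-convention problem above can be made concrete with a small sqlite3 sketch; the table names, the "ssn" destination column, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two source tables referring to the same field under different names.
conn.execute("CREATE TABLE billing (SSNum TEXT, balance REAL)")
conn.execute('CREATE TABLE marketing ("Social Security" TEXT, segment TEXT)')
conn.execute("INSERT INTO billing VALUES ('111-22-3333', 12.5)")
conn.execute("INSERT INTO marketing VALUES ('111-22-3333', 'retail')")

# The destination table imposes one consistent access construct ("ssn").
conn.execute("CREATE TABLE warehouse (ssn TEXT, balance REAL, segment TEXT)")
conn.execute(
    "INSERT INTO warehouse "
    "SELECT b.SSNum, b.balance, m.segment "
    'FROM billing b JOIN marketing m ON m."Social Security" = b.SSNum'
)
row = conn.execute("SELECT * FROM warehouse").fetchone()
# row == ('111-22-3333', 12.5, 'retail')
```

Each additional source table means another select-rename-store pass of this kind, which is the repetitive effort the background describes.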
In short, what is needed is a cost effective, time efficient system and method for integrating and constructing a data warehouse or other data storage area in accordance with desired design characteristics.
SUMMARY OF THE INVENTION
Briefly, the present invention is directed toward technology that provides users with powerful tools necessary to manage and exploit data. In particular, the present invention is directed to a single “data pump” application environment, which allows developers to specify the import (selection), transformation and export of data to a desired destination location, all on a streaming, contiguous basis. As buffering allows the synchronization of the import and export steps, source data can be fetched, processed and exported a subset at a time, without the creation of whole sets of source data in memory. Thus, computational resources are spared.
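The import-transform-export loop with a fixed-size buffer can be sketched as follows. This is a minimal illustration of the streaming idea, not the patented implementation; the buffer size, table names, and cents-to-dollars transformation are assumptions.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE src (id INTEGER, price_cents INTEGER)")
source.executemany("INSERT INTO src VALUES (?, ?)",
                   [(i, i * 100) for i in range(5000)])

dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE dst (id INTEGER, price_dollars REAL)")

BUFFER_ROWS = 500  # only this many source rows are held in memory at once

cur = source.execute("SELECT id, price_cents FROM src")
while True:
    buffer = cur.fetchmany(BUFFER_ROWS)                          # import a subset
    if not buffer:
        break
    transformed = [(i, cents / 100.0) for i, cents in buffer]    # transform
    dest.executemany("INSERT INTO dst VALUES (?, ?)", transformed)  # export

count, top = dest.execute(
    "SELECT COUNT(*), MAX(price_dollars) FROM dst").fetchone()
```

Because import and export are synchronized through the buffer, the full 5,000-row source set is never resident in memory at once.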
The data pump processes may preferably be initiated by execution of instructions in pre-stored task modules. For example, execution of such instructions may cause selection (importation) of specified source data. In addition, the present invention preferably utilizes data interface drivers for each respective type of source data, thus allowing conversion to a relational (tabular) format. Therefore, selection instructions in task modules may consist of query language commands, such as SQL commands.
The data pump process may also execute instructions in pre-stored task modules to accomplish transformations of data in accordance with whatever degree of complexity is desired. Preferably, the task modules incorporate conventional language scripts, allowing specification of complex data transformations in conventional programming languages with a minimum of technical knowledge.
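One hypothetical shape for such a pre-stored task module is a stored query paired with a transformation script applied per row; the module structure, field names, and sample data below are illustrative assumptions, with a Python lambda standing in for a conventional language script.

```python
# Hypothetical task module: a selection query plus a per-row transform.
task_module = {
    "select": "SELECT id, first, last FROM people",
    # A script-like transform: combine two name columns into one field.
    "transform": lambda row: (row[0], f"{row[1]} {row[2]}".title()),
}

rows = [(1, "ada", "lovelace"), (2, "alan", "turing")]
result = [task_module["transform"](r) for r in rows]
# result == [(1, 'Ada Lovelace'), (2, 'Alan Turing')]
```

The transformation logic lives in the module rather than in the query language, which is what allows arbitrarily complex transformations.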
Thus, the present invention combines the ease of use of conventional, high-level query commands, with the control capability of more procedural programming languages.
Moreover, the present invention preferably incorporates easy-to-use user interfaces. These allow, for example, the intuitive specification of destination location (for export), as well as the correspondence of source and destination access constructs (e.g., of source and destination tabular column names). Thus, the present invention allows the specification of consistent access constructs across destination site data, when this characteristic is desired.
In addition, the present invention may also incorporate the creation of data lineage identifiers which track the task modules described above, or specific versions of such task modules. Thus, the lineage identifier can provide details about data pump transformations undergone by the data. For example, the lineage identifier can act as a pointer to detailed history files of operations that were performed on data and transformed it into its current form.
Preferably, a lineage identifier is added to each row in a table. The lineage identifier can then be used to trace the source of the data: rows having a common identifier share a common origin and transformation history.
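The per-row lineage tagging just described can be sketched as follows; the identifier scheme, the history record, and its description string are assumptions for illustration.

```python
import uuid

# Every row written by one run of a task module is stamped with the same
# lineage identifier, which points into a history of the operations
# that produced the data.
lineage_id = uuid.uuid4().hex
history = {
    lineage_id: "task module 'import_sales' v2: imported, cents converted to dollars",
}

rows = [(1, 2.5), (2, 4.0)]
tagged = [row + (lineage_id,) for row in rows]   # append identifier to each row

# Trace any row back to the transformations it underwent.
provenance = history[tagged[0][-1]]
```

Looking up the identifier on any row yields the detailed operation history for every row sharing that identifier.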
Kiernan Casey L.
MacLeod Stewart P.
Microsoft Corporation
Mizrahi Diane
Woodcock & Washburn LLP
Method and apparatus for import, transform and export of data