Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-12-16
2002-08-13
Mizrahi, Diane D. (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000
Reexamination Certificate
active
06434558
ABSTRACT:
TECHNICAL FIELD
The present invention relates generally to database systems, and more particularly to a system for maintaining lineage information for data stored in a database.
BACKGROUND OF THE INVENTION
A relational database is a collection of related data that is organized in related two-dimensional tables of columns and rows wherein information can be derived by performing set operations on the tables, such as join, sort, merge, and so on. The data stored in a relational database is typically accessed by way of a user-defined query that is constructed in a query language such as Structured Query Language (“SQL”). A SQL query is non-procedural in that it specifies the objective or desired result of the query in a language meaningful to a user but does not define the steps to be performed, or the order of the steps in order to accomplish the query.
Moreover, very large conventional database systems provide a storehouse for data generated from a variety of locations and applications (often referred to as data warehouses or data marts). The quality and reliability of the storehouse is greatly effected by the quality and reliability of its underlying data. Because the data can originate from a variety of sources, the quality and reliability of data will often depend on the quality and reliability of the source. Moreover, the matter is further complicated because individual rows of data within a single table can originate from different sources.
Currently, if the data in a database is questionable, there is no easy way to track the history of the data to determine where it originate or how it may have been changed. As such, it would be advantageous to users of a database to have tools that allow the users to trace aspects of the history (i.e., where the data originated and how the data has been transformed) of the data in a database.
The task of tracing aspects of the history of data in a database is further complicated in enterprise-wide databases (such as data warehouses) where data may flow into the database from direct as well as indirect data sources, (i.e., the data may have been collected from another database that itself directly or indirectly derived the data). In other words, the data may have made multiple “hops” before reaching the destination database of interest.
As such, there is a need for providing method and apparatus for determining information about the history (i.e., lineage) of data contained within a database.
SUMMARY OF THE INVENTION
Briefly, the present invention is directed toward database technology that provides users with powerful tools necessary to manage and exploit data. The present invention provides a system and method for tracking the lineage of data within database tables. According to an aspect of the invention, data within the tables are tracked by attaching lineage information to the data, preferably, by adding a lineage identifier to each row in a table. Data that share a common lineage can be identified by virtue of sharing a common lineage identifier.
The lineage identifier can then be used to trace the source of the data, i.e., data having a common identifier share a common history. Additionally, the lineage identifier can provide details about transformations undergone by the data. For example, the lineage identifier can act as a pointer to a detailed history files of operations that were performed on the data to transform it into its current form. Preferably, the lineage identifier tracks program modules as well as specific versions of the program modules that transformed the particular data under consideration.
As a result of the data lineage mechanism, users can trace the history data in a table, even when that data has made several hops among databases, where the data has undergone one or more transformations, or where the transforming program modules have themselves under gone revision. This provides users with a powerful mechanism to have higher confidence in the quality and reliability of data in a database and to quickly trace and correct errors in the data.
Preferably, the lineage data type is an identifier that is universally unique and is optimized to provide little impact on the performance of the data base. For example, by providing a sufficient size identifier to ensure its uniqueness while minimizing storage size. More preferably, the data lineage data type is a sixteen-byte number.
REFERENCES:
patent: 4799152 (1989-01-01), Chuang et al.
patent: 5844794 (1998-12-01), Keeley
patent: 6016497 (2000-01-01), Suver
patent: 6230212 (2000-05-01), Morel et al.
patent: 6092071 (2000-07-01), Bolan et al.
patent: 6112207 (2000-08-01), Nori et al.
Shek, et al. (IEEE publication, Mar. 1999) discloses exploiting data lineage fro parallel optimization in external DBMSs in Data Engineering, 1999 proc. p. 256.*
Marathe, Ap. et al. (IEEE publication, 2001) discloses tracing lineage of array data in Scientific and Statistical Database Management, 2001, pp. 69-78.
Kiernan Casey L.
MacLeod Stewart P.
Rajarajan Vij
Microsoft Corporation
Mizrahi Diane D.
Woodcock & Washburn LLP
LandOfFree
Data lineage data type does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Data lineage data type, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Data lineage data type will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2929790