Providing Provenance Through Workflows and Databases
Executive Summary
Data provenance is a fundamental issue in the processing of scientific
information and beyond. Two lines of research have been pursued in recent
years with direct bearing on the issues of data provenance. In one of them,
provenance in workflows, the emphasis has been on extracting provenance
from logs of events marking the execution of different modules over various
initial and derived datasets. In the other line of research, provenance in
databases, the emphasis has been on the propagation of provenance through
the operators that make up database views, or on propagation of provenance
through copy/cut-and-paste operations within and among databases.
The objective of this proposal is to provide a framework for overcoming these difficulties, and to provide tools that allow a truly comprehensive approach to defining, manipulating, managing and querying the provenance of scientific data. The method that will be followed is to use a data model that supports nested collections, and a functional language (the Nested Relational Calculus, NRC) to describe workflow specifications and database transformations over nested collections. Using this model and language, the theoretical underpinnings of a joint framework for defining, manipulating, managing and querying data provenance will be developed, along with algorithms for managing provenance and reducing provenance overload. Techniques for opening up the "black box" style of provenance in workflow systems will also be explored. While the theoretical foundation of the framework will be based on NRC, the results will be transitioned to an analogous foundation based on XML and XQuery, which will be used for the implementation. A prototype will be developed, and the feasibility of the approach evaluated. The work builds on the PIs' expertise and past work on provenance in workflows and provenance summarization techniques, provenance in databases, and NRC query and update languages.
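As an illustration of what "propagation of provenance through the operators that make up database views" can mean, the following is a minimal Python sketch of lineage-style annotation: each tuple carries the set of source-tuple identifiers it was derived from, and each operator combines those sets. This is one simple notion of provenance chosen for illustration, not the framework proposed here; the relation contents, the identifier scheme (`genes:1`, `assays:2`) and the function names are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Row = Tuple
# An annotated relation maps each tuple to the ids of the source tuples it derives from.
Annotated = Dict[Row, FrozenSet[str]]


def select(rel: Annotated, pred: Callable[[Row], bool]) -> Annotated:
    """Selection keeps a tuple's annotation unchanged."""
    return {row: ids for row, ids in rel.items() if pred(row)}


def project(rel: Annotated, cols: List[int]) -> Annotated:
    """Projection merges duplicates and unions their annotations."""
    out: Annotated = {}
    for row, ids in rel.items():
        key = tuple(row[c] for c in cols)
        out[key] = out.get(key, frozenset()) | ids
    return out


def join(left: Annotated, right: Annotated, lcol: int, rcol: int) -> Annotated:
    """Equi-join: an output tuple depends on both of its input tuples."""
    out: Annotated = {}
    for lrow, lids in left.items():
        for rrow, rids in right.items():
            if lrow[lcol] == rrow[rcol]:
                key = lrow + rrow
                out[key] = out.get(key, frozenset()) | lids | rids
    return out


# Hypothetical source relations, each tuple tagged with its own id.
genes: Annotated = {
    ("g1", "BRCA1"): frozenset({"genes:1"}),
    ("g2", "TP53"): frozenset({"genes:2"}),
}
assays: Annotated = {
    ("g1", 0.9): frozenset({"assays:1"}),
    ("g1", 0.4): frozenset({"assays:2"}),
    ("g2", 0.7): frozenset({"assays:3"}),
}

# A view listing the names of genes that have at least one assay.
view = project(join(genes, assays, 0, 0), [1])
for row, ids in sorted(view.items()):
    print(row, sorted(ids))
```

In this sketch the view tuple `("BRCA1",)` ends up annotated with one `genes` tuple and two `assays` tuples, which is the kind of information a provenance query over the view would return. Richer provenance models track how the sources were combined, not only which ones were used.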
Software
PDiffView is a software system that takes as input two runs of the same specification and shows how their executions differ. This can be used to understand why the results of workflow runs differ.
Project Members
Susan Davidson, Val Tannen, Sanjeev Khanna, Zhuowei Bao, Sudeepa Roy, Todd J. Green
Partner organizations
Arizona State University; University Paris-Sud, Orsay
Funding
This material is based upon work supported by the National Science Foundation under Grant No. 0803524. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.