Providing Provenance Through Workflows and Databases

Executive Summary

Data provenance is a fundamental issue in the processing of scientific information and beyond. Two lines of research have been pursued in recent years with direct bearing on the issues of data provenance. In one of them, provenance in workflows, the emphasis has been on extracting provenance from logs of events marking the execution of different modules over various initial and derived datasets. In the other line of research, provenance in databases, the emphasis has been on the propagation of provenance through the operators that make up database views, or on propagation of provenance through copy/cut-and-paste operations within and among databases.
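To make the database side of this concrete, here is a minimal sketch of how provenance can be propagated through the operators of a view. It assumes a deliberately simple lineage model in which each tuple carries the set of source-tuple identifiers it was derived from; this model, and every name in the code (`Annotated`, `source`, `join`, `project`, the `genes` and `pathways` tables), is an illustrative assumption and not the formalism used in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotated:
    """A tuple of values plus the ids of the source tuples it derives from."""
    values: tuple
    lineage: frozenset


def source(name, rows):
    """Tag each base row with a unique id such as 'genes:0' (hypothetical scheme)."""
    return [Annotated(tuple(row), frozenset({f"{name}:{i}"}))
            for i, row in enumerate(rows)]


def select(rel, pred):
    """Selection keeps qualifying tuples; their lineage is unchanged."""
    return [t for t in rel if pred(t.values)]


def join(left, right, l_col, r_col):
    """Equi-join: each output tuple depends on both of its input tuples."""
    return [
        Annotated(l.values + r.values, l.lineage | r.lineage)
        for l in left
        for r in right
        if l.values[l_col] == r.values[r_col]
    ]


def project(rel, cols):
    """Projection with duplicate elimination: merged duplicates pool their lineage."""
    merged = {}
    for t in rel:
        key = tuple(t.values[c] for c in cols)
        merged[key] = merged.get(key, frozenset()) | t.lineage
    return [Annotated(k, v) for k, v in merged.items()]


# Made-up example data: genes mapped to pathways, pathways mapped to names.
genes = source("genes", [("g1", "p1"), ("g2", "p1"), ("g3", "p2")])
pathways = source("pathways", [("p1", "glycolysis"), ("p2", "apoptosis")])

# A view listing pathway names that have at least one gene.
view = project(join(genes, pathways, 1, 0), [3])
for t in view:
    print(t.values, sorted(t.lineage))
```

Each row of the view now records which base tuples contributed to it, which is the kind of information a view-level provenance mechanism has to carry through every operator.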

These two bodies of work have employed different techniques and at first glance their results appear quite different. However, in many scientific applications database manipulations co-exist with the execution of workflow modules and the provenance of the resulting data should integrate both kinds of processing into a usable paradigm.

An analysis of existing work on data provenance in workflows and in databases shows that the main difficulties in unifying these two different kinds of data provenance are:

  • (i) The lack of a data model that is rich enough to capture the interaction between the structure of the data and the structure of the workflow.
  • (ii) The lack of a high-level specification framework in which database operators and workflow modules can be treated uniformly.

The objective of this proposal is to provide a framework for overcoming these difficulties, and to provide tools that allow a truly comprehensive approach to defining, manipulating, managing and querying the provenance of scientific data.

The method that will be followed is to use a data model that supports nested collections, and a functional language (the Nested Relational Calculus, NRC) to describe workflow specifications and database transformations over nested collections. Using this model and language, the theoretical underpinnings of a joint framework for defining, manipulating, managing and querying data provenance will be developed, along with algorithms for managing provenance and reducing provenance overload. Techniques for opening up the "black box" style of provenance in workflow systems will also be explored. While the theoretical foundation of the framework will be based on NRC, the results will be transitioned to an analogous foundation based on XML and XQuery, which will be used for the implementation. A prototype will be developed, and the feasibility of the approach evaluated.
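As a rough illustration of why a calculus over nested collections lets database operators and workflow modules be treated uniformly, the sketch below encodes a few NRC-style combinators in Python. It assumes collections are represented as Python lists and that a workflow module is simply a function; `ext`, `nrc_map`, `unnest`, the module `align`, and the sample data are all hypothetical names chosen for this example, not part of the project's framework.

```python
def singleton(x):
    """Collection containing exactly one element."""
    return [x]


def union(a, b):
    """Union of two collections (list concatenation in this sketch)."""
    return a + b


def ext(f, coll):
    """NRC-style 'ext' (flat-map): f maps an element to a collection;
    the resulting collections are unioned together."""
    out = []
    for x in coll:
        out = union(out, f(x))
    return out


def nrc_map(f, coll):
    """Element-wise map, derived from ext and singleton."""
    return ext(lambda x: singleton(f(x)), coll)


def unnest(coll):
    """Database-style operator: turn (key, inner collection) pairs
    into a flat collection of (key, item) pairs."""
    return ext(lambda pair: nrc_map(lambda item: (pair[0], item), pair[1]), coll)


def align(seq):
    """Hypothetical workflow module, treated as a black-box function."""
    return seq.upper()


# Nested input: each sample carries a collection of sequences.
samples = [("s1", ["acgt", "ttga"]), ("s2", ["ggcc"])]

flat = unnest(samples)                                   # database transformation
aligned = nrc_map(lambda p: (p[0], align(p[1])), flat)   # workflow module step
print(aligned)
```

Because both steps are built from the same combinators, a provenance mechanism defined once for `ext`, `union` and `singleton` would in principle cover both the database transformation and the module invocation.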

The work builds on the PIs' expertise and past work on provenance in workflows and provenance summarization techniques, provenance in databases, and NRC query and update languages.

Some references

  • Reconcilable Differences
    International Conference on Database Theory (ICDT) (2009)
    Todd J. Green   Zachary Ives   Val Tannen   

  • Containment of conjunctive queries on annotated relations
    International Conference on Database Theory (ICDT) (2009)
    Todd J. Green   

  • Differencing Provenance in Scientific Workflows
    International Conference on Data Engineering (ICDE) (2009)
    Zhuowei Bao   Sarah Cohen Boulakia   Susan Davidson   Anat Eyal   Sanjeev Khanna   

  • Optimizing User Views for Workflows
    International Conference on Database Theory (ICDT) (2009)
    Olivier Biton   Susan Davidson   Sanjeev Khanna   Sudeepa Roy   

  • Detecting and Resolving Unsound Workflow Views for Correct Provenance Analysis
    Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD) (2009)
    Peng Sun   Ziyang Liu   Susan Davidson   Yi Chen   

  • WOLVES: Achieving Correct Provenance Analysis by Detecting and Resolving Unsound Workflow Views
    International Conference on Very Large Databases (VLDB) (demo) (2009)
    Peng Sun   Ziyang Liu   Susan Davidson   Yi Chen   

  • PDiffView: Viewing the Difference in Provenance of Workflow Results
    International Conference on Very Large Databases (VLDB) (demo) (2009)
    Zhuowei Bao   Sarah Cohen Boulakia   Susan Davidson   Pierrick Girard   

  • Annotated XML: Queries and Provenance
    Proceedings of ACM Symposium on Principles of Database Systems (PODS) (2008)
    Nate Foster   Todd J. Green   Val Tannen   

Software

PDiffView is a software system that takes as input two runs of the same specification and shows how their executions differ. This can be used to understand why the results of workflow runs differ. [prototype] [video]
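To give a feel for what comparing two runs involves, here is a deliberately naive sketch. It is not PDiffView's algorithm: it assumes each run is recorded as a collection of dataflow edges between module invocations and simply reports which edges appear in only one run. The function `diff_runs` and the module names are hypothetical.

```python
def diff_runs(run_a, run_b):
    """Compare two runs, each given as an iterable of (producer, consumer) edges."""
    a, b = set(run_a), set(run_b)
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "shared": sorted(a & b),
    }


# Two made-up runs of the same specification that took different branches.
run_1 = [("fetch", "align"), ("align", "build_tree"), ("build_tree", "report")]
run_2 = [("fetch", "align"), ("align", "filter"), ("filter", "build_tree"),
         ("build_tree", "report")]

difference = diff_runs(run_1, run_2)
for label, edges in difference.items():
    print(label, edges)
```

A set difference like this ignores repeated executions of the same module (loops, parallel copies) and says nothing about the cheapest way to turn one run into the other, so it should be read only as a starting point for thinking about the problem.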

Project Members

Susan Davidson   Val Tannen   Sanjeev Khanna   Zhuowei Bao   Sudeepa Roy   Todd J. Green   

Partner organizations

Arizona State University
University Paris-Sud, Orsay

Funding

This material is based upon work supported by the National Science Foundation under Grant No. 0803524.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Levine Hall
3330 Walnut Street
Philadelphia, PA 19104
 

Last update: 11/23/09