Welcome to Penn Provenance

Funded by NIH NIBIB #1U01EB020954-01, “Approximating and Reasoning about Provenance” and NSF ACI-1547360, “Data Provenance: Provenance-Based Trust Management for Collaborative Data Curation”.

In many Big Data applications today, such as Next-Generation Sequencing, data processing pipelines are highly complex, span multiple institutions, and include many human and computational steps. The pipelines evolve over time and vary across institutions, so it is difficult to track and reason about the processing pipelines to ensure consistency and correctness of results. Provenance-enabled scientific workflow systems promise to aid here – yet such workflow systems are often avoided due to perceptions of inflexibility, lack of good provenance analytics tools, and emphasis on supporting the data consumer rather than producer. We propose to better incentivize the adoption of workflow and other provenance tracking tools:

  1. Instead of requiring a single workflow system across the entire pipeline, which can be inflexible, we allow for integration across multiple autonomous systems (provenance-enabled workflow systems, provenance tracking systems for languages like Python and R, etc.), and even across steps performed without any provenance tracking at all.
  2. We develop provenance reasoning capabilities specifically useful to the data provider, such as provenance analytics across time, sites, and users; finding the code modules that best explain why two results are different; regression testing to determine whether a code change would affect prior results; and reconstructing missing provenance for steps that were not captured. These capabilities are expected to lead to wider tracking of data provenance, and ultimately to more consistent, reproducible, and reliable science. We will validate this hypothesis through the evaluation of our technologies within a Next-Generation Sequencing pipeline run by one of the PIs with collaborators at other institutions.
  3. We are investigating mechanisms for combining curation or annotations from multiple users, computing trust, and determining consensus annotations based on provenance.
  4. We are developing generalizations of data provenance for “non-relational” operators such as those in linear algebra, time series manipulations, and more.
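
As a rough illustration of the third item above, consensus over annotations from multiple users can be pictured as a trust-weighted vote. The sketch below is only a minimal example under stated assumptions: the `consensus` function, the user names, the labels, and the trust scores are all hypothetical, and the trust scores are simply given as inputs rather than computed from provenance as the project intends.

```python
from collections import defaultdict

def consensus(annotations, trust):
    """Pick the label with the highest total trust (hypothetical scheme).

    annotations: list of (user, label) pairs for one data item.
    trust: dict mapping user -> trust score; unknown users count as 0.
    """
    scores = defaultdict(float)
    for user, label in annotations:
        scores[label] += trust.get(user, 0.0)
    best = max(scores, key=scores.get)
    return best, dict(scores)

# Hypothetical annotations from three curators on the same record.
annotations = [("alice", "valid"), ("bob", "suspect"), ("carol", "valid")]
trust = {"alice": 0.9, "bob": 0.6, "carol": 0.5}

label, scores = consensus(annotations, trust)
print(label, scores)
```

A real system would presumably derive the trust scores from each annotation's provenance (who produced it, from which inputs, and with which tools) instead of fixing them up front.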

Papers

Tools

Habitat Data Management Platform

The Habitat Platform is a basic data lake or “data habitat”, with a set of services for

  1. large object, graph, and relation stores;
  2. authorization and access control; and
  3. quick retargeting to different underlying storage systems.

PROV Storage is a web site where users can upload provenance data in either the PROV-XML or PROV-N format. Uploaded PROV data can also be accessed via web services.
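
For readers unfamiliar with PROV-N, the following minimal sketch assembles a small PROV-N document as text using only the Python standard library. The `provn_document` helper, the `ex` prefix, the `http://example.org/` namespace, and the entity, activity, and agent names are all illustrative assumptions; the sketch does not use or describe the PROV Storage upload interface.

```python
from datetime import datetime

def iso(ts):
    """Format a datetime the way PROV-N expects (xsd:dateTime)."""
    return ts.strftime("%Y-%m-%dT%H:%M:%S")

def provn_document(prefix, namespace, statements):
    """Wrap PROV-N statements in a document with a single namespace prefix."""
    lines = ["document", f"  prefix {prefix} <{namespace}>"]
    lines += [f"  {s}" for s in statements]
    lines.append("endDocument")
    return "\n".join(lines)

# Hypothetical alignment step: who ran it, what it read, what it produced.
start = datetime(2016, 1, 1, 9, 0, 0)
end = datetime(2016, 1, 1, 9, 30, 0)

doc = provn_document("ex", "http://example.org/", [
    "entity(ex:rawReads)",
    "entity(ex:alignedReads)",
    "agent(ex:alice)",
    f"activity(ex:align, {iso(start)}, {iso(end)})",
    "wasAssociatedWith(ex:align, ex:alice, -)",
    "used(ex:align, ex:rawReads, -)",
    "wasGeneratedBy(ex:alignedReads, ex:align, -)",
    "wasDerivedFrom(ex:alignedReads, ex:rawReads)",
])

print(doc)
```

The printed text could be saved to a file and validated with any PROV-N-aware tool before upload.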

ProvenanceTracker

Tracker is a program that captures process and file access events as provenance data. The event logs are collected in an intermediate data store, where a user can browse and manipulate them. Follow the instructions below to install and run the program.

You can view and manipulate the event logs captured by Tracker in LogViewer.

Quick Start Guide

  1. Download the distribution for your platform.
  2. Install the distribution. Check the requirements first if the installation fails.
  3. Configure platform-specific policy settings: Audit Policy for Windows, audit for Linux and OS X.
  4. Make sure tracker.conf is ready before the program runs. You may use the one included in the distribution. For details, see configuration.
  5. Run the program as an administrator. For console mode, see Console Mode.
  6. Use your PROV Storage login credentials for authentication.

PROVision Provenance Instrumentation & Analysis Tool

The PROVision tool lets you take captured provenance and reconstruct missing details, for instance, which input records were used to generate a given output record.
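
To make "which input records were used to generate an output record" concrete, here is a minimal sketch of record-level lineage for a group-and-sum step. It is not PROVision's method: PROVision reconstructs such details after the fact, whereas this example simply records lineage while computing. The function name, field names, and sample records are hypothetical.

```python
from collections import defaultdict

def group_sum_with_lineage(records, key, value):
    """Sum `value` per `key` and remember which input ids fed each output."""
    totals = defaultdict(int)
    lineage = defaultdict(set)
    for rec in records:
        totals[rec[key]] += rec[value]
        lineage[rec[key]].add(rec["id"])
    return dict(totals), {k: sorted(ids) for k, ids in lineage.items()}

# Hypothetical input records, each with a unique id.
reads = [
    {"id": "r1", "gene": "geneA", "count": 10},
    {"id": "r2", "gene": "geneB", "count": 4},
    {"id": "r3", "gene": "geneA", "count": 7},
]

totals, lineage = group_sum_with_lineage(reads, "gene", "count")
print(totals["geneA"], lineage["geneA"])
```

Here the output record for `geneA` is traced back to inputs `r1` and `r3`, which is the kind of detail that is lost when a step runs without record-level tracking.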

Team

The Penn Provenance Team includes members from computer science, bioengineering, biology, and medicine. Key participants and collaborators include:

  • Zachary Ives, CIS
  • Junhyong Kim, Biology
  • Susan Davidson, CIS
  • Sampath Kannan, CIS
  • Val Tannen, CIS
  • Brian Litt, Bioengineering and Neurology
  • Abdussalam Alawini, CIS
  • Soonbo Han
  • John Frommeyer, SEAS
  • Nan Zheng, CIS
  • Stephen Fisher, Biology