Welcome to Penn Provenance
Funded by NIH NIBIB #1U01EB020954-01, “Approximating and Reasoning about Provenance” and NSF ACI-1547360, “Data Provenance: Provenance-Based Trust Management for Collaborative Data Curation”.
In many Big Data applications today, such as Next-Generation Sequencing, data processing pipelines are highly complex, span multiple institutions, and include many human and computational steps. The pipelines evolve over time and vary across institutions, so it is difficult to track and reason about the processing pipelines to ensure consistency and correctness of results. Provenance-enabled scientific workflow systems promise to aid here – yet such workflow systems are often avoided due to perceptions of inflexibility, lack of good provenance analytics tools, and emphasis on supporting the data consumer rather than producer. We propose to better incentivize the adoption of workflow and other provenance tracking tools:
- Instead of requiring a single workflow system across the entire pipeline, which can be inflexible, we allow for integration across multiple autonomous systems (provenance-enabled workflow systems, provenance tracking systems for languages like Python and R, etc.), and even across steps performed without any provenance tracking at all.
- We develop provenance reasoning capabilities specifically useful to the data provider, such as provenance analytics across time, sites, and users; finding the code modules that best explain why two results are different; regression testing to determine whether a code change would affect prior results; and reconstructing missing provenance for steps that were not captured. These capabilities are expected to lead to wider tracking of data provenance, and ultimately to more consistent, reproducible, and reliable science. We will validate this hypothesis through the evaluation of our technologies within a Next-Generation Sequencing pipeline run by one of the PIs with collaborators at other institutions.
- We are investigating mechanisms for combining curation or annotations from multiple users, computing trust, and determining consensus annotations based on provenance.
- We are developing generalizations of data provenance for “non-relational” operators such as those in linear algebra, time series manipulations, and more.
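As a rough illustration of the provider-side reasoning described above (explaining why two results differ, and checking which prior results a code change could affect), here is a minimal Python sketch. It assumes each run's provenance has already been reduced to a mapping from module name to a version string, and each output to the set of modules it depends on. The function names, module names, and data layout are hypothetical and are not taken from the project's tools; comparing versions is only a crude first pass, not the project's method for finding the modules that best explain a difference.

```python
def changed_modules(run_a, run_b):
    """Return the names of modules whose recorded version differs between runs.

    Each run is a dict mapping a module name to a version string
    (for example a content hash). A module missing from one run counts
    as changed.
    """
    names = sorted(set(run_a) | set(run_b))
    return [name for name in names if run_a.get(name) != run_b.get(name)]


def affected_outputs(depends_on, changed):
    """Return outputs whose recorded dependencies include a changed module.

    depends_on maps an output name to the set of module names used to
    produce it, as recorded in provenance.
    """
    changed = set(changed)
    return sorted(out for out, mods in depends_on.items() if mods & changed)


# Hypothetical provenance for two runs of the same pipeline.
run_2019 = {"trim": "v1", "align": "v1", "count": "v1"}
run_2020 = {"trim": "v1", "align": "v2", "count": "v1"}
depends_on = {
    "qc_report": {"trim"},
    "expression_table": {"trim", "align", "count"},
}

changed = changed_modules(run_2019, run_2020)
print(changed)
print(affected_outputs(depends_on, changed))
```

Only outputs that depend on a changed module need to be re-run in a regression test; everything else can be reused from the earlier run.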
Papers
- Compact, Tamper-Resistant Archival of Fine-Grained Provenance. Nan Zheng, Zachary G. Ives. Proc. VLDB, 2020.
- Fine-Grained Provenance for Matching and ETL. Nan Zheng, Abdussalam Alawini, Zachary G. Ives. ICDE 2019.
- Dataset Relationship Management. Zachary G. Ives, Soonbo Han, Yi Zhang, Nan Zheng. CIDR 2019.
- Collaborating and Sharing Data in Epilepsy Research. Joost Wagenaar, Greg Worrell, Matthias Dumpelmann, Zachary Ives, Brian Litt, Andreas Schulze-Bonhage. Journal of Clinical Neurophysiology.
- Looking at Everything in Context. Zachary Ives, Zhepeng Yan, Nan Zheng, Brian Litt, Joost B. Wagenaar. CIDR 2015.
- Approximated Summarization of Data Provenance. Eleanor Ainy, Pierre Bourhis, Susan B. Davidson, Daniel Deutch, Tova Milo. PROX. EDBT 2016: 620-623.
- Fine-grained Provenance for Linear Algebra Operators. Zhepeng Yan, Val Tannen, Zachary Ives. TaPP 2016.
Tools
Habitat Data Management Platform
The Habitat Platform is a basic data lake or “data habitat”, with a set of services for
- large object, graph, and relation stores,
- authorization and access control, and
- the ability to quickly retarget to different underlying storage systems.
PROV Storage is a web site where users can upload provenance data in either the PROV-XML or PROV-N format. Uploaded PROV data can also be accessed via web services.
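For readers unfamiliar with PROV-N, the following Python sketch renders a very small PROV-N document as text, of the kind that could be saved to a file and uploaded. It is an illustration only: the helper `to_prov_n`, the `ex` prefix, the namespace URL, and the entity and activity names are all made up for this example, and the sketch covers only four PROV-N statement types (`entity`, `activity`, `used`, `wasGeneratedBy`). It does not call any PROV Storage web service.

```python
def to_prov_n(prefix, namespace, entities, activities, used, generated):
    """Render a minimal PROV-N document as a string.

    used:      list of (activity, entity) pairs
    generated: list of (entity, activity) pairs
    The trailing "-" in used/wasGeneratedBy marks an unspecified time.
    """
    lines = ["document", f"  prefix {prefix} <{namespace}>"]
    lines += [f"  entity({prefix}:{e})" for e in entities]
    lines += [f"  activity({prefix}:{a})" for a in activities]
    lines += [f"  used({prefix}:{a}, {prefix}:{e}, -)" for a, e in used]
    lines += [
        f"  wasGeneratedBy({prefix}:{e}, {prefix}:{a}, -)" for e, a in generated
    ]
    lines.append("endDocument")
    return "\n".join(lines)


# Hypothetical example: one alignment step turning raw reads into aligned reads.
doc = to_prov_n(
    prefix="ex",
    namespace="http://example.org/",
    entities=["raw_reads", "aligned_reads"],
    activities=["align"],
    used=[("align", "raw_reads")],
    generated=[("aligned_reads", "align")],
)
print(doc)
```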
ProvenanceTracker
Tracker is a software program that captures process and file access events for provenance data. The event logs are collected in an intermediate data store, where a user can browse and manipulate them. Follow the instructions below to install and run the program.
You can view and manipulate event logs captured by Tracker on LogViewer.
Quick Start Guide
- Download the distribution for your platform.
- Install the distribution. Check the requirements first if needed.
- Configure platform-specific policy settings: Audit Policy for Windows, audit for Linux and OSX.
- Make sure tracker.conf is ready before the program runs. You may use the one included in the distribution. For details, see configuration.
- Run the program as administrator. For console mode, see Console Mode.
- Use your PROV Storage login credentials for authentication.
PROVision Provenance Instrumentation & Analysis Tool
The PROVision tool takes existing provenance and reconstructs missing details, for instance which input records were used to generate a given output record.
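To make "which input records were used to generate an output record" concrete, here is a small Python sketch of record-level lineage for a group-and-sum step. It is not PROVision's implementation or API; the function name, record identifiers, and data are hypothetical. The sketch records lineage while the computation runs, whereas the tool is described as reconstructing such details after the fact.

```python
from collections import defaultdict


def sum_by_key_with_lineage(records):
    """Sum values per key and remember which input records fed each sum.

    records: iterable of (record_id, key, value) tuples.
    Returns a dict mapping key -> (total, [record_id, ...]).
    """
    totals = defaultdict(int)
    lineage = defaultdict(list)
    for record_id, key, value in records:
        totals[key] += value
        lineage[key].append(record_id)
    return {key: (totals[key], lineage[key]) for key in totals}


# Hypothetical input: read counts per gene, one record per sample.
records = [
    ("r1", "geneA", 2),
    ("r2", "geneB", 5),
    ("r3", "geneA", 3),
]
result = sum_by_key_with_lineage(records)
print(result["geneA"])
```

Given the lineage list for an output record, a user can trace an unexpected total back to the specific input records that produced it.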
Team
The Penn Provenance Team includes members from computer science, bioengineering, biology, and medicine. Key participants and collaborators include:
- Zachary Ives, CIS
- Junhyong Kim, Biology
- Susan Davidson, CIS
- Sampath Kannan, CIS
- Val Tannen, CIS
- Brian Litt, Bioengineering and Neurology
- Abdussalam Alawini, CIS
- Soonbo Han
- John Frommeyer, SEAS
- Nan Zheng, CIS
- Stephen Fisher, Biology