Provenance and Citation for Data Science

In the modern era of data-driven science, it is essential to rigorously document how results were obtained, whether for reproducibility, data reuse, acknowledgement, or fact-checking. This motivates the study of data provenance as its own (meta)data model, with close ties to queries, analytics, machine learning, and citation.

  • Digital citation (Davidson, Wu): a study of citation techniques for data and software
  • PennProvenance (Ives, Davidson, Zheng, Han): managing and reconstructing provenance metadata for the sciences
  • mProv (Ives, collaborators at U Memphis, UCLA, UCSF, GA Tech): provenance management for streaming device data
  • Provenance for text (Ives, Roth, Zhang, Wu, collaborators at Facebook): tracking claims and quotes in the open web

Data Streams and Incremental Machine Learning Views

Many application settings, from wearable health devices to web crawling, involve streaming data. As we combine stream computation, integration and querying, and machine learning, how do we develop updatable representations of what we currently know? In effect, we want to develop views of knowledge extracted or inferred by AI.

  • Knowledge graph management (Ives, Han, Chen): storage and incremental updates for learned knowledge graphs
  • Incremental machine learning (Davidson, Wu)
  • Programming abstractions for data streams and ML (Alur, Ives, Hilliard)
  • Transformers for time series data (Liang, Ives, and collaborators)

AI for Scalable Data Science

Modern data-driven science involves many different subtasks, from data discovery, wrangling, and integration to cleaning and building machine learning models. How can we develop platforms that simplify and automate many of these tasks?

  • Juneau (Ives, Zhang, Zheng): data lake search to aid discovery in interactive data science

Distributed, Decentralized Data Analysis

How do we redesign distributed computation for modern settings, in which resources and even trust may be heavily decentralized?

Machine Learning for Data Analytics Systems

Can we use machine learning to improve the performance of query processing and workflows?