Provenance and Citation for Data Science
In the modern era of data-driven science, it is essential to rigorously document how results were obtained, whether for reproducibility, data reuse, acknowledgement, or fact-checking. This motivates the study of data provenance as its own (meta)data model, with close ties to queries, analytics, machine learning, and citation.
- Digital citation (Davidson, Wu): a study of citation techniques for data and software
- PennProvenance (Ives, Davidson, Zheng, Han): managing and reconstructing provenance metadata for the sciences
- mProv (Ives, collaborators at U Memphis, UCLA, UCSF, GA Tech): provenance management for streaming device data
- Provenance for text (Ives, Roth, Zhang, Wu, collaborators at Facebook): tracking claims and quotes in the open web
Data Streams and Incremental Machine Learning Views
Many application settings, from wearable health devices to web crawling, involve streaming data. As we combine stream computation, integration and querying, and machine learning, how do we develop updatable representations of what we currently know? In effect, we want to develop views of knowledge extracted or inferred by AI.
- Knowledge graph management (Ives, Han, Chen): storage and incremental updates for learned knowledge graphs
- Incremental machine learning (Davidson, Wu)
- Programming abstractions for data streams and ML (Alur, Ives, Hilliard)
- Transformers for time series data (Liang, Ives, and collaborators)
AI for Scalable Data Science
Modern data-driven science involves many different subtasks, ranging from data discovery, wrangling, and integration to cleaning and building machine learning models. How can we develop platforms that simplify and automate many of these tasks?
- Juneau (Ives, Zhang, Zheng): data lake search to aid discovery in interactive data science
Distributed, Decentralized Data Analysis
How do we redesign distributed computation for modern settings, in which resources and even trust may be heavily decentralized?
- Data management for disaggregated compute clusters (Loo, Liu, Angel)
- Blockchain-enabled data management (Loo, Amiri, collaborators at UCSB, SFU, CUHK, GU)
Machine Learning for Data Analytics Systems
Can we use machine learning to improve the performance of query processing and workflows?
- Learned query optimization (Marcus)
- Learned indices (Marcus)
- Summarizing knowledge in learned query optimization (Wu, Ives)