Research Projects – Penn Database Group

Machine Learning for Data Management & Integration

Many application settings, from wearable health devices to web crawling, involve streaming data. As we combine stream computation, integration and querying, and machine learning, how do we develop updatable representations of what we currently know? In effect, we want to develop views of knowledge extracted or inferred by AI.

Knowledge graph management (Ives, Han, Chen): storage and incremental updates for learned knowledge graphs
Incremental machine learning (Davidson, Wu)
Programming abstractions for data streams and ML (Alur, Ives, Hilliard)
Transformers for time series data (Liang, Ives, and collaborators)

AI for Scalable Data Science

Modern data-driven science involves many different subtasks, from data discovery, wrangling, integration, and cleaning to building machine learning models. How can we develop platforms that simplify and automate many of these tasks?

Juneau (Ives, Zhang, Zheng): data lake search to aid discovery in interactive data science
AIRFoundry (Ives, Wagenaar, Guntuku, Lee, Weisman): enabling scientific discovery for RNA synthesis
Efficiently storing and querying property graphs (Ives, Han, Makhijani), for provenance and other applications

Distributed, Decentralized Data Analysis

How do we redesign distributed computation for modern settings, in which resources and even trust may be heavily decentralized?

Data management for disaggregated compute clusters (Loo, Liu, Angel)
Blockchain-enabled data management (Loo, Amiri, collaborators at UCSB, SFU, CUHK, GU)

Machine Learning for Data Analytics Systems

Can we use machine learning to improve the performance of query processing and workflows?

Learned query optimization (Marcus, Ives, Yi)
Learned indices (Marcus)
Incorporating knowledge into learned query optimization (Wu, Ives, Marcus)

Provenance and Citation for Data Science

In the modern era of data-driven science, it is essential to rigorously document how results were obtained, whether for reproducibility, data reuse, acknowledgement, or fact-checking. This motivates the study of data provenance as its own (meta)data model, with close ties to queries, analytics, machine learning, and citation.

Digital citation (Davidson, Wu): a study of citation techniques for data and software
PennProvenance (Ives, Davidson, Zheng, Han): managing and reconstructing provenance metadata for the sciences
mProv (Ives, collaborators at U Memphis, UCLA, UCSF, GA Tech): provenance management for streaming device data
Provenance for text (Ives, Roth, Zhang, Wu, collaborators at Facebook): tracking claims and quotes in the open web