Dissertation proposal defense: Grigoris Karvounarakis
Wednesday, 11/26, 1:30 pm
Levine 307

Advisors: Zack Ives & Val Tannen
Committee: Boon Thau Loo (chair), Susan Davidson, Steve Zdancewic, Peter Buneman (external, Univ. of Edinburgh)

Title: Provenance in Collaborative Data Sharing

Abstract: In this dissertation we focus on recording, maintaining and exploiting provenance information in Collaborative Data Sharing Systems (CDSS). These are systems that support data sharing across loosely-coupled, heterogeneous collections of relational databases related by declarative schema mappings. A fundamental challenge in a CDSS is to support update exchange between participants, while tolerating disagreement between them and recording the provenance of exchanged data. This provenance information can be useful during update exchange, e.g., to evaluate provenance-based trust policies. It can also be exploited after update exchange, to answer a variety of user queries, about the quality, uncertainty or authority of the data, for applications such as trust assessment, ranking for keyword search over databases, or query answering in probabilistic databases.

To address these challenges, in this proposal we develop a novel model of provenance graphs that is informative enough to satisfy the needs of CDSS users and captures the semantics of query answering on various forms of annotated relations. We extend techniques from data integration, data exchange, incremental view maintenance and view update to define the formal semantics of unidirectional and bidirectional update exchange. We develop algorithms to perform it incrementally while maintaining provenance information. We present strategies for implementing our techniques over an RDBMS and experimentally demonstrate their viability in the Orchestra prototype system.
We propose ProQL, a query language for provenance graphs that lets CDSS users combine data querying with provenance testing and compute provenance-based annotations for their data, which are useful for a variety of applications. Finally, we outline proposed strategies for implementing ProQL over an RDBMS, as well as indexing techniques for provenance graphs to speed up provenance querying.