Data Provenance

Executive Summary

When you see some information on the Web, do you know how it got there? In all likelihood the information is extracted from a database, which in turn was extracted from other databases, and so on. So what you see may be many steps removed from the source and, at each step, may have been corrected, transformed, translated from a different language, censored, summarized, etc. If you are a scientist, or anyone concerned with the reliability of the data, how the information arrived at the form in which you see it (its provenance) is of crucial importance. You will want to know where and how the information was produced, who has corrected it and how old it is. If you are interested in intellectual property issues, data provenance is an essential part of understanding the ownership of data. Yet the information is often unavailable to you. Even the people who built the database may not be able to trace the provenance of their data back to its source, and there are very few tools for recording provenance in databases.

Understanding provenance of documents is not a new problem; it has occupied historians, textual critics and other scholars for centuries. The provenance of data in databases is a larger problem, because we are interested in data at all levels of granularity -- from a single pixel in a digital image to a whole database. Just as scholars comment on documents by attaching annotations (marginalia) to text, part of the solution to recording provenance is the attachment of annotations to components of databases. But databases are typically rigidly structured objects and the structure does not allow us to attach irregular data. Database researchers have recently considered more loosely structured forms of data and have developed software systems for querying and storing such data. This work is closely related to new formats that have been developed for structured documents on the Web. It is expected that this technology will provide the substrate for recording and tracking provenance. But the larger issues involve new data models, new query languages and new storage techniques.

This is a new project which is under consideration for funding by the Digital Libraries Initiative, an inter-agency program sponsored by NSF, DARPA, NLM, LoC, NEH, and NASA. Part of the project is simply to identify the issues clearly, and it is an area in which fresh ideas are needed. If you would like to be involved or learn more about the project, please send us mail.

Project Web site

Related Projects

Archiving of Scientific Data

Project Members

Peter Buneman, Susan Davidson, Sanjeev Khanna, Mark Liberman, Chris Overton, Keishi Tajima, Wang-Chiew Tan, Val Tannen


Levine Hall
3330 Walnut Street
Philadelphia, PA 19104

Last update: 08/02/11