Data Provenance
Executive Summary

When you see some information on the Web, do you know how it got there? In all likelihood the information was extracted from a database, which in turn was extracted from other databases, and so on. So what you see may be many steps removed from the source and, at each step, may have been corrected, transformed, translated from a different language, censored, summarized, etc.

If you are a scientist, or anyone concerned with the reliability of the data, how the information arrived at the form in which you see it (its provenance) is of crucial importance. You will want to know where and how the information was produced, who has corrected it and how old it is. If you are interested in intellectual property issues, data provenance is an essential part of understanding the ownership of data. Yet this information is often unavailable to you. Even the people who built the database may not be able to trace the provenance of their data back to its source, and there are very few tools for recording provenance in databases.

Understanding the provenance of documents is not a new problem; it has occupied historians, textual critics and other scholars for centuries. The provenance of data in databases is a larger problem, because we are interested in data at all levels of granularity, from a single pixel in a digital image to a whole database.

Just as scholars comment on documents by attaching annotations (marginalia) to text, part of the solution to recording provenance is the attachment of annotations to components of databases. But databases are typically rigidly structured objects, and the structure does not allow us to attach irregular data. Database researchers have recently considered more loosely structured forms of data and have developed software systems for querying and storing such data. This work is closely related to new formats that have been developed for structured documents on the Web. It is expected that this technology will provide the substrate for recording and tracking provenance. But the larger issues involve new data models, new query languages and new storage techniques.

This is a new project which is under consideration for funding by the Digital Libraries Initiative, an inter-agency program sponsored by NSF, DARPA, NLM, LoC, NEH, and NASA. Part of the project is simply to identify the issues clearly, and it is an area in which fresh ideas are needed. If you would like to be involved or learn more about the project, please send us mail.

Related Projects

Project Members

Peter Buneman, Susan Davidson, Sanjeev Khanna, Mark Liberman, Chris Overton, Keishi Tajima, Wang-Chiew Tan, Val Tannen

Publications