SHARQ
Executive Summary
Over the past decade, biological research has been transformed from a
science of the "small" to a science of the "large". Fueled by novel
technologies capable of producing massive amounts of data for a single
experiment, scientists are faced with an explosion of information
which must be rapidly analyzed and combined with other data to form
hypotheses and create knowledge. Thus a number of new research
challenges have arisen in data modeling and data integration that
must be solved to further biological as well as other scientific
research.
A major challenge lies in effective sharing of information among
collaborating, yet autonomous, parties.
Such parties are characterized by diverse perspectives (and
hence heterogeneous schemas), dynamic data, and the possibility of
intermittent connectivity or participation. They are peers in the
sense that they are fully autonomous: they contribute and use resources as
they choose, and they may join or leave at any point.
SHARQ (Sharing Heterogeneous and Autonomous Resources and Queries) aims to develop generic tools and
technologies for creating and maintaining
confederations whose purpose is distributed data sharing, that is, data cooperatives.
In response to the difficulties outlined, our solution emphasizes:
- (1) decentralization for both scalability and flexibility,
- (2) incremental development of resources such as schemas, mappings between different schemas, and queries,
- (3) rapid discovery mechanisms for finding the resources relevant to a topic, and
- (4) tolerance for intermittent participation of members and for approximate consistency of mappings.
SHARQ is a collaborative effort with two biological partners: the Computational Biology and Informatics Laboratory, led by Chris Stoeckert,
and the Pew project group, led by Pete White from the Children's Hospital of Philadelphia. We propose to develop a specific data
cooperative as a biological testbed for evaluating the proposed technologies.
Below, we briefly introduce two modules of SHARQ: Orchestra and SHARQ Guide.
The Orchestra system (Ives et al., 2005; Taylor et al., 2006; Green et al., 2006) is the core engine of SHARQ.
Orchestra builds upon concepts from the Piazza peer data management system (PDMS) (Halevy et al., 2004 & 2005).
Orchestra supports the exchange of data and updates among cooperating, heterogeneous databases, making use of
policies to quickly and automatically manage disagreement among conflicting data.
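
As a rough illustration of policy-driven reconciliation, the sketch below resolves conflicting updates by per-peer trust priority. It is not Orchestra's actual algorithm: the `Update` record, the `reconcile` function, and the integer trust priorities are all hypothetical names and simplifications introduced here. Conflicts that the policy cannot decide are set aside rather than forced, in the spirit of tolerating disagreement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """Hypothetical update record published by a peer."""
    peer: str   # peer that published the update
    key: str    # identifier of the data item being updated
    value: str  # proposed new value


def reconcile(updates, trust):
    """Apply a simple trust policy from the point of view of one peer.

    `trust` maps peer names to integer priorities (higher = more trusted).
    Updates from peers absent from `trust` are ignored. For each key, the
    value proposed at the highest priority is accepted; if several distinct
    values share that priority, the key is deferred instead of resolved.
    """
    by_key = {}
    for u in updates:
        if u.peer in trust:
            by_key.setdefault(u.key, []).append(u)

    accepted, deferred = {}, {}
    for key, candidates in by_key.items():
        best = max(trust[u.peer] for u in candidates)
        values = {u.value for u in candidates if trust[u.peer] == best}
        if len(values) == 1:
            accepted[key] = values.pop()
        else:
            deferred[key] = sorted(values)
    return accepted, deferred
```

With this policy, two equally trusted peers that disagree on an item leave it deferred, while a disagreement between a more trusted and a less trusted peer is settled automatically in favor of the former.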
It may be difficult to determine what information is available in the peer network.
SHARQ Guide is therefore being designed to enable biologists to find relevant information
within a peer data management system. It provides assistance not only for users who ask queries,
but also for owners of peers who wish to be registered within the Guide.
Key ideas of the SHARQ Guide include:
- (i) Representing biological entities and relationships as a graph, following the approach of BioGuide (Cohen-Boulakia et al., 2005) (http://bioguide-project.net).
This graph can be extended in a collaborative way by the peer administrators.
- (ii) Expressing queries (a) without having to know or cite the schemas to use for querying (transparent queries) and
(b) using query schema templates.
- (iii) Proposing new features to maximize the amount of data returned to the user, by allowing some fields in the query to be optional.
- (iv) Helping administrators of peers to register their schema.
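
To make ideas (i) through (iii) more concrete, here is a minimal sketch, not the SHARQ Guide implementation. The entity names, relationship labels, and the functions `add_relationship`, `find_path`, and `answer` are hypothetical. It shows an entity graph that can be extended incrementally, a query phrased purely in terms of entities (no peer schema is named), and optional fields that do not cause a record to be dropped when they are missing.

```python
from collections import deque

# Hypothetical entity graph: nodes are biological entities,
# edges are labeled relationships between them.
ENTITY_GRAPH = {
    "Gene": {"Protein": "encodes", "Disease": "associated_with"},
    "Protein": {"Pathway": "participates_in"},
    "Disease": {},
    "Pathway": {},
}


def add_relationship(graph, source, target, label):
    """Extend the graph, as a peer administrator might when registering."""
    graph.setdefault(source, {})[target] = label
    graph.setdefault(target, {})


def find_path(graph, start, goal):
    """Breadth-first search for a shortest chain of entities, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], {}):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def answer(records, required, optional):
    """Keep records that have every required field.

    Optional fields are returned when present and as None otherwise,
    so a missing optional value never excludes a record.
    """
    results = []
    for rec in records:
        if all(rec.get(f) is not None for f in required):
            results.append({f: rec.get(f) for f in required + optional})
    return results
```

For example, `find_path(ENTITY_GRAPH, "Gene", "Pathway")` walks from Gene through Protein to Pathway, and calling `answer` with `required=["gene"]` and `optional=["disease"]` keeps a record that has a gene but no disease annotation.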
Some references
BioGuideSRS: Querying Multiple Sources with a user-centric perspective
[.pdf] Bioinformatics (2007)
Sarah Cohen Boulakia, Olivier Biton, Susan Davidson, Christine Froidevaux
Reconciling while tolerating disagreement in collaborative data sharing
[.pdf] Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD) (2006)
Nicholas Taylor, Zachary Ives
Path-based systems to guide life scientists in the maze of biological data sources.
[.pdf] Journal of Bioinformatics and Computational Biology 4:5, pp. 1069-1095 (October 2006)
Sarah Cohen Boulakia, Susan Davidson, Christine Froidevaux, Zoe Lacroix, Maria-Esther Vidal
SHARQ Guide: Finding relevant biological data and queries in a peer data management system
[.pdf] International Workshop on Data Integration in the Life Sciences (DILS),
Data Integration for the Life Sciences, Poster proceedings (Selected for oral presentation). (2006)
Sarah Cohen Boulakia, Olivier Biton, Shirley Cohen, Zachary Ives, Val Tannen, Susan Davidson
Orchestra: Rapid, collaborative sharing of dynamic data
[.pdf] Biennial Conference on Innovative Data Systems Research (CIDR) (2005)
Zachary Ives, Nitin Khandelwal, Aneesh Kapur, Murat Cakir
Schema mediation for large-scale data sharing
[.pdf] International Conference on Very Large Databases (VLDB) (2005)
Alon Halevy, Zachary Ives, Dan Suciu, Igor Tatarinov
A User-centric Framework for Accessing Biological Sources and Tools.
[.pdf] International Workshop on Data Integration in the Life Sciences (DILS),
Data Integration for the Life Sciences, Lecture Notes in Bioinformatics (LNBI), Num. 3615, pp. 3-18. (2005)
Sarah Cohen Boulakia, Christine Froidevaux, Susan Davidson
Schema Mediation in Peer Data Management Systems
[.pdf] International Conference on Data Engineering (ICDE) (2003)
Alon Halevy, Zachary Ives, Dan Suciu, Igor Tatarinov
Project Members
Zachary Ives, Val Tannen, Susan Davidson, Sarah Cohen Boulakia, Olivier Biton, Nicholas Taylor, Todd J. Green, Grigoris Karvounarakis, Shirley Cohen
Funding
This material is based upon work supported by the National Science Foundation under Grant Nos. 0513778 and 0477972.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.