Today the predominant mode of interacting with data has changed: rather than working with highly controlled, regularized databases, data scientists tend to work with a variety of data sources within computational notebook software such as Jupyter Notebook and JupyterLab. Such software allows for ad hoc discovery as well as for the creation of sophisticated data analyses and machine learning models. A key issue becomes managing the many data products (tables, dataframes, models) that are produced. There is also a key opportunity to help new users understand prior best practices in using, importing, cleaning, extracting, and analyzing datasets.
Data lakes promise to centralize and capture such data resources. However, what is missing is a means of managing the plethora of datasets and versions in the lake. Data scientists often end up doing redundant work because they have no effective way of finding appropriate resources to reuse and retarget to new applications. Data scientists need a set of holistic data management tools to find, standardize, and benefit from the existing resources in the data lake.
The Juneau project addresses the following challenges:
- Effectively and efficiently storing data at many stages of processing in pipelines, such that any source, intermediate, or output representation can be searched or retrieved.
- Providing tools for searching for relevant tables in the data lake, based on existing data fragments, schema elements, and tasks.
- Learning from existing data and processing pipelines to promote reuse of data and processing steps, instead of having users constantly start “from scratch.”
- Inferring, based on computations performed by the user community, the best schemas and representations for sharing across subcommunities.
- Developing models and tools for retargeting data across data processing platforms and tools.
It does this by extending the data management layer underneath the popular Jupyter Notebook/JupyterLab ecosystem, which many data scientists are using.
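To make the table-search challenge above concrete, here is a minimal sketch of ranking tables in a lake by how related they are to a query table. It is not Juneau's actual algorithm: the scoring (an equal-weight average of Jaccard overlap on column names and on the values of shared columns), the function names `jaccard`, `relatedness`, and `rank_related`, and the toy tables are all assumptions made for illustration. Tables are modeled as plain dictionaries rather than dataframes so the example needs only the standard library.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for a dataframe: column name -> list of hashable values.
Table = Dict[str, List[object]]


def jaccard(a: Set[object], b: Set[object]) -> float:
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relatedness(query: Table, candidate: Table) -> float:
    """Equal-weight average of schema overlap and value overlap (an assumed heuristic)."""
    schema_score = jaccard(set(query), set(candidate))
    shared = set(query) & set(candidate)
    if not shared:
        return 0.0  # no common columns, so nothing to compare
    # Average the value overlap across the columns both tables have.
    value_score = sum(
        jaccard(set(query[c]), set(candidate[c])) for c in shared
    ) / len(shared)
    return 0.5 * schema_score + 0.5 * value_score


def rank_related(query: Table, lake: Dict[str, Table], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most related tables, highest score first (ties broken by name)."""
    scored = [(name, relatedness(query, table)) for name, table in lake.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy data lake with a raw table, a cleaned version of it, and an unrelated table.
lake = {
    "patients_raw": {
        "id": [1, 2, 3, 4],
        "age": [30, 41, 57, 62],
        "zip": ["19104", "19103", "19104", "19146"],
    },
    "patients_clean": {"id": [1, 2, 3], "age": [30, 41, 57]},
    "weather": {"date": ["2020-01-01", "2020-01-02"], "temp": [3.5, 1.0]},
}
query = {"id": [1, 2], "age": [30, 41]}

ranking = rank_related(query, lake)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With this toy lake, the cleaned patient table ranks first, the raw patient table second, and the unrelated weather table last. A real system would need scalable indexing and richer signals (such as how a table was derived) rather than a pairwise scan over every table.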
Project lead: Zachary Ives.
- Yi Zhang
- Soonbo Han
- Raghav Vedire
- Peter Baile Chen
Publications:
- Dataset Relationship Management. Zachary G. Ives, Yi Zhang, Soonbo Han, Nan Zheng. CIDR 2019.
- Demonstration Description: Finding Related Tables in the Data Lake. Yi Zhang, Zachary G. Ives. VLDB 2019. Awarded Best Demonstration.
- Finding Related Tables in Data Lakes for Interactive Data Science. Yi Zhang, Zachary G. Ives. SIGMOD 2020.
- Compact, Tamper-Resistant Archival of Provenance. Nan Zheng and Zachary G. Ives. Proc. VLDB Endowment, 2020.