This expanded reading list includes many papers selected for the Red Book (Bailis et al.) as well as other papers that connect to the major research themes of the Database Group.
Systems
- Astrahan et al.: System R: Relational Approach to Database Management. TODS 1(2), 1976.
- Stonebraker et al.: The Design and Implementation of INGRES. TODS 1(3), 1976.
- Stonebraker and Rowe: The Design of POSTGRES. SIGMOD 1986.
- Lohman et al: Extensions to Starburst: Objects, Types, Functions, and Rules. CASCON 2010.
- Li and Zhang. HTAP Databases: What Is New and What is Next. SIGMOD 2022.
- Stonebraker et al: Mariposa: A Wide-Area Distributed Database System.
- Ives et al. The ORCHESTRA Collaborative Data Sharing System. SIGMOD Record, 37(3), 2008.
- Balakrishnan et al.: Retrospective on Aurora. VLDB Journal Vol 13, 2004.
- Aref et al: Design and Implementation of the LogicBlox System. SIGMOD 2015.
- Alexandrov et al: The Stratosphere Platform for Big Data Analytics. VLDB Journal 23(6), 2014.
Beyond DBMSs
- Borgida: Description logics in data management. TKDE 1995.
- Bernstein: Applying Model Management to Classical Meta Data Problems. CIDR 2003.
- Weld: Recent advances in AI planning. AI Magazine, 1999.
- Zaharia et al: Accelerating the Machine Learning Lifecycle with MLflow. IEEE Data Engineering Bulletin, 2018.
- Atkinson et al: Scientific workflows: Past, present and future. Future Generation Computer Systems 75, October 2017.
- Qin et al. Making data visualization more efficient and effective: a survey. VLDB Journal 29, 2020.
- Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
Query Optimization
Classic dynamic programming architecture:
- Chaudhuri: Overview of Query Optimization in Relational Systems. PODS 1998.
- Selinger et al.: Access Path Selection in a Relational Database Management System. SIGMOD 1979.
Rule-based optimization and pruning:
- Graefe and DeWitt. The EXODUS Optimizer Generator, SIGMOD 1987; and Graefe and McKenna. Volcano Optimizer Generator, ICDE 1993.
- Haas et al. Extensible Query Processing in Starburst. SIGMOD 1989.
Learned query optimization:
- Stillger et al. LEO - DB2’s LEarning Optimizer. VLDB 2001.
- Marcus et al. Neo: A Learned Query Optimizer. VLDB 2019.
- Marcus et al. Bao: Making Learned Query Optimization Practical. SIGMOD 2021.
- Yang et al.: Balsa: Learning a Query Optimizer Without Expert Demonstrations. SIGMOD 2022.
Multi-objective optimization:
- Papadimitriou and Yannakakis. Multiobjective Query Optimization. PODS 2001.
- Trummer and Koch: A Fast Randomized Algorithm for Multi-Objective Query Optimization. SIGMOD 2016.
Other interesting papers:
- Leis et al: How good are query optimizers, really? VLDB 2015.
- Chu et al: Optimizing Distributed Protocols with Query Rewrites. SIGMOD 2024.
Query Execution
- Graefe: Query Evaluation Techniques for Large Databases. ACM Computing Surveys, 1993.
- Kossmann: The State of the Art in Distributed Query Processing. ACM Computing Surveys, 2000.
- Mackert and Lohman: R* Optimizer Validation and Performance Evaluation for Distributed Queries. VLDB 1986.
- Babu and Widom: Continuous queries over data streams. SIGMOD Record, 2001.
- Dean and Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.
- Kersten et al.: Everything you always wanted to know about compiled and vectorized queries but were afraid to ask. VLDB 2018.
- Kersten et al: Tidy Tuples and Flying Start. VLDB Journal, Vol 30, 2021.
  - A follow-up to Neumann: Efficiently compiling efficient query plans for modern hardware. VLDB 2011.
- Rieger et al: Integrating Deep Learning Frameworks into Main-Memory Databases. AIDB Workshop, 2022.
Exploration, Top-K, Pruning:
- Fagin et al: Optimal Aggregation Algorithms for Middleware. PODS 2001.
- Bancilhon and Ramakrishnan: An Amateur’s Introduction to Recursive Query Processing Strategies. SIGMOD 1986.
- Ngo: Worst-Case Optimal Join Algorithms: Techniques, Results, and Open Problems. PODS 2018.
- Liu et al: Enabling Incremental Query Re-Optimization. SIGMOD 2016.
Indexing and Storage
- Stonebraker: OS Support for Database Management. CACM 1981.
- Beckmann et al: R* Tree. SIGMOD 1990.
- Hellerstein et al.: Generalized Search Trees for Database Systems. VLDB 1995.
- Rao and Ross. Making B+-Trees cache conscious in main memory. SIGMOD 2000.
- Abadi et al: Integrating compression and execution in column-oriented database systems. SIGMOD 2006.
- Bawa et al: LSH Forest. WWW 2005.
- Abadi et al.: Column-stores vs. row-stores: how different are they really? SIGMOD 2008.
- Kraska et al. The case for learned index structures. SIGMOD 2018.
- Nathan et al.: Learning Multi-Dimensional Indexes. SIGMOD 2020.
- Hentschel et al.: Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation. SIGMOD 2018.
- Pinecone overview of similarity search and Foundations of HNSW.
- Malkov and Yashunin: Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
Modeling Database Instances
- Kipf et al: Learned Cardinalities: Estimating Correlated Joins with Deep Learning. CIDR 2019.
- Negi et al: Robust Query Driven Cardinality Estimation under Changing Workloads. PVLDB 16(6), 2023.
- Ma et al: Active Learning for ML Enhanced Database Systems. SIGMOD 2020.
- Hilprecht et al: DeepDB: Learn from Data, not from Queries! PVLDB 13(7).
Adaptivity
- Kabra and DeWitt: Mid-Query Re-optimization. SIGMOD 1998.
- Avnur and Hellerstein: Eddies: Continuously Adaptive Query Processing. SIGMOD 2000.
- Babu et al: Adaptive Ordering of Pipelined Stream Filters. SIGMOD 2004.
- Ives and Taylor: Sideways information passing for push-style query processing. ICDE 2008.
- Bruno and Chaudhuri: Statistics on Query Expressions. SIGMOD 2002.
- Markl et al. Robust query processing through progressive optimization. SIGMOD 2004.
Concurrency
- Gray et al. Granularity of Locks and Degrees of Consistency in a Shared Data Base.
- Kung and Robinson: On Optimistic Methods for Concurrency Control.
Logging and Recovery
- Mohan et al. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging.
Web and Integration
- Levy et al. Querying Heterogeneous Information Sources Using Source Descriptions. VLDB 1996.
- Garcia-Molina et al. The TSIMMIS Approach to Mediation: Data Models and Languages. Journal of Intelligent Information Systems, 1997.
- Deutsch et al. Physical Data Independence, Constraints, and Optimization with Universal Plans. VLDB 1999.
- Karvounarakis et al: Collaborative Data Sharing via Update Exchange and Provenance. TODS 38(3), August 2013.
Entity Resolution / Schema Matching
- Hernandez et al. Clio: A Semi-Automatic Tool for Schema Mapping. SIGMOD Record, 2001.
- Doan et al. Reconciling Schemas of Disparate Data Sources. SIGMOD 2001.
- Rahm and Bernstein. A survey of approaches to automatic schema matching. VLDB Journal Vol 10, 2001.
- Papadakis et al. Four Generations of Entity Resolution. Springer, 2021.
Schema Design and Evolution
- Miller et al: The Use of Information Capacity in Schema Integration and Translation. VLDB 1993.
- Curino et al: Automating the Database Schema Evolution Process. VLDB Journal Vol 22, 2013.
- Green and Ives: Recomputing Materialized Instances after Changes to Mappings and Data. ICDE 2012.
Data Cleaning and Fusion
- Fuxman et al: ConQuer: Efficient Management of Inconsistent Databases. SIGMOD 2005.
- Bleiholder and Naumann: Data Fusion. ACM Computing Surveys, 41(1), 2008.
- Abedjan et al: Profiling Relational Data: A Survey. VLDB Journal Vol 24, 2015.
Views and Recursion
- Levy et al. Answering Queries Using Views. PODS 1995.
- Duschka and Genesereth. Answering Recursive Queries Using Views. PODS 1997.
- Gupta and Mumick. Magic-sets Transformation in Nonrecursive Systems.
- Gupta et al. Maintaining Views Incrementally. SIGMOD 1993.
- Halevy. Answering Queries Using Views. VLDB Journal, 2001.
Semistructured Data and Graphs
- Brin and Page: The Anatomy of a Large-Scale Hypertextual (Web) Search Engine. Computer Networks and ISDN Systems, 1998.
- Goldman and Widom: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB 1997.
- Milo and Suciu: Index Structures for Path Expressions. ICDT 1999.
- Shkapsky et al: Graph Queries in a Next-Generation Datalog System. PVLDB 6(12).
- Deutsch et al: Graph Pattern Matching in GQL and SQL/PGQ. SIGMOD 2022.
- Han et al: Implementation Strategies for Views over Property Graphs. SIGMOD 2024.
Data Lakes
- Talukdar et al: Learning to Create Data-Integrating Queries. VLDB 2008.
- Zhu et al: Josie: Overlap set similarity search for finding joinable tables in data lakes. SIGMOD 2019.
- Zhang and Ives: Finding Related Tables in Data Lakes for Interactive Data Science. SIGMOD 2020.
- Khatiwada et al: Integrating data lake tables. PVLDB 16(4), 2022.
- Khatiwada et al: Santos: Relationship-based semantic table union search. SIGMOD 2023.
- Zhang et al: Searching Data Lakes for Nested and Joined Data. PVLDB 17(11), 2024.
- Arora et al: Language Models Enable Simple Systems for Generating Structured Views of Data Lakes. VLDB 2023.
ML, Embeddings, and LLMs
- Vaswani et al: Attention is All You Need. NeurIPS 2017.
- Wei et al: Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 2022.
- Yang et al: TableFormer: Robust Transformer Modeling for Table-Text Encoding. ACL 2022.
- Patel et al: ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data. SIGMOD 2024.
- Santhanam et al: ALTO: An Efficient Network Orchestrator for Compound AI Systems. EuroMLSys 2024.
- Khattab et al: DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines. ICLR 2024.
- Huang et al: From Detection to Application: Recent Advances in Understanding Scientific Tables and Figures. ACM Computing Surveys, 2024.
- Kasem et al: Deep Learning for Table Detection and Structure Recognition. ACM Computing Surveys, 2024.