This year, at VLDB 2025, Penn will be well-represented with a variety of papers.

CausalMesh: A Causal Cache for Stateful Serverless Computing: Haoran Zhang (University of Pennsylvania); Shuai Mu (Stony Brook University); Sebastian Angel (University of Pennsylvania); Vincent Liu (University of Pennsylvania). In stateful serverless computing, workflows are broken into functions that may run on different physical machines, each with its own local cache. This distribution can lead to consistency errors, where one function reads stale data from its cache because a previous function in the same workflow wrote an update to a different machine’s cache. To solve this, researchers at Penn and Stony Brook developed CausalMesh, a novel caching system that guarantees “causal consistency,” ensuring operations are seen in a logical, cause-and-effect order across all machines. A key innovation of CausalMesh is that it provides this guarantee for most read/write operations without requiring costly coordination between servers or aborting transactions. As a result, CausalMesh delivers lower latency and higher throughput than existing approaches, enabling faster and more reliable state management in serverless applications.
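
The stale-read problem described above can be illustrated with a small sketch. This is not CausalMesh's protocol: the `Cache` class, the `checked_read` helper, and the per-key version numbers are all hypothetical, chosen only to show how a workflow that carries its observed versions along can detect that another machine's cache is behind.

```python
class Cache:
    """A per-machine cache mapping key -> (version, value). Hypothetical."""

    def __init__(self):
        self.store = {}

    def write(self, key, value, version):
        self.store[key] = (version, value)

    def read(self, key):
        # Missing keys are treated as version 0 with no value.
        return self.store.get(key, (0, None))


def checked_read(cache, key, deps):
    """Read `key` and report whether the cache is at least as new as the
    version this workflow has already observed for that key."""
    version, value = cache.read(key)
    is_fresh = version >= deps.get(key, 0)
    return value, is_fresh


machine_a, machine_b = Cache(), Cache()
deps = {}  # versions the workflow has written or read so far

# Function 1 runs on machine A and writes x.
machine_a.write("x", "new", version=1)
deps["x"] = 1

# Function 2 runs on machine B before the update has propagated.
stale_value, stale_ok = checked_read(machine_b, "x", deps)

# Once the update reaches machine B, the same read is safe.
machine_b.write("x", "new", version=1)
fresh_value, fresh_ok = checked_read(machine_b, "x", deps)
```

In this toy version the second function can only detect the stale read; a real causal cache must also decide what to serve instead, which is where the paper's contribution lies.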

A Practical Theory of Generalization in Selectivity Learning: Peizhi Wu (University of Pennsylvania), Haoshu Xu (University of Pennsylvania), Ryan Marcus (University of Pennsylvania), Zack Ives (University of Pennsylvania). This research provides a theoretical understanding of machine learning models used for query optimization in databases. While these models perform well in practice, there has been a significant gap in explaining why they work, especially when they encounter queries that differ from those they were trained on (“out-of-distribution,” or OOD, queries). The paper bridges this gap by establishing the first theoretical guarantees for how these models generalize to OOD queries. Based on these insights, the authors developed practical strategies that significantly improve the accuracy and real-world performance of existing models on unseen query types, making them more robust and reliable without sacrificing their original performance.

Holistic query Approximation via RL Modeling: Susan Davidson (University of Pennsylvania), Tova Milo (Tel Aviv University), Kathy Razmadze (Tel Aviv University), Gal Zeevi (Tel Aviv University). To accelerate slow queries during data exploration on large databases, researchers at Tel Aviv University and Penn have developed HARLM, a novel system for approximate query processing. While existing methods speed up aggregate queries (like COUNT or AVG) by using data samples, they fail to support non-aggregate queries that retrieve specific rows. HARLM presents a holistic solution by using Reinforcement Learning to identify an optimized, smaller subset of the data that works for both query types. This approach effectively learns to create a representative data sample that maximizes query accuracy while dramatically reducing execution time. Experiments show that HARLM significantly outperforms baseline methods, improving result accuracy by 30% and providing a 10-35x speedup.
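
To make the sampling baseline concrete, here is a minimal sketch of answering AVG and COUNT from a uniform random sample. It illustrates the conventional approach that the summary contrasts with HARLM, not HARLM itself; the table contents, sample size, and helper names are assumptions made for illustration.

```python
import random

rng = random.Random(0)             # fixed seed so the sketch is repeatable
table = list(range(1, 10_001))     # hypothetical column with 10,000 values
sample = rng.sample(table, 1_000)  # 10% uniform sample without replacement
scale = len(table) / len(sample)   # each sampled row stands for `scale` rows


def approx_avg(rows):
    """Estimate AVG directly from the sampled rows."""
    return sum(rows) / len(rows)


def approx_count(rows, predicate, scale):
    """Estimate COUNT by scaling up the number of matching sampled rows."""
    return scale * sum(1 for row in rows if predicate(row))


est_avg = approx_avg(sample)
est_count = approx_count(sample, lambda v: v > 5_000, scale)

# A non-aggregate query asks for specific rows. The full table always
# contains the row, but a uniform sample may simply not include it.
exact_rows = [v for v in table if v == 4_242]
sample_rows = [v for v in sample if v == 4_242]
```

The aggregate estimates land close to the true values because errors average out across the sample, whereas the row lookup either finds its row or does not. That asymmetry is the gap the summary describes.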

SHARQ: Explainability Framework for Association Rules on Relational Data: Hadar Ben‑Efraim (Bar-Ilan University), Susan B. Davidson (University of Pennsylvania), Amit Somech (Bar-Ilan University). Association rule mining is a widely used technique for discovering patterns (e.g., “customers who buy X also buy Y”) in large datasets. However, a major challenge has been to quantify the actual importance of an individual data element, like “X,” to the entire set of rules it participates in. This paper introduces SHARQ, a novel method that uses Shapley values, a concept from cooperative game theory, to fairly and accurately measure the contribution of each element. While a naive calculation would be exponentially slow, the researchers developed highly efficient algorithms that compute this score in near-linear time. This breakthrough makes it practical to rank data elements, entire rules, and even attributes by their influence, providing a powerful new tool for explaining and gaining deeper insights from mined data.
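
For readers unfamiliar with Shapley values, the naive calculation mentioned above can be sketched as follows. This is the textbook enumeration over all orderings, not SHARQ's algorithm; the three toy rules and the value function (the number of rules fully covered by a set of items) are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical toy rules: each rule is represented by the items it mentions.
RULES = [frozenset({"X", "Y"}), frozenset({"X", "Z"}), frozenset({"Y"})]
ITEMS = sorted(set().union(*RULES))


def value(coalition):
    """Illustrative value function: rules whose items all lie in the coalition."""
    return sum(1 for rule in RULES if rule <= coalition)


def shapley_values(items, value_fn):
    """Exact Shapley values: average each item's marginal contribution
    over every ordering of the items (factorial cost in len(items))."""
    totals = {item: Fraction(0) for item in items}
    orderings = list(permutations(items))
    for order in orderings:
        seen = set()
        for item in order:
            before = value_fn(frozenset(seen))
            seen.add(item)
            totals[item] += value_fn(frozenset(seen)) - before
    return {item: total / len(orderings) for item, total in totals.items()}


scores = shapley_values(ITEMS, value)
```

With these toy rules, X scores 1, Y scores 3/2, and Z scores 1/2, and the scores sum to the three rules being explained. Because the loop visits every ordering, the cost grows factorially with the number of items, which is why an efficient algorithm matters at realistic scale.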

Data-Agnostic Cardinality Learning from Imperfect Workloads: Peizhi Wu (University of Pennsylvania), Rong Kang (ByteDance), Tieying Zhang (ByteDance), Jianjun Chen (ByteDance), Ryan Marcus (University of Pennsylvania), Zack Ives (University of Pennsylvania). The authors, at ByteDance and Penn, have developed a new system called GRASP for cardinality estimation, a crucial task in database query optimization. Traditional methods need direct access to data, which is often restricted, while existing learning-based approaches struggle with the incomplete and imbalanced query workloads found in real-world scenarios. GRASP is a data-agnostic system specifically designed for these imperfect conditions. It uses a novel compositional design that allows it to generalize to new queries and is robust to skewed training data. By effectively modeling data distributions and join correlations without seeing the underlying data, GRASP consistently outperforms other query-driven models and, remarkably, can even match or exceed the accuracy of traditional methods that have full data access.
