The Data Source #26 | Rethinking Data Catalogs and Query Engines 🧙

Nov 13, 2024

Welcome to The Data Source, your monthly newsletter covering the top investment themes across cloud infrastructure, developer tools, and data.

The recent acquisition of Tabular by Databricks has sparked many discussions within the data infrastructure community. While most of my recent conversations with data practitioners have been fixated on the table format war (Iceberg vs Delta Lake vs Hudi 🤔), I think there's a deeper story unfolding around how we manage and query data at scale.

Beyond the Consolidation Narrative 🍫

The enterprise data landscape has evolved to require both robust metadata management and modern table format support. Major platforms showcase this duality: Snowflake combines its Polaris Catalog for governance with separate Iceberg table support, Databricks pairs Unity Catalog with native Iceberg capabilities, and Confluent offers Stream Catalog alongside Tableflow for streaming table formats.

Beyond these platforms, specialized catalog vendors such as Acryl Data, Select Star, Atlan, and data.world focus on enhancing data discovery and governance, while open-source projects like DataHub and Amundsen continue to mature.

This diversity reflects the complex reality of enterprise data needs. Different catalogs emerged from solving distinct challenges in modern data systems, each addressing unique requirements around metadata management, governance, and discovery.

The evolution of table formats like Apache Iceberg illustrates these technical complexities. While Iceberg offers transactional guarantees and efficient metadata handling for large-scale data lakes, its adoption varies based on platform architecture:

Snowflake supports Iceberg through external tables, enabling interoperability with its cloud data warehouse.
Databricks integrates Iceberg support natively within its lakehouse architecture, focusing on interoperability with its existing Delta Lake format.
Confluent's Tableflow focuses on streaming table support, optimizing for continuous data flow and schema evolution within Kafka-based pipelines.

This architectural diversity extends into two distinct query engine designs. The first, exemplified by Snowflake, Databricks, and Confluent, embeds metadata management directly into the query processing layer. This tight coupling enables automated metadata collection, optimized data access patterns, and intelligent query planning using platform-specific metadata. The second, represented by Apache DataFusion and DuckDB, maintains strict separation through abstraction layers, prioritizing modularity and the independent evolution of components.
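To make the second design concrete, here is a minimal sketch of what "separation through abstraction layers" can look like. It is not how DataFusion or DuckDB are actually implemented; every name in it (TableMetadata, Catalog, InMemoryCatalog, plan_scan) is hypothetical and exists only for illustration. The point is that the planner sees an interface, not a specific catalog.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class TableMetadata:
    """Technical metadata a planner needs: data location, columns, row count."""
    location: str
    columns: tuple[str, ...]
    row_count: int


class Catalog(Protocol):
    """Hypothetical catalog interface; the engine depends only on this."""

    def resolve(self, table: str) -> TableMetadata: ...


class InMemoryCatalog:
    """Stand-in for any catalog backend (a REST service, a metastore, etc.)."""

    def __init__(self, tables: dict[str, TableMetadata]) -> None:
        self._tables = tables

    def resolve(self, table: str) -> TableMetadata:
        return self._tables[table]


def plan_scan(catalog: Catalog, table: str, wanted: list[str]) -> dict:
    """Build a toy scan plan using only what the catalog interface exposes."""
    meta = catalog.resolve(table)
    missing = [c for c in wanted if c not in meta.columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    return {
        "location": meta.location,
        "project": wanted,
        "estimated_rows": meta.row_count,
    }


# Example data is made up for illustration.
catalog = InMemoryCatalog({
    "orders": TableMetadata("s3://example-bucket/orders", ("id", "amount", "ts"), 1_000),
})
plan = plan_scan(catalog, "orders", ["id", "amount"])
```

Swapping InMemoryCatalog for another backend would not touch plan_scan, which is the modularity trade-off in miniature. An integrated platform gives up that swap in exchange for planning against richer, platform-specific metadata.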

Both approaches have found success in different contexts: integrated systems have established a strong track record in enterprise deployments, while modular architectures are gaining impressive momentum in embedded and analytical applications.

The Two Paths of Data Catalog Evolution 🚵

I see the evolution of data catalogs reflecting two distinct paths. Platform-native catalogs in Snowflake, Databricks, and Confluent focus on performance-critical technical metadata like table schemas, statistics, and access patterns. Separate discovery platforms optimize for cross-platform lineage, business context, and collaboration. The difference is similar to how database indexes differ from library catalogs in their fundamental purpose.

Both approaches, platform-native and independent, are technically mature, yet integrating them is the next major challenge in data infrastructure. While each path has proven its value, enterprises now face significant technical hurdles in multi-platform synchronization, permission mapping, and metadata consistency, among other challenges.
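As a small illustration of the metadata consistency problem, the sketch below compares the schemas that two catalogs report for the same tables and lists where they disagree. It assumes each catalog can be exported as a plain mapping of table name to column types; the function name diff_schemas and the sample data are hypothetical, and real catalogs expose far richer metadata than this.

```python
def diff_schemas(
    native: dict[str, dict[str, str]],
    discovery: dict[str, dict[str, str]],
) -> dict[str, list[str]]:
    """Report tables whose schemas differ between two catalog exports."""
    drift: dict[str, list[str]] = {}
    for table in sorted(set(native) | set(discovery)):
        a = native.get(table)
        b = discovery.get(table)
        if a is None:
            drift[table] = ["missing in platform-native catalog"]
        elif b is None:
            drift[table] = ["missing in discovery catalog"]
        else:
            # Compare column by column; a column absent on one side shows as None.
            issues = [
                f"{col}: {a.get(col)} vs {b.get(col)}"
                for col in sorted(set(a) | set(b))
                if a.get(col) != b.get(col)
            ]
            if issues:
                drift[table] = issues
    return drift


# Made-up exports from a platform-native catalog and a discovery catalog.
native_export = {
    "orders": {"id": "bigint", "amount": "decimal"},
    "users": {"id": "bigint"},
}
discovery_export = {
    "orders": {"id": "bigint", "amount": "double"},
}
drift = diff_schemas(native_export, discovery_export)
```

Here drift flags the type mismatch on orders.amount and the users table that the discovery catalog has not picked up. Doing this continuously, across several platforms, and for permissions as well as schemas is where the integration work gets hard.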

As a Seed investor, I think the next phase of innovation won't come from new catalogs or query engines, but from solutions that bridge these architectural divides. The most valuable tools will turn today's complex integration patterns into standardized, configurable solutions that work seamlessly across platform-native and independent catalogs.

Call for Startups 🤑

If you are a founder or data practitioner thinking about this space at all, please send me a note via email (priyanka@work-bench.com) or Twitter (@psomrah) as I’m actively looking to make an investment here!

Priyanka 🌊

…

I’m a Principal at Work-Bench, a Seed stage enterprise-focused VC fund based in New York City. Our sweet spot for investment at Seed correlates with building out a startup’s early go-to-market motions. In the cloud-native infrastructure and developer tool ecosystem, we’ve invested in companies like Cockroach Labs, Run.house, Prequel.dev, Autokitteh and others.
