The War of the Catalogs
Databricks Unity Catalog, Snowflake Polaris, and the future of cataloging
Apparently this summer is the “War of the Catalogs”.
At its annual Summit, Snowflake announced the launch of its Polaris Catalog, an open data catalog designed for Iceberg tables — earlier this week, they even open-sourced it! This allows multiple engines (currently Apache Doris, Apache Flink, Apache Spark, PyIceberg, StarRocks, and Trino) to read and write Iceberg tables through Polaris. With cross-engine read and write interoperability, Polaris provides a vendor-neutral approach that gives engineers, developers, and architects more control and clarity over their Iceberg data, wherever it resides.
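To make that interoperability concrete, here's a minimal sketch of what pointing an engine at a REST-style Iceberg catalog such as Polaris can look like in PySpark. The catalog name, URI, and credential below are placeholders rather than Polaris specifics, and the exact package versions depend on your Spark and Iceberg setup.

```python
from pyspark.sql import SparkSession

# Minimal sketch: configure Spark to talk to an Iceberg REST catalog
# (Polaris exposes the Iceberg REST catalog API). The catalog name, URI,
# and credential are illustrative placeholders, not Polaris defaults.
spark = (
    SparkSession.builder
    .appName("polaris-sketch")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://<account>.example.com/api/catalog")   # placeholder
    .config("spark.sql.catalog.polaris.credential", "<client_id>:<client_secret>")          # placeholder
    .getOrCreate()
)

# Any engine that speaks the Iceberg REST protocol can read and write the
# same tables, e.g.:
spark.sql("SELECT * FROM polaris.analytics.orders LIMIT 10").show()
```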
A few days later on the main stage of its own Summit, Databricks open-sourced Unity Catalog, its catalog for data and AI governance across clouds, data formats, and data platforms. Unity Catalog serves as a central repository for all data assets within a Databricks account, coupled with a governance framework and an extensive audit log of all actions performed on the data. Its unified governance solution aims to help technical data teams optimize their data estates and manage their data in a cohesive, centralized manner.
These announcements sparked a ton of debate in the data world. Some people framed it as a battle or war between giants of the data space. Some talked about how this will change the data catalog space, while others debated a more fundamental question — are these even catalogs?
Since I wrote about Data Catalog 3.0 back in 2020 (and obviously in my role as the founder of Atlan, known for our top-right position in the Forrester Wave Enterprise Data Catalogs for DataOps), I get asked about data catalogs a lot — “How is Tableau as a catalog?” or “How does Atlan compare to Unity Catalog?” They seem like simple questions, but that’s far from the truth.
Despite the buzz around catalogs these days, there’s no single, shared definition of what a catalog is. Everyone defines the space differently, leading to arguments over whether data catalogs are dead, overhyped, or mission-critical. The answer is all of the above — it really depends on what type of catalog you’re talking about.
In today’s issue, I want to break down what a data catalog actually is. Though we talk about this as just one category, I think there are three distinct types of catalogs — technical, embedded, and universal catalogs. Here’s a glossary to help decode what each type looks like, how they work together, and where they’re all breaking down today.
🗂️ The three types of data catalogs you should know
1. The technical catalog — exposing technical metadata from one data tool
What does this look like?
A technical catalog provides comprehensive metadata and context for a single data source or data tool. Earlier examples include AWS Glue and Azure Purview, and now Snowflake and Databricks are leading the way in creating great open-source technical catalogs with Polaris and Unity Catalog.
What problem does this solve?
Companies don’t want their data services to be black boxes, so technical catalogs aim to solve the “context” problem by exposing metadata about data assets, products, lineage, tags, policies, and more for a single data source. Some, like Unity Catalog, also provide governance capabilities (such as applying access control policies on your data). (Side note: Governance is another overused, buzzy term that probably deserves its own article like this one soon.)
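For a flavor of what that governance layer looks like in practice, here's a rough sketch of table-level access control in the style of Unity Catalog's SQL grants. It assumes the ambient `spark` session of a Databricks notebook with Unity Catalog enabled, and the catalog, schema, table, and group names are made up for illustration.

```python
# Rough sketch of the kind of access-control policy a technical catalog
# such as Unity Catalog can enforce. Assumes the `spark` session of a
# Databricks notebook; catalog, schema, table, and group names are made up.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Every grant and access is also captured in the audit log, which is what
# makes the same metadata useful for governance reviews later.
```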
What will happen next?
Metadata becomes the “single sign-on” for the data ecosystem, and technical catalogs are a base requirement for every tool.
I think that Databricks and Snowflake showed the natural next step for technical catalogs — surfacing their metadata to share with users or other tools. Over time, exposing high-quality metadata will likely become a staple requirement for all technical tools. This won’t be limited to data storage and compute platforms, though. It is likely to extend to tooling like ELT/ETL — for example, we saw Fivetran open up their Metadata API a couple of years ago, and now other vendors in the space are following suit.
At the Databricks keynote, Ali Ghodsi made a passionate case for open data formats: “Stop giving your data to vendors. They’ll just lock you in.” In a world with skyrocketing innovation and constant change, people are tired of siloed data estates where products don’t talk to one another. A sentiment I’ve heard many data leaders echo is, “I don’t want to be locked into one vendor’s roadmap.”
While I think that open data formats are a step in the right direction (though I’m a bit bullish about Apache Iceberg and Delta), they aren’t enough on their own to help customers truly avoid lock-in. Instead, from our vantage point at Atlan, I’ve been fortunate to see the power of open metadata in connecting entire data estates from source systems to BI. Access to metadata via an open format is becoming a core evaluation criterion for companies, both to prevent silos and to power more complex automated tasks.
2. The embedded catalog — the metadata consumption layer for one data tool
What does this look like?
An embedded catalog surfaces metadata to help users understand and trust data assets within a given data product. For example, dbt Explorer allows dbt users to better explore data assets in dbt. Similarly, Tableau Catalog shows context and metadata for Tableau assets, and Snowflake Horizon adds context and governance capabilities for assets in Snowflake’s AI Data Cloud.
What problem does this solve?
Data users need context about a data asset as they’re using it within a given tool. For example, if someone is using Tableau, they need to know whether the data they’re using is up to date and from the correct time period, verified as trustworthy, and so on. Without this context, they’ll just end up with dashboards that people don’t use or trust.
What will happen next?
Embedded catalogs become part of every data product, rather than adding yet another app to data people’s desktops.
I always like to start by thinking about where the industry is headed with two key perspectives in mind:
What’s better for the end user experience?
What’s the incentive for the product category to invest in improving the end user experience?
In this case, the answer is simple. The ideal solution for an end user is to get the context that they need, where and when they need it. I wrote about this when I introduced the concept of active metadata back in 2021.
Active metadata sends metadata back into every tool in the data stack, giving the humans of data context wherever and whenever they need it — inside the BI tool as they wonder what a metric actually means, inside Slack when someone sends the link to a data asset, inside the query editor as they try to find the right column, and inside Jira as they create tickets for data engineers or analysts.
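As a toy illustration of what that push model can look like, here's a short sketch that posts an asset's certification status and freshness into a Slack channel via an incoming webhook. The asset payload and webhook URL are invented for this example; a real active metadata platform would source them from the catalog and its integrations.

```python
import requests

# Toy sketch of active metadata: push an asset's context into Slack instead
# of waiting for someone to open a catalog tab. The asset payload and the
# webhook URL are invented for illustration.
asset = {
    "name": "finance.revenue_daily",
    "certified": True,
    "owner": "@data-platform",
    "last_refreshed": "2024-06-14 06:00 UTC",
}

message = (
    f"*{asset['name']}*\n"
    f"Certified: {'yes' if asset['certified'] else 'no'} | "
    f"Owner: {asset['owner']} | Last refreshed: {asset['last_refreshed']}"
)

# Slack incoming webhooks accept a simple JSON body with a "text" field.
requests.post(
    "https://hooks.slack.com/services/T000/B000/XXXX",  # placeholder webhook
    json={"text": message},
    timeout=10,
)
```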
For example, our browser extension takes context from the entire data stack and makes it available directly in tools like BI and even Snowflake. However, a year after its launch, I think we’re now at a point where every tool in the data stack has an incentive to embed these catalogs in their respective products.
Data platforms are now realizing that governance and data democratization are two sides of the same coin. Most of these platforms price based on usage, so to increase revenue per customer, they need to increase usage (aka drive democratization). But to drive democratization, they need users to trust their data and have the context to use it — enter the need for governance.
This is why we’re seeing the new wave of embedded “catalogs”. Rather than opening a separate tab to view context for a data tool, that context can be added to the tool itself. For example, adding better search, discoverability, and context directly within data products’ interfaces can minimize switching costs and increase trust for users. Many companies have already started down this path, but I think we’ll see even more investment in the near future.
But the question is, how does Tableau or Snowflake get the context they need from across the stack to power the metadata that enriches the end user experience? They could integrate metadata from hundreds of sources directly and invest a ton of engineering bandwidth to support those integrations… or they could simply integrate with the API of a true active metadata platform.
3. The universal catalog or “catalog of catalogs” — metadata across the data estate
What does this look like?
Think of this as the “catalog of catalogs”. A universal catalog is a centralized repository with metadata from all across the data estate. It creates universal context by ingesting metadata from every data tool (including the metadata exposed by technical catalogs or embedded catalogs). This metadata and context then exists in its own tab or software — think of Alation or Collibra.
What problem does this solve?
While some companies have simple data estates with one main data source or data tool, most are more complex. They have data flowing in from multiple sources, processed across multiple tools, visualized across an array of reports and dashboards, and turned into data products via different internal or external tools, all done by a diverse range of data users. Rather than using piecemeal catalogs for each data source and tool, universal catalogs connect across and bring clarity to sprawling data estates.
What will happen next?
Universal catalogs are important but won’t become the be-all and end-all in cataloging.
Universal catalog vendors often argue that organizations will usually have multiple catalogs for multiple tools and use cases, so it’s important to bring this metadata together into one “catalog of catalogs”.
While this is an important step in the cataloging space, it’s not the be-all and end-all. I think this approach is overly simplistic and often based on poor tooling choices. Ingesting metadata from every tool (rather than having a separate technical or embedded catalog for each tool) is important, but there’s a better way. Keep reading for more details!
🚀 The natural evolution of data catalogs: the unified data and AI control plane
With the majority of data leaders focused on Gen AI this year, metadata is more important now than ever before. The problem is, none of these types of data catalogs are sufficient for today’s needs.
First, the modern data ecosystem is more diverse than ever before. From data sources such as databases and data lakes, to ingestion and processing tools like Apache Kafka and Apache Spark, to end-user tools like BI platforms and dashboards, each tool plays a specific role in the data lifecycle. Having a separate catalog to understand and manage each tool just isn’t practical. And while connecting to diverse tools (like a universal catalog does) is great, it’s limited to use cases like discovery and root cause analysis, often leaving use cases like governance, privacy and security, or quality and trust to other tools.
The users of this data stack are also as varied as the tools themselves. They include analysts, engineers, data scientists, business analysts, and financial analysts, each with their own ways of working and specific needs. For example, data engineers work within data sources and pipelines, while business analysts spend their time in dashboarding tools. A modern catalog should adapt to diverse people’s data needs and workflows, rather than expecting them to adapt to it.
Lastly, metadata itself is evolving into big data, becoming the foundation for the future of AI and LLM applications. Metadata will need to live in a metadata lake or lakehouse to power trust and context at scale. This involves collecting and ingesting metadata like normal, but then going further by using it to drive automated downstream actions — e.g. automatically deleting data based on policies or diagnosing data issues via root cause analysis. (Chad Sanderson talked recently about the importance of moving from collecting to automating metadata.)
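As a simplified sketch of what “automating” rather than just “collecting” metadata might look like, here's a small policy check that scans catalog entries for a retention tag and flags expired assets for deletion. The metadata records and the delete_asset() hook are hypothetical stand-ins for whatever your metadata store and storage layer actually expose.

```python
from datetime import datetime, timedelta, timezone

# Simplified sketch of automating metadata: scan catalog entries for a
# retention tag and act on assets that have outlived their policy.
# The records and the delete_asset() hook are hypothetical.
catalog_entries = [
    {"name": "raw.web_events", "tags": {"retention_days": 30},
     "created_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
    {"name": "finance.revenue_daily", "tags": {},  # no retention policy
     "created_at": datetime(2024, 1, 15, tzinfo=timezone.utc)},
]

def delete_asset(name: str) -> None:
    # In a real system this would call the storage or compute layer's API.
    print(f"deleting {name} per retention policy")

now = datetime.now(timezone.utc)
for entry in catalog_entries:
    retention = entry["tags"].get("retention_days")
    if retention is not None and entry["created_at"] < now - timedelta(days=retention):
        delete_asset(entry["name"])
```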
Whether they’re technical or universal, even the best data catalogs just can’t keep up with this diversity of tools, users, and use cases. You’ll still end up with confusion and silos, with marketing and legal and sales and engineering all working in their own specialized tools.
Instead, we need something beyond a data catalog. We need a neutral entity, a Switzerland of the data stack, to manage context, governance, compliance, and other metadata-based challenges for diverse tools and users.
And this isn’t just me saying it. Stay tuned — there’s some big news coming next week, and I can’t wait to share it right here!
📚 More from my reading list
Who to follow in AI in 2024 by Michael Spencer
How top data teams are structured by Mikkel Dengsøe
The data professional’s cheat sheet for working with stakeholders by Jerrie Kumalah
The three biggest data problems companies face by Dylan Anderson
GPT-5: everything you need to know by Alberto Romero
Interpretable machine learning: a guide for making black box models explainable by Christoph Molnar
What advanced analytics teams are doing that you aren’t by Duncan Gilchrist and Jeremy Hermann
Beware! This SQL mistake fools even experienced data scientists by Khouloud El Alami
Data storytelling: influencing your way to business impact with Stefania Gvillo on the Driven by Data podcast
Top links from last week:
The rise of AI data infrastructure by Astasia Myers and Eric Flaningam
WTF is a “data team”? by Joe Reis
The data engineer's guide to mastering systems design by Yordan Ivanov