Welcome to this week's edition of the ✨ Metadata Weekly ✨ newsletter.
Every week I bring you my recommended reads and share my (meta?) thoughts on everything metadata! ✨
If you’re new here, subscribe to the newsletter and get the latest from the world of metadata and the modern data stack.
The Great Data Debate: Unbundling or Bundling?
Okay, it's been a crazy week in Data Twitter with another hot new debate. As I lay tossing and turning in bed last night, thinking about the future of the modern data stack, I couldn't help but feel the pressure to write an opinion piece ;)
If you were MIA last week, Gorkem Yurtseven kickstarted this debate with an article on The Unbundling of Airflow.
“Before the fragmentation of the data stack, it wasn’t uncommon to create end-to-end pipelines with Airflow. Organizations used to build almost entire data workflows as custom scripts developed by in-house data engineers. Bigger companies even built their own frameworks inside Airflow, for example, frameworks with dbt-like functionality for SQL transformations in order to make it easier for data analysts to write these pipelines.”
Gorkem's words hit home. Back in the day, my team was one of those that built its own dbt-like functionality for transformations in R and Python. He makes an excellent point: Airflow's functions have been unbundled into purpose-built tools for ingestion (Airbyte, Fivetran), transformation (dbt), and reverse ETL (Hightouch and Census).
Unfortunately, this has led to a crazy amount of fragmentation. I joke about this a lot, but honestly, I feel TERRIBLE for someone buying data technology right now. The fragmentation and overlaps are mind-blowing; even an insider like me struggles to fully grasp them.
Nick Schrock from Dagster wrote a response titled "The Rebundling of the Data Platform", which broke the data community (again).
“I don’t think anyone believes that this is an ideal end state. The post itself advocates for consolidation. Having this many tools without a coherent, centralized control plane is lunacy, and a terrible endstate for data practitioners and their stakeholders.”
The problems that Nick pointed out are spot on. Ananth Packkildurai added, “MDS is a set of vendor tools that solve niche data problems (lineage, orchestration, quality) with the side effect of creating a disjointed data workflow that makes data folks' lives more complicated.”
So... what’s my take? Where are we headed? 🤔
There are two kinds of people in the data world: those who believe in bundling and those who think unbundling is the future. I believe the answer lies somewhere in the middle. Here are some of my predictions/takes:
1. There will absolutely be more bundling from our current version of the modern data stack.
The current version of the modern data stack, with a new company launching every 45 minutes, is unsustainable. We're absolutely in the middle of the golden era of innovation in the MDS, funded quite generously by Venture Capital $$, all in search of the next Snowflake. I've heard stories of perfectly happy (data) product managers in FAANG companies being handed millions of dollars to “try out any idea” they have in mind.
This euphoria has had big advantages. A ton of smart people are solving data teams’ biggest tooling challenges. Their work has made the modern data stack a thing. It has made the “data function” more mainstream. And, most importantly, it has spurred innovation.
But, honestly, this won’t last forever. The cash will dry up. Consolidation and M&A will happen. (We’ve already started seeing glimpses of this with dbt’s move into the metrics layer, and Hevo’s move to introduce reverse ETL along with their data ingestion product.) Most importantly, customers will start demanding less complexity as they make choices about their data stack. This is where bundling will start to win.
2. However, we will never (and shouldn't ever) have a fully bundled data stack. Diversity is always going to be a reality.
Believe it or not, the data world started off with the vision of a fully bundled data stack. A decade ago, companies like RJMetrics and Domo aimed to create their own holistic data platforms.
The challenge with a fully bundled stack is that resources are always limited and innovation stalls. This gap will create an opportunity for unbundling, and so I believe we’ll go through cycles of bundling and unbundling.
That being said, I believe the data space in particular has peculiarities that make it difficult for bundled platforms to truly win. My co-founder Varun and I spend a ton of time thinking about the DNA of companies. We think it's important — perhaps the single most important thing that defines who succeeds in a product category.
Let's look at the cloud battles. AWS, for example, has always been focused largely on scale, something it does a great job at. Azure, on the other hand, coming from Microsoft, has always had a more end-user-focused DNA stemming from its MS Office days. It's no surprise that AWS doesn't do as well as Azure at creating world-class, user-experience-focused applications, while Azure doesn't do as well as AWS at scaling technical workloads.
The only reality in the data world is diversity — data engineers, analysts, analytics engineers, data scientists, product managers, business analysts, citizen data scientists, and more. Each of these people has their own favorite and equally diverse data tools, everything from SQL, Looker, and Jupyter to Python, Tableau, dbt, and R. And data projects have their own technical requirements and peculiarities — some need real-time processing while others need speed for ad-hoc analysis, leading to a whole host of data infrastructure technologies (warehouses, lakehouses, and everything in between).
The DNA of the companies building technology for each of these personas and use cases is different. For example, a company building BI should be focused on the end-user experience, while a company building a data warehouse should be focused on reliability and scale.
This is why I believe that bundling is likely to happen in spaces where the fundamental DNA of successful companies is similar. For example, we will likely see data quality merge with data transformation, and potentially data ingestion merge with reverse ETL.
3. Metadata holds the key to unlocking harmony in a diverse data stack.
The only reality in the data stack is diversity (and change).
While we’ll see more consolidation, the fundamental diversity of data is never going away. There will always be use cases where Python is better than SQL, and real-time processing is better than batch (and vice versa).
If you understand this fundamental reality of what it means to be a data professional, then you stop searching for a future with a perfect “bundled data platform” and instead find ways for our unbundled data stack to work together, in perfect harmony.
Just because data is chaos doesn't mean our work needs to be.
We believe that the key to helping our data stack work together is in activating metadata. We’ve only scratched the surface of what metadata can do for us, but using metadata to its fullest potential can fundamentally change how our data systems operate.
Today, metadata is used for (relatively) simplistic use cases like data discovery and data catalogs. We take a bunch of metadata from a bunch of tools and put it into a tool we call the data catalog or the data governance tool! The problem with this approach is that it basically adds one more siloed tool to an already siloed data stack.
Instead, take a moment and imagine what the world could look like if you could have a Segment- or Zapier-like experience in the modern data stack, where metadata can create harmony across tools and power perfect experiences.
For example, one use case for metadata activation could be as simple as notifying downstream consumers of upstream changes.
A Zap-like workflow for this simple process could look like this (with a rough code sketch after the steps). 👇
When a data store changes:
Refresh metadata: Crawl the data store to retrieve its updated metadata.
Detect changes: Compare the new metadata against the previous metadata. Identify any changes that could cause an impact — adding or removing columns, for example.
Find dependencies: Use lineage to find users of the data store. These could include transformation processes, other data stores, BI dashboards, and so on.
Notify consumers: Notify each consumer through their preferred communication channel — Slack, Jira, etc.
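To make this concrete, here's a minimal Python sketch of that workflow. To be clear, it's illustrative, not any particular product's API: the column-to-type metadata snapshots, the pre-computed lineage map, and the print-based notifier (`detect_changes`, `find_dependencies`, `notify`) are all hypothetical stand-ins for whatever your catalog, lineage tool, and Slack/Jira integrations actually expose, and the "refresh metadata" crawl is abstracted into the old/new snapshots passed in.

```python
# A minimal, hypothetical sketch of the workflow above. The metadata shape
# (column -> type), the lineage map, and the notifier are all stand-ins for
# whatever your catalog, lineage tool, and chat integrations actually expose.
from dataclasses import dataclass


@dataclass
class Consumer:
    name: str
    channel: str  # the consumer's preferred channel, e.g. "slack" or "jira"


def detect_changes(old: dict, new: dict) -> list:
    """Diff two column->type snapshots and describe impactful changes."""
    changes = [f"column removed: {c}" for c in old.keys() - new.keys()]
    changes += [f"column added: {c}" for c in new.keys() - old.keys()]
    changes += [
        f"type changed: {c} ({old[c]} -> {new[c]})"
        for c in old.keys() & new.keys()
        if old[c] != new[c]
    ]
    return changes


def find_dependencies(table: str, lineage: dict) -> list:
    """Look up downstream consumers of a table in a pre-computed lineage map."""
    return lineage.get(table, [])


def notify(consumer: Consumer, changes: list) -> None:
    """Stand-in for a real Slack/Jira API call."""
    print(f"[{consumer.channel}] to {consumer.name}: {'; '.join(changes)}")


def on_data_store_change(table, old_meta, new_meta, lineage):
    """The refresh/crawl is assumed done upstream; diff, trace, and notify."""
    changes = detect_changes(old_meta, new_meta)
    for consumer in find_dependencies(table, lineage) if changes else []:
        notify(consumer, changes)


# Example: the "coupon" column was dropped and "amount" changed type.
lineage = {
    "orders": [
        Consumer("revenue_dashboard", "slack"),
        Consumer("dbt_orders_model", "jira"),
    ]
}
on_data_store_change(
    "orders",
    old_meta={"id": "int", "amount": "float", "coupon": "str"},
    new_meta={"id": "int", "amount": "decimal"},
    lineage=lineage,
)
```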
This workflow could also be incorporated into the testing phase of a data store change. For example, the CI/CD process that changes the data store could trigger the same checks, so that consumers are notified before production systems change. One way that could look:
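Here's a hedged sketch of that CI hook, reusing the hypothetical helpers from the snippet above: run the impact check against the proposed schema, send the heads-up, and fail the job if anything downstream would be affected.

```python
# Hypothetical CI/CD gate, reusing detect_changes/find_dependencies/notify
# from the sketch above: warn consumers and fail the job before deploying
# a breaking schema change to production.
import sys


def ci_impact_gate(table, current_meta, proposed_meta, lineage) -> int:
    changes = detect_changes(current_meta, proposed_meta)
    impacted = find_dependencies(table, lineage) if changes else []
    for consumer in impacted:
        notify(consumer, changes)  # heads-up before anything ships
    return 1 if impacted else 0   # non-zero exit code fails the pipeline


if __name__ == "__main__":
    sys.exit(
        ci_impact_gate(
            "orders",
            current_meta={"id": "int", "amount": "float"},
            proposed_meta={"id": "int", "amount": "decimal"},
            lineage=lineage,
        )
    )
```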
In Stephen’s words, “No one knows what the data stack will look like in ten years, but I can guarantee you this: metadata will be the glue.”
📘 More from My Reading List
Is Snowflake a database? by Natty (Jonathan Natkins)
Danger Zone: Inconsistent Metrics at Work by Lauri Hänninen
Launching and Scaling Data Science Teams: Three Years Later by Ian Macomber
Data, engineers, and designers: How the US compares to Europe by Mikkel Dengsøe
Why Data Engineers Must Have Domain Knowledge — And How To Gain It by Zach Quinn
Data as a Code – Cutting Things Smaller by Sven Balnojan
🗓️ Upcoming Events
✨ Subsurface: The Cloud Data Lake Conference by Dremio on 2-3 March 2022. I am super excited to hear talks from Olya Tanner (Census), Ryan Blue (Tabular), Jacek Soubusta and Martin Svadlenka (GoodData), Philip Portnoy (Wayfair), and the Founder Panel! Check out the agenda.
P.S. I’m also excited to share our learnings from leveraging DataOps principles to build India’s national data platform.
💫 Data Council on 23-24 March 2022 in Austin. I am super excited about the line-up, and to finally be able to attend an in-person conference and meet fellow humans of data! The agenda also looks great.
If you haven’t yet, bookmark my data stack reading here.
I'll see you next week with more interesting updates from the modern data stack! Meanwhile, you can subscribe to the newsletter on Substack and connect with me on LinkedIn here.