The rise of the metadata lake, the data mesh, data assets as data products, and more
Welcome to this week's edition of the ✨ Metadata Weekly ✨ newsletter.
Every week I bring you my recommended reads and share my (meta?) thoughts on everything around metadata.
In this edition I cover a trend I am excited about in 2022: the rise of the metadata lake. You will also find my take on some interesting articles from around the internet, a list of some dream data roles, and lastly, a data visualization from the New York Times that created quite a debate on Data Twitter. Hope you enjoy these links and my weekly musings below. 👇
✨ Spotlight: The Rise of the Metadata Lake
“What does this column name mean?”
“Can I trust this data asset? Where does it come from?”
“Arrgh… where can I find the latest cleaned dataset for our customer master?”
This is the everyday chaos that data teams deal with. In the past 5 years, as the modern data stack has matured and become mainstream, we’ve taken great leaps forward in data infrastructure. However, the modern data stack still has one key missing component: context. That’s where metadata comes in.
Today, metadata is everywhere. Every component of the modern data stack and every user interaction on it generates metadata. Apart from traditional forms like technical metadata (e.g. schemas) and business metadata (e.g. taxonomy, glossary), our data systems now create entirely new forms of metadata.
And these new forms of metadata are being created by living data systems, sometimes in real time. This has led to an explosion in the size and scale of metadata.
This is where I believe there is the need for a metadata lake: a unified repository to store all kinds of metadata, in raw and further processed forms, which can be used to drive both the use cases we know of today and those of tomorrow.
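To make that concrete, here's an illustrative (entirely made-up) example of the same metadata living in the lake in both forms: a raw query-log event lands as-is, and a downstream job aggregates such events into a popularity signal that discovery tools can rank on.

```python
# Illustrative only: the field names and values are assumptions, not a real schema.

# Raw metadata: a single query-log event, stored as it arrives
raw_event = {
    "type": "query_log",
    "table": "warehouse.customers.customer_master",
    "user": "analyst_42",
    "timestamp": "2022-01-10T09:14:03Z",
}

# Processed metadata: the same events aggregated into a reusable signal
processed_signal = {
    "table": "warehouse.customers.customer_master",
    "queries_last_30d": 1873,        # rolled up from raw query_log events
    "distinct_users_last_30d": 96,   # feeds search ranking and popularity badges
}
```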
“The problem is that, in the world of big data, we don’t really know what value the data has… We might know some questions we want to answer, but not to the extent that it makes sense to close off the ability to answer questions that materialize later.” – Dan Woods in Forbes, 2011
There will always be countless tools and tech in a team’s data infrastructure. By effectively collecting metadata, a team can finally unify context about all their tools, processes, and data.
Here are the three characteristics of a metadata lake:
1. Open APIs and interfaces
The metadata lake needs to be easily accessible, not just as a data store but via open APIs. This makes it incredibly easy to draw on the “single source of truth” at every stage of the modern data stack.
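As a sketch of what that could look like in practice, any tool in the stack could pull context over plain HTTP. The endpoint and response fields below are hypothetical, not any real product's API:

```python
# A minimal sketch of "drawing on the single source of truth" via an open API.
# The URL and response fields are illustrative assumptions.
import requests

METADATA_LAKE_URL = "https://metadata-lake.example.com/api/v1"  # hypothetical

def get_asset_context(asset_id: str) -> dict:
    """Fetch description, owner, and freshness for a data asset."""
    response = requests.get(f"{METADATA_LAKE_URL}/assets/{asset_id}")
    response.raise_for_status()
    return response.json()

# Any tool -- a BI dashboard, an orchestrator, a CLI -- can call the same API:
context = get_asset_context("warehouse.customers.customer_master")
print(context["description"], context["owner"], context["last_updated"])
```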
2. Powered by a knowledge graph
Metadata’s true potential is unlocked when all the connections between data assets come alive. For example, if one column is tagged as “confidential”, this metadata can be used along with lineage relationships to tag all the other columns derived from that particular column as confidential.
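Here's a toy sketch of that propagation logic. The lineage map and column names are made up, and a real platform would do this over a proper graph store, but the idea is a breadth-first walk down lineage edges that carries the tag to every derived column:

```python
# Toy illustration of lineage-based tag propagation (not a real product's code).
from collections import deque

# Hypothetical lineage edges: source column -> columns derived from it
LINEAGE = {
    "raw.users.ssn": ["staging.users.ssn_masked"],
    "staging.users.ssn_masked": ["marts.customer_360.ssn_masked"],
}

def propagate_tag(root: str, tag: str, lineage: dict) -> dict:
    """Walk downstream lineage breadth-first, tagging every derived column."""
    tags = {root: {tag}}
    queue = deque([root])
    while queue:
        column = queue.popleft()
        for downstream in lineage.get(column, []):
            if tag not in tags.setdefault(downstream, set()):
                tags[downstream].add(tag)
                queue.append(downstream)
    return tags

print(propagate_tag("raw.users.ssn", "confidential", LINEAGE))
# every downstream column now carries the "confidential" tag
```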
3. Powers both humans and machines
The metadata lake can be used to empower both humans (e.g. helping them discover data and understand its context) and machines or tools (e.g. auto-tuning data pipelines; see the sketch below). This flexibility needs to be reflected in the fundamental architecture.
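For the machine side, here's a hedged sketch of what a metadata-driven pipeline decision might look like. The statistics schema and thresholds are illustrative assumptions, not a real system's contract:

```python
# A "machine" consumer of the metadata lake: a pipeline that tunes its own
# refresh strategy from table statistics. Field names and thresholds are
# made up for illustration.
def choose_refresh_strategy(stats: dict) -> str:
    """Pick a refresh strategy from size and usage metadata."""
    if stats["row_count"] > 100_000_000:
        return "incremental"    # too big to rebuild on every run
    if stats["daily_queries"] < 5:
        return "on_demand"      # rarely queried, refresh lazily
    return "full_refresh"       # small and hot: rebuilding is cheap

stats = {"row_count": 250_000_000, "daily_queries": 1200}  # fetched from the lake
print(choose_refresh_strategy(stats))  # -> incremental
```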
I believe the metadata lake will become the cornerstone of the next wave of innovation in the data management space. It wouldn’t be surprising to see the metadata lake power entire categories of companies that add a layer of data science and analytics on top of metadata.
Read more about the rise of the metadata lake and the anatomy of an active metadata platform.
❤️ Fave Links from This Week
This might be the most honest, no-BS post I’ve read about the data mesh, ever! Every single thing that Thinh talks about rings true. Here are my highlights:
The data mesh is not for everyone — decentralised structures only help create agility if you are of a certain size and scale.
No single technology or tool can help you adopt the data mesh. The data mesh is a cultural and mindset shift, not just a set of technology tools.
And my fave: embedding metadata quality and governance in the team’s daily workflows is the only way you can enable the data mesh.
“Like Security & Privacy, Data Governance must “shift-left” to become a part of daily work for every data team. As such, Data Governance concerns such as enhancing Data and Metadata quality needs to be prioritised in every data team’s backlog. To embed Data Governance into standard development processes, these activities can be raised as tickets directly to the relevant data team’s backlog or automated as tests that every code change must pass in order to be integrated and deployed to production.
To drive automation, standardisation, and best practices, you may need to establish specialist engineering teams who can develop tooling/processes and provide specialist advisory to help distributed teams meet Data Governance policies and standards more easily.”
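To make the “automated as tests” idea concrete, here's a minimal sketch of a governance check that blocks undocumented columns from shipping. It uses pytest, and get_model_columns is a hypothetical stand-in for reading column metadata from your catalog, dbt manifest, or metadata lake:

```python
# A sketch of a shift-left governance check that runs in CI on every change.
# get_model_columns is a hypothetical stand-in, not a real library call.
def get_model_columns(model: str) -> list[dict]:
    """Stand-in for reading column metadata from your catalog or manifest."""
    return [
        {"name": "customer_id", "description": "Primary key for customers"},
        {"name": "ltv", "description": ""},  # this one would fail the check
    ]

def test_every_column_is_documented():
    for column in get_model_columns("marts.customer_360"):
        assert column["description"].strip(), (
            f"Column '{column['name']}' is missing a description"
        )
```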
When Should a Data Asset Become a Data Product? by Eric Weber
“What does the data asset do? This might seem like a stupid question. “It’s obvious what a metric does!” “How could you not see what a dashboard is supposed to do?” But I don’t think it is actually that obvious. Asking what a data asset does means that you need to understand what problem it addresses. Someone took the time to create the asset. Were they solving a problem that only existed in that moment? Is the “thing” the data asset does a capability that needs to live on for the organization? Understanding what the data asset does exposes if the need for that capability is recurring - one of the most important characteristics of the need for a data product.”
My take: I’ve been very bullish about the idea of data products and moving from a data service to a data product mindset, but I love the point that Eric brings up here: not everything is meant to be a data product! So how do you assess this when you kick off a project? I love using the analogy of a traditional service company vs. a product company. Service companies start with a single problem from a single client, whereas product companies build one solution that can be reused by multiple customers or users with a similar problem.
The starting point of building a product company is getting to “product-market fit”. This starts with customer and user discovery to figure out a recurring problem that multiple users have, which warrants the creation of a “product”. So, when deciding whether to build a data asset as a product, ask yourself: “Is this a recurring problem that multiple users in my company have? Would there be a ‘market’ for my data product?” If the answer is yes, then a data product might be the right route!
🐦 On Twitter
My hot take: Most data visualizations are better off as tables (with red-green shading), line charts, bar charts, or pie charts :)
Data Twitter was really torn about this one. There are some good points about how this visualization is actually great because it shows off interesting dimensions using radial imagery. But here’s why I disagree: this visual, while “cool”, is honestly really hard to understand. If it took me (and I think I’m fairly data literate 🙂) a couple of minutes to comprehend, someone with no data background will have an even harder time!
I firmly believe the purpose of a data visualization or insights report is to make it SUPER SIMPLE for a user to understand the data. In this case, I’d have preferred a few key insightful headlines with supporting line or bar charts instead of trying to pack a bunch of insights into one visualization (albeit a very pretty one).
The data job market this year has started with a bang! 🥁 Check out this list of open roles with teams using the modern data stack, curated by my partner in crime Surendran:
Adam from Netlify is hiring a Director of Data & Insights (remote) who will manage, mentor, and grow a diverse, international team across 5+ time zones and make the Data & Insights organization more impactful.
Jeff and the data team at Wahoo are hiring a Data Engineer (remote in US time zones) who will help develop and maintain scalable ETL/ELT data pipelines.
Alexander and the Epidemic Sound team are looking for an Analytics Engineer (based in Stockholm, Sweden) who will help strengthen the company’s data pipelines.
You can check out the complete list of jobs featured in the Modern Data Jobs newsletter here.
I'll see you next week with more interesting stuff around the modern data stack. Meanwhile, you can subscribe to the newsletter on Substack and connect with me on LinkedIn here.