Metadata for the modern data stack, data-centered culture, standardizing tooling, and more

✨ Spotlight: Modern Metadata for the Modern Data Stack

Dec 21, 2021

Welcome to this week's edition of the ✨ Metadata Weekly ✨ newsletter.

Every week I bring you my recommended reads and share my (meta?) thoughts on everything around metadata.

The holiday season is here! I am excited to see what next year brings for metadata management and its position in the modern data stack. As we wrap this year, I’d love to hear your views and predictions on the data and analytics trends you foresee in 2022.

✨ Spotlight: Modern Metadata for the Modern Data Stack

A few years ago, data was primarily consumed by the IT team in an organization. Today, data teams are more diverse than ever: data engineers, analysts, analytics engineers, data scientists, product managers, business analysts, citizen data scientists, and more. Each of these people has their own favorite and equally diverse data tools, everything from SQL, Looker, and Jupyter to Python, Tableau, dbt, and R.

The result is often chaos in collaboration. Frustrating questions like “What does this column name actually mean?” and “Why are the sales numbers on the dashboard wrong again?” slow even speedy teams to a crawl when they need to use data.

Just like data, how we think about and work with metadata has steadily evolved over the past three decades.

Last year, I spoke to over 350 data leaders to understand their fundamental challenges with existing metadata management solutions and construct a vision for modern metadata management. I like to call this approach “Data Catalog 3.0”.

Data Catalog 3.0s will not look and feel like their predecessors in the Data Catalog 2.0 generation. Instead, Data Catalog 3.0s will be built on the premise of embedded collaboration that is key in today’s modern workplace, borrowing principles from GitHub, Figma, Slack, Notion, Superhuman, and other modern tools.

Modern data catalogs are designed around four key characteristics:

1. Data assets > tables: The 3.0 generation of metadata management will need to be flexible enough to intelligently store and link all the different types of data assets, not just tables, in one place.

2. End-to-end data visibility, rather than piecemeal solutions: The Data Catalog 3.0 will help teams finally achieve the holy grail, a single source of truth about every data asset in the organization.

3. Built for a world where metadata itself is “big data”: Data Catalog 3.0 should be more than just metadata storage. It should fundamentally leverage metadata as a form of data that can be searched, analyzed, and maintained in the same way as all other types of data.

4. Embedded collaboration comes of age: Because of the fundamental diversity in data teams, data tools need to be designed to integrate seamlessly with teams’ daily workflow.
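
To make the first three characteristics a little more concrete, here is a minimal sketch of a catalog that stores different kinds of data assets in one place, links them, and treats their metadata as searchable data. Everything in it (the `Asset` and `Catalog` classes, the field names, the sample assets) is hypothetical and for illustration only. It is not how any particular product works.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One catalog entry. `kind` is free-form: table, dashboard, notebook, etc."""
    name: str
    kind: str
    description: str = ""
    # Names of the assets this one reads from (hypothetical lineage links).
    upstream: list[str] = field(default_factory=list)


class Catalog:
    """A toy in-memory store; a real catalog would sit on a database or search index."""

    def __init__(self) -> None:
        self._assets: dict[str, Asset] = {}

    def add(self, asset: Asset) -> None:
        self._assets[asset.name] = asset

    def search(self, term: str) -> list[str]:
        """Case-insensitive substring search over names and descriptions."""
        term = term.lower()
        return sorted(
            a.name
            for a in self._assets.values()
            if term in a.name.lower() or term in a.description.lower()
        )

    def lineage(self, name: str) -> list[str]:
        """Return every registered asset upstream of `name`, nearest first."""
        seen: list[str] = []
        queue = list(self._assets[name].upstream)
        while queue:
            current = queue.pop(0)
            # Skip assets already visited and links to unregistered assets.
            if current in seen or current not in self._assets:
                continue
            seen.append(current)
            queue.extend(self._assets[current].upstream)
        return seen


catalog = Catalog()
catalog.add(Asset("raw_orders", "table", "Orders loaded from the shop database"))
catalog.add(Asset("sales_model", "dbt model", "Daily sales revenue", ["raw_orders"]))
catalog.add(Asset("sales_dashboard", "dashboard", "Weekly sales numbers", ["sales_model"]))

print(catalog.search("sales"))
print(catalog.lineage("sales_dashboard"))
```

With links like these in place, a question such as “Why are the sales numbers on the dashboard wrong again?” can at least start from a list of the assets that feed the dashboard.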

In the next few years, we will see the rise of a modern metadata management product that takes its rightful place in the modern data stack. I write about the evolution of data catalogs and share some thoughts on the Data Catalog 3.0 era in this blog post.


❤️ Fave Links from This Week

Building a Data-Centered Culture at Ironclad by Jessica Cherny

“We found that establishing a data culture within each product and engineering team was the best way to start expanding our data culture. At Ironclad, we have about seven core product teams and we’re beginning to meet with each team in a monthly “data huddle” to discuss key questions and data needed to inform each team’s product feature value. Our data huddle template is provided below—Feel free to get started filling out your own template here!”

My take: I love Jessica’s post about how the Ironclad team is deliberately creating a data culture! She shared templates for how their team runs data huddles and scopes data projects using a scoping document.

I’ve come to believe that the next delta for data teams is going to come from investing in the modern data culture stack (the cultural rituals that help us diverse humans of data come together and collaborate effectively). It’s great to see more and more data leaders open up and share their best practices with the community.

How Standardized Tooling and Metadata Saved Our Data Organization by Duy Tran

“We have a diverse data ecosystem that requires us to support a wide variety of data pipelines and customizations. We focused on standardizing how data moves through our system, building a layer around all our transformations and how those transformations are orchestrated. We avoided being too prescriptive about the actual underlying technologies, letting authors continue to build Spark, SQL, or Pandas and store their data in different storage technologies.”

My take: In this post, Duy makes so many amazing points. The data team will likely ALWAYS be diverse: we will always use diverse tools, and there will always be a diversity of personas. This is why I love the approach that KeepTruckin has taken: rather than being prescriptive about the underlying technology (SQL vs. Spark), they are building “shipping standards” to standardize best practices around documentation, dependency lineage, ownership, and quality, and then unifying these via a common metadata standard, which I believe will be the active metadata layer in the modern data stack.
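
As a rough illustration of what a technology-agnostic shipping standard could look like, the sketch below checks that a pipeline declares documentation, ownership, lineage, and quality metadata, whatever engine it runs on. The field names and sample pipelines are my own assumptions for the example, not the actual standard described in Duy’s post.

```python
# Hypothetical required metadata fields, one per area of the standard:
# documentation, ownership, dependency lineage, and quality.
REQUIRED_FIELDS = ("description", "owner", "upstream", "quality_checks")


def missing_fields(metadata: dict) -> list[str]:
    """Return the required fields that are absent or empty in `metadata`."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name)]


# The check never inspects "engine": Spark, SQL, or Pandas are all fine.
spark_job = {
    "engine": "spark",
    "description": "Aggregates trip events per vehicle",
    "owner": "data-platform",
    "upstream": ["raw_trips"],
    "quality_checks": ["row_count > 0"],
}
sql_model = {
    "engine": "sql",
    "description": "Daily active users",
    "upstream": ["events"],
}

for name, meta in (("spark_job", spark_job), ("sql_model", sql_model)):
    gaps = missing_fields(meta)
    print(name, "ready to ship" if not gaps else f"missing: {gaps}")
```

The point of the design is that the standard constrains the metadata, not the tool: authors keep their engine of choice, and the shared fields are what a metadata layer can unify.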

👯 Tune In

🎧 Entrepreneur’s Handbook: I join Amardeep Parmar for a podcast to discuss values and how to establish them in your organization. Tune in here.

🎙 Analytics Engineering Podcast: Loved this final episode with Tristan Handy on one of the most loved podcasts in the analytics engineering community. Check it out here.

