Special edition: 19 gotchas to look out for when evaluating data lineage
💥 The ultimate guide to lineage: free ebook, our favorite links, and more
Looks familiar? We just sent a version of this out, but instead of linking you to our brand new ebook, we accidentally linked it to a random page on our website. Apparently, our brains took off today for an early holiday. 😅
Sorry for the double email — this one is the real deal.
Lineage, lineage, lineage... 😅
The holy grail is end-to-end lineage that connects your source systems (databases, SaaS tools, etc.) and maps data flows to your final usage layer, BI tools. Achieve this and you’ll finally know how data flows across your ecosystem — one of the most impactful tools a data team can have for problems like impact analysis, data observability, root cause analysis, and cost optimization.
Yet, in my opinion, creating data lineage is one of the most complex implementation problems in the data stack.
Why? Because data lineage is an edge-case problem.
Fivetran wrote an extremely nuanced blog in 2021 titled “How we built the most reliable data pipeline ever”.
“Here’s the central lesson we’ve learned: You can’t build a data replication solution once and expect it to work reliably forever, because source systems are too complex. APIs break or work unexpectedly, and there are so many edge cases that only time — and a lot of customers — can help you find and address them. On top of that, source systems continuously evolve and challenge us to adjust to those changes to make sure the replication works quickly and reliably.”
Data lineage suffers from the same challenges as data replication — complex source systems, APIs that break or work unexpectedly, warehouses with slight variations in SQL logic, and BI tools that model data assets differently (think LookML). Not to mention the edge cases in how every team writes code or models their data pipelines.
We’ve found that for lineage, the devil is in the details. It’s not enough to get a “lineage tool”. Do you need…
column-level lineage or table-level lineage?
cross-system lineage or lineage that’s isolated in the data warehouse?
SQL parsing for generating lineage?
parser support for MERGE, INSERT INTO, and UPDATE statements, in addition to the usual CREATE statements?
On this holiday week, we’re devoting the issue to the thorny, complex challenge of lineage. Keep reading for our brand new lineage ebook, lots of links, and a lineage-driven “Metadata in Action” video.
✨ Spotlight: 19 questions and gotchas to look for when evaluating lineage
For the past few months, Mark Pavletich and Swami Kumar from our team have worked together to review data from our work with hundreds of data teams. They’ve identified the 19 questions and gotchas that anybody evaluating data lineage should know, bundled in what is IMHO the most comprehensive guide to evaluating data lineage.
Keep reading for a snippet of 5 of those questions and a link to the full ebook. 👇
1. Which types of SQL statements are supported?
Most lineage tools include automated SQL parsing, which ensures that your lineage graph includes data from systems without a lineage API.
Most SQL parsers support SQL CREATE and, in some cases, MERGE statements. However, many don’t support INSERT INTO and UPDATE statements. These account for most transformations in data warehouses, so they are important for full lineage coverage.
Look for lineage tools that can also parse MERGE, INSERT INTO, and UPDATE statements.
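One rough way to test this during an evaluation is to take a sample of your own warehouse queries and count how many fall outside the statement types a tool says it parses. The Python sketch below is only an illustration: the regular expressions, the function names (classify_write, coverage_gaps), and the table names are all hypothetical, and a real lineage parser would rely on a full SQL grammar for each warehouse dialect rather than regexes.

```python
import re

# Hypothetical, simplified patterns: each maps a statement type to a regex
# that captures the table being written to. This is NOT a real SQL parser.
TARGET_PATTERNS = {
    "CREATE": re.compile(
        r"^\s*CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW)\s+([\w.]+)", re.I
    ),
    "INSERT INTO": re.compile(r"^\s*INSERT\s+INTO\s+([\w.]+)", re.I),
    "UPDATE": re.compile(r"^\s*UPDATE\s+([\w.]+)", re.I),
    "MERGE": re.compile(r"^\s*MERGE\s+INTO\s+([\w.]+)", re.I),
}


def classify_write(sql):
    """Return (statement_type, target_table), or None if no pattern matches."""
    for statement_type, pattern in TARGET_PATTERNS.items():
        match = pattern.match(sql)
        if match:
            return statement_type, match.group(1)
    return None


def coverage_gaps(statements, supported_types):
    """List (type, target) pairs for statements a tool would not parse,
    given the set of statement types the tool claims to support."""
    gaps = []
    for sql in statements:
        result = classify_write(sql)
        if result and result[0] not in supported_types:
            gaps.append(result)
    return gaps


# Made-up sample queries for illustration.
sample = [
    "CREATE TABLE analytics.orders AS SELECT * FROM raw.orders",
    "INSERT INTO analytics.orders SELECT * FROM staging.orders",
    "UPDATE analytics.orders SET status = 'shipped' WHERE shipped_at IS NOT NULL",
    "MERGE INTO analytics.customers t USING staging.customers s ON t.id = s.id "
    "WHEN MATCHED THEN UPDATE SET name = s.name",
]

# A tool that only handles CREATE and MERGE would miss two of the four writes.
missed = coverage_gaps(sample, {"CREATE", "MERGE"})
```

Every entry in missed is a write to a table that would never show up as an edge in the lineage graph.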
2. Does it offer lineage down to the column level?
Table-level lineage is considered “table stakes”, but column-level lineage should be too. It’s crucial for a range of use cases:
Tracing sensitive data classifications for transformed PII data
Impact analysis from things like schema changes
Root cause analysis — e.g. investigating why a dashboard looks off by tracing a BI field to upstream columns in the data warehouse
Without the ability to dive into granular column or field lineage, data engineers and analysts may miss key depth during their investigations.
Look for a native column-level experience in the UI, including viewing graph linkages at the column level.
3. Does it support field-level lineage for BI dashboards?
Anyone doing root cause analysis needs to dive into an incorrect field (i.e. dimension, measure, calculated field, etc.) in the dashboard, and work backward to zero in on the upstream fields or columns that are broken. This is only possible with field-level lineage for the BI tool.
Field-level lineage is also important for impact analysis. If a data engineer is trying to make a schema change, they need to understand the specific downstream columns and fields that will be affected — not just which dashboards will be affected in some unspecified way.
Some platforms support lineage for a few fields but don’t go deep with BI fields that are crucial for this type of analysis.
Look for two key features:
Coverage of both column-level lineage for SQL sources and BI field-level lineage.
Which BI objects are supported and exposed in the lineage for your BI tool. (E.g. in Looker, will lineage cover all the fields/objects you care about, such as Dashboards, Looks, Explores, Tiles, Fields, and Views?)
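To make the root cause and impact analysis cases concrete, here is a minimal Python sketch of what column-level and field-level lineage lets you compute. Everything in it is hypothetical: the EDGES graph, the column and field names, and the helper functions are made up for illustration, and real lineage tools store and query this graph in their own ways.

```python
from collections import deque

# Hypothetical column- and field-level edges: upstream -> list of downstream.
EDGES = {
    "raw.orders.amount": ["analytics.orders.amount_usd"],
    "raw.orders.currency": ["analytics.orders.amount_usd"],
    "analytics.orders.amount_usd": ["looker.revenue_dashboard.total_revenue"],
    "raw.orders.created_at": ["looker.revenue_dashboard.order_date"],
}


def reachable(start, edges):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def invert(edges):
    """Flip edge direction so the same walk can trace upstream."""
    inverted = {}
    for upstream, downstreams in edges.items():
        for downstream in downstreams:
            inverted.setdefault(downstream, []).append(upstream)
    return inverted


def downstream_impact(column):
    """Impact analysis: which columns and BI fields depend on this column?"""
    return sorted(reachable(column, EDGES))


def upstream_sources(field):
    """Root cause analysis: which columns feed this BI field?"""
    return sorted(reachable(field, invert(EDGES)))
```

With table-level lineage alone, a change to raw.orders.currency would flag the whole revenue dashboard. With the column-level graph, downstream_impact("raw.orders.currency") narrows it to the total_revenue field and leaves order_date out.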
4. Does it incorporate other types of metadata to give additional context for assets in the lineage graph?
In isolation, lineage only tells part of the story and, therefore, only provides part of the value. Lineage becomes actionable when it’s combined with key metadata and context:
Operational metadata: How and when were assets orchestrated?
Quality and anomaly metadata: What state are the assets in? Are they reliable?
Business/semantic metadata: How do the assets link to key business terms or KPIs?
Owner and expert metadata: Who should you contact or collaborate with during troubleshooting?
Social metadata: What is the human context for this asset — e.g. relevant Slack discussions or Jira tickets about the asset? This is what machines alone will miss.
Tools often provide lineage graphs as a siloed view. Without the other metadata for these assets, it can be hard to put lineage in context.
Look for three key features:
Openness: An “open by design”, extensible platform where you can harvest data and metadata from any source via APIs (including custom-built connectors).
Flexibility: Support for a wide range of technical, operational, anomaly/quality, and business/semantic metadata from these sources.
Personalization: A personalized data experience, where each persona sees the metadata that is right for them, rather than drowning in all the metadata.
5. Can it be used not just to investigate issues, but also to drive action programmatically?
In addition to enabling data people’s work, lineage can also enable automated system actions and workflows.
For example, if an upstream table has data quality issues, it’s important to automatically add announcements to downstream BI dashboards. This keeps business users from creating “Garbage In, Garbage Out” analysis, and saves data analysts and engineers from manually sending alerts or warnings.
Some platforms don’t have the underlying architecture and scalability to perform automated actions based on lineage.
Look for open APIs, the ability to build or customize automated workflows, and the ability to read metadata-change events and trigger changes in linked assets across the lineage graph.
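As a sketch of what this kind of automation could look like, the Python example below reacts to a metadata-change event by walking the lineage graph and attaching an announcement to every downstream dashboard. All of it is assumed for illustration: the event shape, the LINEAGE and ASSET_TYPES dictionaries, and the handle_event function are hypothetical and do not represent any specific platform's API.

```python
# Hypothetical asset-level lineage (upstream -> downstream) and asset types.
LINEAGE = {
    "warehouse.orders": ["warehouse.revenue_daily"],
    "warehouse.revenue_daily": ["bi.revenue_dashboard", "bi.finance_dashboard"],
}
ASSET_TYPES = {
    "warehouse.orders": "table",
    "warehouse.revenue_daily": "table",
    "bi.revenue_dashboard": "dashboard",
    "bi.finance_dashboard": "dashboard",
}


def downstream_assets(asset, lineage):
    """Collect every asset downstream of the given one, at any depth."""
    found, stack = [], [asset]
    while stack:
        current = stack.pop()
        for child in lineage.get(current, []):
            if child not in found:
                found.append(child)
                stack.append(child)
    return found


def handle_event(event, announcements):
    """On a failed quality check, add a warning to each downstream dashboard.

    `announcements` is a plain dict standing in for wherever a real platform
    would store asset-level announcements. Returns the dashboards warned.
    """
    if event.get("type") != "quality_check_failed":
        return []
    warned = []
    for asset in downstream_assets(event["asset"], LINEAGE):
        if ASSET_TYPES.get(asset) == "dashboard":
            announcements[asset] = (
                f"Upstream issue in {event['asset']}: {event['detail']}"
            )
            warned.append(asset)
    return sorted(warned)


announcements = {}
warned = handle_event(
    {"type": "quality_check_failed", "asset": "warehouse.orders", "detail": "null rate spike"},
    announcements,
)
```

The point of the sketch is the shape of the workflow: an event comes in, lineage decides who is affected, and the warning lands on the dashboards without anyone sending a manual alert.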
Read the full ebook with all 19 questions and lots more detail.
🌴 Metadata in action: Using data lineage to drive root cause analysis
ICYMI: Last week, we introduced a new section to Metadata Weekly — metadata in action, where we highlight real use cases of active metadata.
As anyone who works with data knows, responding to “That number doesn’t look right” is far from easy. While I guess you could do root cause analysis without lineage, you don’t want to! Lineage lets you zoom into all the key data, context, and changes across a diverse set of tools and systems. One company reduced their six-hour RCA process to just 10 minutes with Atlan’s lineage. 🤯
For those who missed this video from last week, learn how great data lineage is the key to faster, easier root cause analysis. Stay tuned next week for a brand new Metadata in Action video!
📚 More from my reading list
Learn more about data lineage with our favorite recent lineage links:
The many layers of data lineage by Borja Vazquez
Untapped potential of data lineage by Petr Janda
Building and scaling data lineage at Netflix to improve data infrastructure reliability, and efficiency by Di Lin, Girish Lingappa, and Jitender Aswani
Data lineage, the lost child of data science by Bernard Willer
Data lineage: State-of-the-art and implementation challenges by Dion Ricky
Wishing all of you a week full of happiness and all the pie you can eat 💙
Happy Thanksgiving, and see you next week!
P.S. Liked reading this edition of the newsletter? I would love it if you could take a moment and share it with your friends on social.