3-step framework for scaling data quality in the age of generative AI
Applying what we’ve learned from healthcare to data quality
I’ve found that data quality isn’t really about cleanliness or completeness or accuracy. Instead, it’s about trust.
A recent survey showed that, though nearly every data team is diving headfirst into AI applications, 68% of companies aren't confident in the data behind them.
Imagine this: someone looks at a dashboard and says, "That number doesn’t look right." Diagnosing it is a huge challenge. Are they even right? If so, where’s the problem coming from? It could be that the pipeline didn’t run, a data quality check failed, or the meaning of a metric changed and data consumers weren’t informed. Hours later, the company has lost trust in its own data and data team.
This is the data trust gap, which I’ve written about before. It stems from a disconnect between data producers, who aim to create high-quality data and data products, and data consumers, who care less about quality for its own sake and more about whether those products are actually usable. Between these groups lies an ever-growing mess of diverse data tools, people, and information.
The data trust problem is only intensifying today. In the age of generative AI — where algorithms not only interpret but create data — trust is the foundation for every data product. If a human sees a weird number, they can stop and investigate. But an AI will just use that number, often for critical business decisions, without hesitation.
So what does it mean to “fix” data quality and build great data products in the age of AI? I think it comes down to shared culture, context, and collaboration. Let’s dive into why in today’s issue of Metadata Weekly.
🚀 3-step framework for scaling data quality in the age of generative AI
Just like maintaining your personal health, improving data quality involves three key steps: awareness, cure, and prevention.
1. Awareness: Is our data high-quality now?
When we talk about awareness in the context of data quality, we're really discussing the need to understand our current baseline. Where does our data stand right now? Are there any glaring issues we need to address? This involves pulling in context about what’s happening, detecting anomalies, and keeping users informed — for instance, notifying them if a pipeline didn’t run.
Improving awareness means making information from the world of data producers accessible and understandable for data consumers — ideally managed by someone who understands both the technical and human side of data. It's about breaking down silos and ensuring everyone is on the same page. For example, this could involve pushing alerts directly into a BI tool or Slack channel, or using common color schemas like green, yellow, and red.
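As a rough sketch of what this could look like in practice, here is one way to map table freshness onto a green/yellow/red status and push the result into Slack. Everything here is an assumption for illustration: the table name, the 24-hour threshold, and the webhook URL are placeholders, and it assumes the `requests` package is available.

```python
import datetime

import requests  # assumes the `requests` package is installed

# Hypothetical Slack incoming-webhook URL for the channel data consumers already watch.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/your-webhook-here"

def freshness_status(last_loaded_at: datetime.datetime, max_age_hours: int = 24) -> str:
    """Map table staleness onto the green/yellow/red schema consumers recognize."""
    age = datetime.datetime.utcnow() - last_loaded_at
    if age <= datetime.timedelta(hours=max_age_hours):
        return "green"
    if age <= datetime.timedelta(hours=2 * max_age_hours):
        return "yellow"
    return "red"

def notify_consumers(table: str, status: str) -> None:
    """Push a plain-language alert where consumers will actually see it."""
    if status != "green":
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f":warning: `{table}` is {status}: the pipeline may not have run."},
            timeout=10,
        )
```

The point isn’t the specific thresholds; it’s that the signal travels from the producers’ world into the tools consumers already live in, in language they understand.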
All of this context could even be used to create a data product score, which measures the quality, usability, and trustworthiness of data. I’ve actually been surprised by how quickly this idea has gained traction and adoption among our customers.
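To make that concrete, here is a minimal sketch of how such a score could be rolled up, assuming you already collect a handful of per-asset signals. The signal names and weights below are illustrative, not a standard.

```python
# Hypothetical signals collected for a single data product, each normalized to 0-1.
signals = {
    "freshness": 1.0,       # pipeline ran within its agreed window
    "quality_checks": 0.9,  # share of quality checks passing
    "documentation": 0.6,   # description, owner, and glossary terms present
    "usage": 0.8,           # queried or viewed recently by real consumers
}

# Illustrative weights; in practice these would be agreed with stakeholders.
weights = {"freshness": 0.35, "quality_checks": 0.35, "documentation": 0.15, "usage": 0.15}

data_product_score = sum(signals[k] * weights[k] for k in signals)
print(f"Data product score: {data_product_score:.2f}")  # 0.88 for the values above
```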
AI tip: With AI, I’ve seen a growing importance of business terms and semantic context. Generative AI models need to grasp not just the data's cleanliness but also its meaning, relevance, and nuances within that specific business and industry. Data leaders should collaborate with business stakeholders to develop and maintain a comprehensive glossary of business terms. This collaborative effort will ensure that AI models are trained to interpret data in context, producing more accurate and actionable insights.
2. Cure: How can we make our data high-quality?
This step addresses the most broken flow in data management today.
Most teams today, such as sales or marketing, are fairly homogeneous: everyone on the team is likely to have a similar skill set and background. Meanwhile, data teams are incredibly diverse and involve people across different verticals and skill sets — data scientists, engineers, product managers, business analysts, stewards, and more.
This diversity is why solving data quality isn't just a technical problem — it's a collaboration problem. Curing data quality issues involves growing a shared understanding, awareness, and context across the entire ecosystem. This requires getting all of the people involved in data to agree on what needs to be done, then translating that agreement into the actual workflows of data producers and consumers.
One effective strategy is to develop Service Level Agreements (SLAs) — mutual agreements on how data should be handled, considering each group's needs and constraints. These agreements should ideally be created and maintained by cross-functional teams made up of data scientists, analysts, business leaders, IT professionals, and anyone else who has a stake in data quality at the company.
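For illustration, an SLA like this can be captured as a small, versioned definition that both producers and consumers can read. The fields, dataset name, and thresholds below are assumptions, not a standard format.

```python
from dataclasses import dataclass

@dataclass
class DataSLA:
    """A hypothetical, human-readable agreement between data producers and consumers."""
    dataset: str
    owner: str                 # producer team accountable for the data
    consumers: list[str]       # teams relying on it
    freshness_hours: int       # data must be no older than this
    max_null_pct: float        # tolerated share of nulls in key columns
    response_time_hours: int   # how fast the owner acknowledges reported issues

orders_sla = DataSLA(
    dataset="analytics.orders_daily",
    owner="data-engineering",
    consumers=["finance", "growth-marketing"],
    freshness_hours=24,
    max_null_pct=0.01,
    response_time_hours=4,
)
```

The value is less in the code itself and more in the negotiation: every field above is a conversation between a producer and a consumer, written down where both can see it.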
AI tip: It’s important to recognize the dynamic nature of data quality. In the AI world, data quality is not a static goal but a moving target. As more people and applications use data, the requirements and expectations for that data will evolve. (This is actually good because it’s a sign of becoming more data-driven!) Make sure to establish a feedback loop and use it to continuously review and update data quality metrics to align with the changing needs of the business and users.
3. Prevention: How can we ensure we always have high-quality data?
This step focuses on sustainability — how can we take what we’ve learned in the Awareness and Cure steps and implement it in a way that prevents these same issues from cropping up regularly?
To be honest, I’d like to stop talking about data quality within the next few years. The better we can prevent data quality issues, the less we’ll get bogged down in whether the number on a dashboard is right, and the more we can focus on actually using it.
One powerful solution for data quality prevention is implementing data contracts. These establish agreements between different data stakeholders on how to handle quality checks and issues, ideally automating the process so people don’t have to focus on it constantly. The more you can automate data quality, the easier it will be for everyone.
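Here is a minimal sketch of what an automated contract check could look like, assuming a simple column-and-type contract and a pandas DataFrame produced upstream. The table, column names, and types are illustrative.

```python
import pandas as pd

# Hypothetical contract agreed between the producer and its consumers.
ORDERS_CONTRACT = {
    "order_id": "int64",
    "customer_id": "int64",
    "amount": "float64",
    "created_at": "datetime64[ns]",
}

def validate_contract(df: pd.DataFrame, contract: dict[str, str]) -> list[str]:
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for column, expected_dtype in contract.items():
        if column not in df.columns:
            violations.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            violations.append(f"{column}: expected {expected_dtype}, got {df[column].dtype}")
    return violations
```

Run a check like this in CI or in the pipeline itself, and a breaking schema change fails loudly on the producer’s side instead of silently breaking a dashboard (or an AI application) downstream.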
Scaling data quality initiatives effectively also requires tools and technologies designed to streamline data management and improve quality. Automated data lineage tracking, anomaly detection, and data quality monitoring can significantly reduce errors and help teams resolve issues quickly.
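As one example of automated monitoring, even a simple statistical check on daily row counts can flag an anomaly before a consumer notices it. The threshold and the numbers below are made up for illustration.

```python
import statistics

def is_anomalous(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it sits more than z_threshold std devs from the recent mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

recent_row_counts = [10_120, 9_980, 10_240, 10_050, 10_190, 9_900, 10_310]
print(is_anomalous(recent_row_counts, today=4_200))  # True, worth an alert
```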
AI tip: With AI applications being created left and right, it’s important to ensure that the right people get the right context at the right time in a way that's safe and secure. For example, if someone in HR asks a question to an AI, it might be appropriate for payroll data to be included in the answer. However, that same data shouldn't be used across the rest of the company. Active data governance (including automated role-based access controls, data masking techniques, and monitoring mechanisms) can help protect sensitive information, uphold data compliance, and reinforce stakeholders’ trust.
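A rough sketch of what role-aware masking could look like before context is handed to an AI application is below. The roles, fields, and masking rule are assumptions for illustration, not a reference implementation.

```python
# Fields that only specific roles should ever see in an AI-generated answer.
SENSITIVE_FIELDS = {"salary": {"hr", "finance"}, "ssn": {"hr"}}

def mask_record(record: dict, requester_role: str) -> dict:
    """Redact sensitive fields the requesting role is not allowed to see."""
    masked = {}
    for field, value in record.items():
        allowed_roles = SENSITIVE_FIELDS.get(field)
        if allowed_roles is not None and requester_role not in allowed_roles:
            masked[field] = "***REDACTED***"
        else:
            masked[field] = value
    return masked

employee = {"name": "Ada", "salary": 95_000, "ssn": "000-00-0000"}  # fake example record
print(mask_record(employee, requester_role="hr"))         # full record
print(mask_record(employee, requester_role="marketing"))  # salary and ssn redacted
```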
I just discussed data quality in the age of generative AI on the DataFramed podcast with Barr Moses (Monte Carlo Data) and George Fraser (Fivetran)!
📚 More from my reading list
The data engineer's guide to mastering systems design by Yordan Ivanov
The rise of AI data infrastructure by Astasia Myers and Eric Flaningam
Mastering AI department reorganizations by Elad Cohen
Reducing data questions deluge by Ergest Xheblati
Deliver on the data needs, not the data desires by Dylan Anderson
Why your generative AI projects are failing by Ben Lorica
What are integrations and how do they work? by Justin Gage
WTF is a “data team”? by Joe Reis
Don’t lead a data team before reading this by SeattleDataGuy
Top links from last issue:
What 10 years at Uber, Meta and startups taught me about data analytics by Torsten Walbaum
7 data modeling concepts you must know by Madison Mae
The danger zone in data science by Duncan Gilchrist