The dark underbelly of your data ecosystem

✨ Spotlight: The 68% of your company's data that never gets used

Nov 16, 2022

While we like to think that we cherish every piece of data, that’s far from the truth. Just like chips and children, we all have favorites.

https://media.giphy.com/media/FfZNB4NRumHyJLN5sh/giphy.gif

I recently came across the term “dark data”, which refers to data that never actually gets used. It apparently makes up most of the data at companies today: one report found that only 32% of company data is used, while the other 68% goes unleveraged.

No more, we say! In today’s Metadata Weekly, we’ll dig into what dark data is, why it happens, and how metadata can step in to save the day.

✨ Spotlight: Keep your data from going to the dark side

Gartner defines dark data as “the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships, and direct monetizing).”

Many organizations collect, tag, bookmark, and store data for future insights. However, it ends up unused and eventually becomes stale — resulting in dark data.

Compliance requirements (e.g. HIPAA and GDPR), marketing and sales campaigns, and customer information (e.g. call records, video presentations, and digital content) are particularly prone to creating dark data.

💸 The cost of dark data

Data storage: In 2019, Netflix was spending $9.6 million per month to store its data in AWS. If 68% of that data goes unused, it was spending about $6.5 million per month on dark data.
Data breaches: When you don’t understand where all your data is or what it contains, you leave yourself open to security threats and breaches. In 2020, Equifax had to pay $1.38 billion to settle a class action lawsuit over its 2017 data breach.
Data regulations: Large banks spend around $88 million to collect and store customer data for compliance and regulatory purposes. Most of this data is never used but still has to be stored properly.
Data ROT: When your data is unorganized and opaque, it’s easy to accidentally use ROT (Redundant, Obsolete, Trivial) data. Bad data caused an average $15 million in losses per year in 2018.
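
To make the storage math above concrete, here is a minimal sketch of the same back-of-the-envelope estimate. It assumes storage cost scales linearly with data volume, which is a simplification (real cloud bills vary by storage tier and access pattern). The function name `dark_data_cost` is made up for illustration.

```python
def dark_data_cost(monthly_storage_cost: float, unused_fraction: float) -> float:
    """Estimate the monthly storage spend attributable to unused data.

    Assumption: cost is proportional to volume, so the unused share of
    data accounts for the same share of the bill.
    """
    if not 0.0 <= unused_fraction <= 1.0:
        raise ValueError("unused_fraction must be between 0 and 1")
    return monthly_storage_cost * unused_fraction


# Figures quoted in the newsletter: $9.6M per month, 68% unused.
estimate = dark_data_cost(9_600_000, 0.68)
print(f"About ${estimate / 1_000_000:.1f} million per month on dark data")
```

Swap in your own monthly bill and your own estimate of the unused share to get a rough number for your company.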

💪 How to discover dark data with active metadata

Ready to manage and reduce your dark data — or even just figure out how much you have? Dealing with dark data is a three-step process:

First, figure out where you stand by increasing visibility into your data. With a third-gen data catalog, you can dig into key types of metadata to figure out how bad your dark data actually is.

Staleness: Rank data assets by when they were last updated or modified. If data hasn’t been touched in a while, it’s probably going stale.
Popularity: Rank data by its popularity, or how often it is used. A low popularity score can indicate untrustworthy or unimportant data assets.
Provenance: Show how downstream and upstream applications are using stored data assets. If there are no pipelines writing to or reading from an asset, maybe it’s not worth paying for. (One of our customers cut $50,000 in storage costs just by finding and removing an unused BigQuery table. 🤯)
Quality: Rank data by its quality score or metrics. Low quality data assets — e.g. ones with null or duplicate values, incorrect patterns, missing data, etc. — are candidates for getting fixed or discarded.
Redundancy: Identify redundant copies of data in multiple systems. This is an easy way to reduce costs and delete duplicate data.
Classification: Identify any unclassified, untagged, or unlabeled data assets. These can be fixed to help people use them (especially for sensitive data) or deleted entirely.
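
Here is a rough sketch of how a few of these signals (staleness, popularity, provenance, and classification) could be combined to rank assets. Everything in it is an assumption for illustration: the `Asset` fields, the 180-day staleness threshold, and the sample assets are hypothetical, and a real catalog would pull this metadata from query logs and lineage rather than hard-coded values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Asset:
    """Hypothetical metadata record for one data asset."""
    name: str
    last_modified: datetime
    queries_last_90d: int
    readers: int = 0  # downstream pipelines or dashboards reading the asset
    writers: int = 0  # upstream pipelines writing to the asset
    tags: list[str] = field(default_factory=list)


def dark_signals(asset: Asset, now: datetime,
                 stale_after: timedelta = timedelta(days=180)) -> list[str]:
    """Return the dark-data warning signs that apply to an asset."""
    signals = []
    if now - asset.last_modified > stale_after:
        signals.append("stale")          # staleness
    if asset.queries_last_90d == 0:
        signals.append("unused")         # popularity
    if asset.readers == 0 and asset.writers == 0:
        signals.append("orphaned")       # provenance
    if not asset.tags:
        signals.append("unclassified")   # classification
    return signals


now = datetime(2022, 11, 16)
assets = [
    Asset("orders", datetime(2022, 11, 15), 420, readers=5, writers=1, tags=["finance"]),
    Asset("legacy_events_2019", datetime(2020, 1, 10), 0),
    Asset("campaign_export", datetime(2022, 3, 1), 2, writers=1),
]

# Assets with the most warning signs come first.
ranked = sorted(assets, key=lambda a: len(dark_signals(a, now)), reverse=True)
for asset in ranked:
    print(asset.name, dark_signals(asset, now))
```

Counting signals is the simplest possible ranking. You could just as easily weight them, for example by treating "orphaned" as a stronger hint than "unclassified".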

Second, let tech do the heavy lifting, rather than acting on this information manually. Active metadata platforms are great for this sort of tedious work. They can be used to create custom, metadata-driven automations, freeing up data teams to focus on the work that actually matters.

Here are some examples of how active metadata can be used to fight dark data:

Regularly give each asset a custom relevance score or freshness status based on query logs, updates, and metadata
Automatically purge stale assets based on the number of users and frequency of use
Detect and remove duplicate assets with lineage analysis and data diff comparisons
Reduce pipeline costs by optimizing data processing for unused or less-used data
Test data for quality and compliance before it passes through a data pipeline
Track how data is being used with a holistic, sliceable reporting view of your data landscape
Personalize data workspaces to make it easier for people to find and use the right data
Establish approval workflows on how data is used, and track activity to prevent data spillage
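
As one illustration of the duplicate-detection idea above, here is a minimal sketch that groups tables by a content fingerprint. It is not how any particular platform does it: the table names and data are invented, the tables are held in memory, and a real setup would more likely compute checksums inside the warehouse and use lineage to decide which copy to keep.

```python
import hashlib
from collections import defaultdict


def fingerprint(columns, rows):
    """Hash a table's column names and rows, ignoring row order."""
    digest = hashlib.sha256()
    digest.update("|".join(columns).encode())
    # Sorting the row representations makes the hash order-insensitive.
    for row_repr in sorted(repr(tuple(row)) for row in rows):
        digest.update(b"\n" + row_repr.encode())
    return digest.hexdigest()


def find_duplicates(tables):
    """Return groups of table names whose contents are identical."""
    groups = defaultdict(list)
    for name, (columns, rows) in tables.items():
        groups[fingerprint(columns, rows)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical tables: the second is a reordered copy of the first.
tables = {
    "warehouse.customers": (["id", "email"], [(1, "a@x.com"), (2, "b@x.com")]),
    "lake.customers_copy": (["id", "email"], [(2, "b@x.com"), (1, "a@x.com")]),
    "warehouse.orders": (["id", "total"], [(1, 9.5)]),
}

duplicates = find_duplicates(tables)
print(duplicates)
```

Exact-match fingerprints only catch identical copies. Near-duplicates, such as a copy with one extra column, would need a proper data diff.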

Third, support these efforts with an anti-dark-data culture. Evangelize the importance of purging stale data, avoiding data duplication, and upholding great data hygiene.

This is no easy task, but a big part is democratizing your data and preventing data silos with a third-generation data catalog. After that, building a great data culture can help you set rituals that support data productization and cleanliness.

This may seem like a lot of work for some neglected data, but like time, data is money. Ignoring over half of your data increases storage costs, wastes your data team’s time, decreases the quality of data work, and opens your company to massive security risks and fines. The metadata you need to find and eliminate dark data is already there — all you have to do is leverage it.

🌴 Metadata in action: root cause analysis

🥁🥁🥁 We’re excited to introduce a new section for Metadata Weekly — metadata in action! We’ll highlight real use cases of active metadata here each week, starting with root cause analysis.

As you probably know, answering the simple question “Why doesn’t that number look right?” isn’t always so simple. In many cases, getting to the root of the issue can be a long, complex process that involves a multitude of different tools, systems, and people along the way.

For instance, if something looks off in a dashboard, a data analyst would go into the dashboard to confirm this, then potentially go into Jira or Confluence to see if an issue has been reported. If not, they’d then Slack a data engineer, who would dive into their Snowflake and dbt instances.

So how can you speed up root cause analysis? One company reduced a six-hour process down to just 10 minutes with Atlan’s lineage! That’s about 97% of the time saved (350 of 360 minutes) every time they got that “simple” question.
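
To show why lineage shortens this hunt, here is a minimal sketch of walking upstream from a dashboard to find where a problem starts. It is not Atlan's API. The lineage graph, the asset names, and the set of failing assets are all hypothetical, and the sketch assumes you already know which assets are failing their checks.

```python
from collections import deque


def upstream_of(asset, lineage):
    """Breadth-first list of every asset upstream of the given one."""
    seen, order, queue = set(), [], deque([asset])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def root_cause_candidates(asset, lineage, failing):
    """Failing upstream assets whose own upstreams are all healthy."""
    return [
        node
        for node in upstream_of(asset, lineage)
        if node in failing
        and not any(parent in failing for parent in upstream_of(node, lineage))
    ]


# Hypothetical lineage: each asset maps to the assets it reads from.
lineage = {
    "revenue_dashboard": ["fct_revenue"],
    "fct_revenue": ["stg_orders", "stg_payments"],
    "stg_orders": ["raw.orders"],
    "stg_payments": ["raw.payments"],
}
failing = {"fct_revenue", "stg_payments", "raw.payments"}

candidates = root_cause_candidates("revenue_dashboard", lineage, failing)
print(candidates)
```

The idea is that the analyst starts at the furthest-upstream failing asset instead of checking the dashboard, the ticket tracker, and the warehouse one at a time.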

Learn more about how lineage can speed up root cause analysis:

📚 More from my reading list

The important purple people outside the data team by Mikkel Dengsøe
Manifesto for the data-informed by Julie Zhuo
Data mesh: making climate data easy to find, use, and share by Eric Broda
The fight for controlled freedom of the data warehouse by Barr Moses
An open letter to data ninjas by Ananth Packkildurai

See you next week!

P.S. Liked reading this edition of the newsletter? I would love it if you could take a moment and share it with your friends on social.

P.P.S. Kudos to Srinivasa Raghavan for co-writing this dark data content. 🙌

Metadata Weekly