Navigating Data Quality Issues? You Need Data Lineage


It’s not just you: data pipelines break in every type of organisation. The average organisation experiences about 70 data issues per year for every 1,000 tables in its environment.

What separates the high-performing data teams is what happens after the incident. Some catch and resolve these issues quickly, before internal or external data consumers notice. Others only find out after a nasty email arrives, and then take days or even weeks to determine the root cause of the issue.

This is problematic because the consequences of bad data escalate based on who discovers the issue. The longer the issue lingers, the greater the chance it negatively impacts the business in a big way – for example, when customers or the media discover it.

Data lineage is a map that traces the connections between your data assets – typically pipelines, transformation models, tables in the data warehouse or lake, and BI tools. This visualisation helps data teams understand the journey data takes from when it is first ingested until it is consumed.

It’s a critical capability because robust data lineage can dramatically lower a data team’s time to resolution and the number of data incidents. It does this by quickly tracing data incidents to the root cause furthest upstream and by surfacing the data assets that are most important to the business, allowing data teams to prioritise their effort and resources accordingly.

However, to navigate data quality issues efficiently, you must draw the right map. Here are four must-have data lineage features, whether you build or buy your solution.

Data lineage must be automated

Believe it or not, some data teams used to draw their data lineage map manually. In fact, most still maintain a high-level, tool-level overview of the integrations across their modern data stack.

While that can be helpful for onboarding or an executive briefing, static depictions of data lineage lack the level of detail needed to troubleshoot data issues. There are simply too many moving parts, with high degrees of interconnectivity, constantly being modified by multiple parties.

Luckily, data lineage can be automated. The metadata that illustrates how tables are connected to one another can be pulled and parsed from the SQL logs of a data warehouse or the metastore of a data lake. As the relationships between assets change, so does the metadata, which can be used to update the lineage map automatically.
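To make the idea concrete, here is a minimal sketch of table-level lineage extraction from a query log. The log entries, table names, and regex-based parsing are all illustrative assumptions; a production system would read the warehouse's actual query history and use a full SQL parser rather than a regular expression.

```python
import re
from collections import defaultdict

# Hypothetical warehouse query log: each entry is (target_table, sql_text).
# A real implementation would pull these from the warehouse's query history.
QUERY_LOG = [
    ("analytics.daily_revenue",
     "CREATE TABLE analytics.daily_revenue AS "
     "SELECT order_date, SUM(amount) FROM raw.orders GROUP BY order_date"),
    ("analytics.revenue_report",
     "INSERT INTO analytics.revenue_report "
     "SELECT d.order_date, d.total * c.usd_rate "
     "FROM analytics.daily_revenue d JOIN raw.currencies c ON c.day = d.order_date"),
]

# Naive source-table extraction: grab identifiers after FROM/JOIN keywords.
SOURCE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def build_lineage(query_log):
    """Map each target table to the set of upstream tables it reads from."""
    lineage = defaultdict(set)
    for target, sql in query_log:
        for source in SOURCE_PATTERN.findall(sql):
            if source != target:
                lineage[target].add(source)
    return dict(lineage)

lineage = build_lineage(QUERY_LOG)
```

Because the lineage map is rebuilt from the log on every run, it stays current as pipelines are modified, which is exactly what a hand-drawn diagram cannot do.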

Data lineage must be end-to-end

It’s not just the relationships between tables, and how data moves within the data lake or data warehouse, that matter, however. Modern data platforms are built with integrations across several layers, including ingestion, transformation/orchestration, storage, and visualisation.

Data lineage should provide the complete picture end-to-end across these layers, because changes in one system can affect how data behaves in another. For example, an analytics engineer could accidentally introduce bad code while modifying a dbt model that transforms a data set, creating data anomalies within the data warehouse.

By extending lineage end-to-end through the BI layer, you can also see how data assets are connected to data consumers. That lets you assess the impact of issues or changes to specific data assets. It can also help with strategic planning when reorganising the data team into a more decentralised structure, such as a data mesh.
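Impact analysis over an end-to-end lineage map is, at its core, a graph traversal. The sketch below assumes a hypothetical lineage graph spanning warehouse tables and BI dashboards; the asset names are made up for illustration.

```python
from collections import deque

# Hypothetical end-to-end lineage: each asset maps to the assets
# immediately downstream of it, from raw tables through to BI dashboards.
DOWNSTREAM = {
    "raw.orders": ["analytics.daily_revenue"],
    "analytics.daily_revenue": ["bi.revenue_dashboard", "bi.exec_summary"],
    "bi.revenue_dashboard": [],
    "bi.exec_summary": [],
}

def impacted_assets(root, downstream):
    """Breadth-first walk to find everything affected by a change to `root`."""
    seen, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

blast_radius = impacted_assets("raw.orders", DOWNSTREAM)
```

Running this before a schema change tells you which downstream consumers, including dashboards, need to be warned or updated.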

Data lineage must be at the field-level

Sometimes data engineers, analysts, or other data team members need to drill down to understand the provenance of a particular field. With particularly large data sets, some fields may be more reliable or relevant than others.

You could write a series of SELECT TOP 5 queries just to explore tables and determine which fields are reliable, or you could leverage field-level lineage to review the upstream tables of a specific field.

Table-level lineage can reveal several upstream tables, each with a few fields on which a report depends, but field-level lineage can pinpoint the single field in the single table that affects the one data point in the report you care about at that moment. That greatly accelerates your team’s work.
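Field-level lineage refines the graph from tables to (table, column) pairs. The following sketch traces a single report field back to the raw columns that feed it; the tables, columns, and derivation map are hypothetical examples.

```python
# Hypothetical field-level lineage: each (table, column) maps to the
# upstream (table, column) pairs it is derived from.
FIELD_LINEAGE = {
    ("report.kpis", "revenue"): [("analytics.daily_revenue", "amount_usd")],
    ("analytics.daily_revenue", "amount_usd"): [
        ("raw.orders", "amount"),
        ("raw.currencies", "usd_rate"),
    ],
}

def trace_field(table, column, lineage):
    """Recursively collect every upstream field feeding (table, column)."""
    upstream = []
    for parent in lineage.get((table, column), []):
        upstream.append(parent)
        upstream.extend(trace_field(*parent, lineage))
    return upstream

# Trace one report field all the way back to its raw source columns.
provenance = trace_field("report.kpis", "revenue", FIELD_LINEAGE)
```

Instead of exploring every upstream table by hand, a single lookup returns exactly the fields that can move the number in the report.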

Data lineage must have context

Data lineage is important for understanding the connections between each asset, but it must go further and help data teams understand the context around each asset. For example, once you’ve traced a data issue to the most upstream table, you need to understand:

  • Who owns this table?
  • How frequently is it used?
  • Has this table had issues before?
  • What recent changes have been made to the code generating this table or to the queries being run on it?
  • Is it a pipeline issue or is the problem with the data itself?

There are a few different types of solution that illustrate data lineage, and they differ in the context they provide. Data catalogues include helpful information on how data is used within the organisation as part of their lineage offerings, emphasising governance and a shared understanding. Data observability platforms also focus on how data is consumed, but with an emphasis on data health and other root-cause-analysis context.

In the end, it’s not lineage if it’s not useful

Data lineage is rapidly evolving, with the technology taking many different shapes and forms. However, a map is only helpful if you know where you want to go. Determine the end goal, or the initiatives already in place, that will benefit from data lineage.

Some common use cases beyond accelerated data anomaly resolution include:

  • Better knowledge management of key data assets
  • Expanding access via a self-service data initiative
  • Ensuring a common understanding of metrics and terminology
  • Creating more accountability and ownership within the data team
  • Understanding data flows to help the transition to a data mesh organisational structure

With this understanding, you’ll be ready to sail even the stormiest of seas (or data lakes).
