Databricks Introduces Data Lineage For Unity Catalog


New data lineage capabilities give customers more transparency and proactive control over how data is used in their lakehouse

Databricks announced data lineage for Unity Catalog, significantly expanding data governance capabilities on the lakehouse. Data lineage describes how data flows throughout an organisation. Using this new feature of Unity Catalog, customers can gain visibility into where data in their lakehouse came from, who created it and when, how it has been modified over time, how it’s being used, and much more. Data lineage for Unity Catalog is now available for preview on AWS and Microsoft Azure.

Organisations deal with an influx of data from multiple sources, and understanding where that data came from, how it’s moving and changing, who has access to it, and how it’s being used is extraordinarily difficult. However, that understanding is essential to ensuring trust and assessing risk. With data lineage for Unity Catalog, data teams can see all the downstream consumers affected by a data change (applications, dashboards, machine learning models, data sets and so on), gauge the severity of the impact, and quickly notify the relevant stakeholders.

Data lineage empowers data consumers, such as data scientists, data engineers and data analysts, to be context-aware as they perform analyses, resulting in better-quality outcomes. Additionally, data stewards can see which data sets are no longer accessed or have become obsolete and retire them, reducing risk and ensuring end users only use high-quality data. The new capabilities within Unity Catalog give businesses a complete view of the entire data lifecycle, so data leaders can understand how data is being collected, whether it was updated, and which processes were used.

“Governance capabilities such as data lineage are critical as we work to build the industry’s most robust lakehouse platform. Without good data lineage, it is challenging to track the business and verification processes that data-driven organisations need to succeed. Our goal is to ensure our customers can focus on insights and move toward proactive data management practices through a unified, transparent view of their entire data ecosystem,” said Matei Zaharia, Co-Founder and Chief Technologist, Databricks.

Key features of Unity Catalog include automated run-time lineage, which captures all lineage generated in Databricks and is more accurate and efficient than manually tagging data. This information is captured for tables, views, and columns to give a granular picture of upstream and downstream data flows. Additionally, lineage works across all workloads supported by Databricks, including SQL, Python, R, and Scala, allowing all data personas to augment their tools with data intelligence and better insights. This includes capturing lineage for entities like notebooks, workflows, and dashboards.
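
To make the idea of upstream and downstream data flows concrete, the following is a minimal sketch of lineage modelled as a directed graph. It is not the Unity Catalog API or how Databricks stores lineage; the table names, the `EDGES` list and the helper functions `build_graph` and `reachable` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (source -> target). Every name here is
# illustrative and not taken from any real lakehouse.
EDGES = [
    ("raw.orders", "clean.orders"),
    ("clean.orders", "analytics.daily_revenue"),
    ("clean.orders", "ml.churn_features"),
    ("analytics.daily_revenue", "dashboard.revenue_overview"),
]


def build_graph(edges):
    """Return (downstream, upstream) adjacency maps for the given edges."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return downstream, upstream


def reachable(graph, start):
    """Breadth-first search: every node reachable from `start`, excluding it."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


downstream, upstream = build_graph(EDGES)

# Everything affected if clean.orders changes (impact analysis).
impacted = reachable(downstream, "clean.orders")

# Everything the dashboard ultimately depends on (provenance).
sources = reachable(upstream, "dashboard.revenue_overview")
```

In this toy model, walking the graph forwards from a table lists the consumers that a change would affect, and walking it backwards from a dashboard lists the data it was derived from. Column-level lineage could be sketched the same way by using column names as nodes.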

Data lineage also helps organisations better meet compliance standards, making it easier to keep track of data flows that are subject to regulations such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA) and the Health Insurance Portability and Accountability Act (HIPAA). This element of data traceability is a vital ingredient of a modern data architecture that allows customers to meet their legal requirements.