Data lineage can be described as a historical map of data’s journey within an organization. A use case is an example of how a service or technique can be applied, and data lineage use cases are the situations in which some form of data lineage proves useful.
Data lineage tools make tracking the data’s lineage much easier. (Attempting to trace data lineage manually is tedious and time-consuming.)
Creating data lineage maps requires recording how the data has been processed, transformed, and transmitted. These maps are commonly used to ensure that decisions are based on accurate data. Data lineage has become an important part of intelligent decision-making for many modern businesses. However, tracking the flow of data from source to destination can be complex.
Trust in the data depends on understanding where it came from and how it has been transformed and processed. Data lineage lets organizations see how datasets have been used and what changes have been made to them. It allows businesses to understand errors and correct them at their source. A data lineage program supports improved Data Quality.
Making good decisions that are based on data requires being able to trust the data.
An understanding of various data lineage use cases, techniques, and tools can be quite useful before hiring a contractor or purchasing software. Selecting the wrong data lineage software can result in wasted time and money, and the eventual purchase of more appropriate software. (There is a link to open-source data lineage tools at the end of this article.)
Data Lineage Techniques and Processes
Data lineage includes a variety of tools and processes. Modern businesses rely increasingly on real-time insights to improve their operations, business transactions, and customer relations, but those insights hinge upon understanding the data. There are a variety of ways managers and technical staff can leverage data lineage tools to improve workflows.
- Automation: Data lineage recording is typically automated. Tracking and recording the data’s lineage manually is simply not realistic by today’s standards of efficiency. Automation also makes regulatory compliance (GDPR, CCPA, etc.) much easier by marking data as personally identifiable information (PII), allowing it to be designated confidential or private.
- Data tagging: Data that has been transformed or moved is tagged. The tags are then tracked, from start to finish, providing a lineage representation. This tag tracking process, however, requires a “consistent transformation tool” that controls all data movement. (The consistency of this tool is important, so choose the tool that best fits your system.)
- Parsing: This technique tracks the data by reading the logic used to process it, capturing changes as the data is processed, altered, and moved across different systems. (Parsing is the process of analyzing a string of symbols, in this case the code that transforms the data.) However, parsing requires a good understanding of the tools and programming languages used throughout the data’s lifecycle.
- Metadata source reviews: Data lineage tools use metadata to create a visual representation of the data flow, which allows users to see how data is transformed and used throughout the organization. Some data lineage tools can also verify metadata sources and help identify the causes of bugs and errors within the data sets.
- Pattern-based lineage: Instead of working with code, this form of data lineage looks for patterns. It relies on metadata to develop the lineage patterns. The primary advantage of this technique is that there is no need to understand the programming languages used to process the data. The data is monitored, not the code or algorithms that process it.
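To make the parsing technique more concrete, here is a minimal sketch in Python. It assumes the transformation logic is a simple SQL `INSERT INTO ... SELECT` statement and uses two regular expressions to pull out the source and target tables. The function name `extract_lineage`, the patterns, and the table names are hypothetical and chosen only for illustration. Real lineage tools generally rely on a full SQL parser rather than regular expressions, so treat this as an outline of the idea, not a production approach.

```python
import re

# Simplified, hypothetical patterns. They only cover plain
# "INSERT INTO <table> SELECT ... FROM <table> JOIN <table>" statements.
TARGET_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql: str) -> list[tuple[str, str]]:
    """Return (source_table, target_table) edges found in one statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        # No write target, so the statement creates no lineage edge.
        return []
    # Deduplicate and sort the source tables for a stable result.
    sources = sorted({name.lower() for name in SOURCE_RE.findall(sql)})
    return [(source, target.group(1).lower()) for source in sources]


# Example statement with made-up table names.
statement = """
INSERT INTO reporting.daily_sales
SELECT o.order_date, SUM(o.amount)
FROM raw.orders o
JOIN raw.customers c ON c.id = o.customer_id
GROUP BY o.order_date
"""

edges = extract_lineage(statement)
for source, target in edges:
    print(f"{source} -> {target}")
```

Each pair produced this way is one edge in the lineage map: a table that was read, and the table that was written from it. Collecting these edges across all of an organization's transformation code is what allows a tool to draw the flow of data from source to destination.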
Use Cases for Data Lineage
Data lineage is essential to the development of high-quality data, and it supports a variety of use cases. Each use case provides a different service.
- Data issues analysis: The source of data issues or problems can be traced quickly and efficiently using data lineage records. It can locate data issues by tracing how the data flows through the organization’s systems. This process can be particularly useful in complicated data environments using data collected from a variety of systems. (It can also record data changes over time, detecting anomalies or patterns that might indicate a problem.)
- Data cleaning: Data lineage can also be used to clean up a data system by identifying data that is old or irrelevant, so that it can be archived or deleted. (This can improve the overall performance of a data system by reducing the volume of data being stored.)
- Compliance: Data lineage can be used to address compliance issues, improve risk management, and ensure that data is processed and stored according to Data Governance guidelines.
- Data Modeling: Organizations can use data lineage for data modeling purposes. It can provide the information needed to present visual representations of differing data components and their connections. Showing these connections in the model reveals the dependencies present throughout the data ecosystem.
- Data Quality: Data is constantly being altered. A data lineage system provides detailed information showing the lifecycle and location of data sources. It tracks the data from its source and allows businesses to examine its transformations and trustworthiness.
- Finding errors: Data lineage can be used to identify incorrect assumptions about the data, locating the error and allowing it to be corrected.
- Impact analysis: Data lineage can be used to help create detailed impact assessments. The upstream and downstream effects caused by any specific modifications can be detected through the use of data lineage.
- Data migration: When data is moved to a new or different storage system, organizations need to know the data’s locations and lifecycles. Because data lineage shows how the data flows through an organization, it can help in the planning phase of data migrations and upgrades.
- More efficient DataOps: A better understanding of the data’s lineage helps to streamline data operations and avoid errors.
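Several of the use cases above, such as impact analysis and data issues analysis, come down to walking a lineage graph in one direction or the other. The sketch below assumes lineage has already been recorded as a list of (source, target) pairs. The dataset names and the helper functions `build_graph` and `reachable` are hypothetical and used only for illustration. It is a minimal outline of the idea, not a description of how any particular lineage product works.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source dataset, target dataset).
EDGES = [
    ("crm.customers", "staging.customers"),
    ("shop.orders", "staging.orders"),
    ("staging.customers", "warehouse.customer_orders"),
    ("staging.orders", "warehouse.customer_orders"),
    ("warehouse.customer_orders", "dashboard.revenue"),
]


def build_graph(edges, reverse=False):
    """Map each dataset to the datasets directly fed by it (or feeding it)."""
    graph = defaultdict(set)
    for source, target in edges:
        if reverse:
            graph[target].add(source)
        else:
            graph[source].add(target)
    return graph


def reachable(graph, start):
    """Breadth-first search: every dataset reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}


# Impact analysis: what is affected downstream if shop.orders changes?
downstream = reachable(build_graph(EDGES), "shop.orders")

# Data issues analysis: which upstream datasets could explain a bad dashboard?
upstream = reachable(build_graph(EDGES, reverse=True), "dashboard.revenue")

print(sorted(downstream))
print(sorted(upstream))
```

The same edge list answers both questions. Following the edges forward shows what a change would affect, and following them backward narrows down where an error may have originated.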
Real-World Data Lineage Use Cases
Some larger businesses, such as Airbnb, Netflix, UBS, Slack, and Postman, are now using data lineage – and increasing profits in the process. These companies did their research and decided it was worthwhile to install data lineage software. They required reliable data for good decision-making. Data lineage provides the visibility needed to deal effectively with data migrations, system updates, and errors, ensuring data integrity throughout the data’s lifecycle.
The data lineage industry is fairly new, and as a consequence is still a little on the expensive side. A number of large businesses have hired contractors to develop and install data lineage systems.
One real-world example of a data lineage use case is British Airways and its response to a data breach. In September 2018, British Airways was hit with a data breach that exposed the credit card and personal information of 380,000 customers. Using data lineage, the company could trace the breach to a malicious script on its website. By tracing the data back to the script, it identified and repaired the issue quickly.
Another real-world data lineage use case is Air France, which faced problems with data processing and data segregation. Its business had grown to the point where it was processing over 2.5 million new visitors on its website, and keeping track of all that data from its various databases had become very difficult. Air France partnered with Talend and developed a new data lineage system that allowed the company and its data scientists to deliver personalized advertising and real-time updates. They did this without breaching the GDPR.
Want to learn more about leveraging data lineage? Here are a few standardized data lineage programs. There are also some open-source data lineage programs.