Today’s data pipelines use transformations to convert raw data into meaningful insights. Yet ensuring the accuracy and reliability of these transformations is no small feat: the variety of data and transformations to test, and the tools and methods available for testing them, can be daunting.
Transformations generally involve converting raw data into cleansed, validated data that is ready for use by individuals or systems. Data transformations are essential for data management, integration, migration, wrangling, and warehousing. Data transformation can be any of the following (a minimal pandas sketch follows this list):
- Constructive – Data is replicated, added, or copied
- Destructive – Tables, records, or fields are deleted
- Aesthetic – Data values are standardized to meet formatting requirements
- Structural – Data is reorganized by renaming, combining, or moving columns
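The sketch below illustrates each of the four types in pandas. The DataFrame, column names, and date format are purely illustrative assumptions, not drawn from any particular system.

```python
import pandas as pd

# Illustrative customer records (all names and columns are hypothetical)
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "first_name": ["Ada", "Grace", "Alan"],
    "last_name": ["Lovelace", "Hopper", "Turing"],
    "signup_date": ["01/05/2024", "02/17/2024", "03/09/2024"],
    "legacy_code": ["A1", "B2", "C3"],
})

# Constructive: add a derived (copied) column
df["full_name"] = df["first_name"] + " " + df["last_name"]

# Destructive: delete a field that is no longer needed
df = df.drop(columns=["legacy_code"])

# Aesthetic: standardize values to a single required format (ISO dates)
df["signup_date"] = pd.to_datetime(
    df["signup_date"], format="%m/%d/%Y"
).dt.strftime("%Y-%m-%d")

# Structural: rename and reorder columns to reorganize the data
df = df.rename(columns={"customer_id": "id"})[["id", "full_name", "signup_date"]]
print(df)
```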
AI models, for example, require large volumes of data, some of it unstructured and ungoverned. Many enterprise architectures lack a modern data quality strategy and are unprepared for the complexity and high computing demands of AI workloads. As a result, weaknesses in the quality and integrity of the underlying data produce outcomes that are frequently untrustworthy, unpredictable, and outdated.
Addressing the prevalent issues causing poor data quality – such as inadequate source data profiling and cleansing, poorly designed or error-prone transformations, ineffective testing, and insufficient validation – can significantly improve the reliability of data pipelines. Understanding the impact of transformation errors on data quality emphasizes the necessity for meticulous planning and execution of tests. By focusing on these areas, organizations can mitigate risks and enhance the accuracy and consistency of their data.
Common Data Transformations in Data Pipelines
Data transformations are essential to most data pipelines, ensuring that raw data is prepared and optimized for further use. These transformations help refine, structure, and enhance data, making it suitable for various analytical, operational, or reporting tasks. The following are some common data transformations (a combined pandas example follows the list):
Aggregation: Raw data may need to be summarized or aggregated for analysis and reporting, for instance, by computing sums, averages, and counts.
Cleaning: Raw data is often messy. It may contain errors, duplicates, missing values, or anomalies. Transformations help clean this data by correcting or eliminating these issues.
Enrichment: Data from one source might be enhanced or enriched using data from another. This often involves combining or joining datasets together.
Feature engineering: In machine learning, raw data may not always be suitable for training models. Transformations can help create new features that better represent the underlying patterns in the data.
Formatting: Data may be transformed from one format to another to make it compatible with different tools or platforms.
Masking: To maintain privacy or meet regulatory requirements, personal or sensitive data may need to be masked or transformed to make it hard or impossible to trace back to individuals.
Normalization: Data from various sources may be in different formats or scales. Transformations can help normalize this data to a consistent format or scale.
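The following sketch combines several of these transformations in pandas. The orders/customers schema, column names, and masking approach are hypothetical assumptions for demonstration only.

```python
import hashlib

import pandas as pd

# Illustrative order and customer data (schemas are hypothetical)
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer_id": [10, 11, 11, 10],
    "amount": [100.0, 250.0, 250.0, None],
    "email": ["a@x.com", "b@y.com", "b@y.com", "a@x.com"],
})
customers = pd.DataFrame({"customer_id": [10, 11], "region": ["US", "EU"]})

# Cleaning: remove duplicate rows and fill missing values
orders = orders.drop_duplicates().fillna({"amount": 0.0})

# Enrichment: join in attributes from a second source
orders = orders.merge(customers, on="customer_id", how="left")

# Masking: hash PII so values cannot be traced back directly
orders["email"] = orders["email"].map(
    lambda e: hashlib.sha256(e.encode()).hexdigest()
)

# Normalization: rescale amounts to a consistent [0, 1] range
amt = orders["amount"]
orders["amount_scaled"] = (amt - amt.min()) / (amt.max() - amt.min())

# Aggregation: summarize totals per region for reporting
report = orders.groupby("region", as_index=False)["amount"].sum()
print(report)
```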
How much of a concern is data quality for data pipeline initiatives? For example, model-centric AI techniques prioritize efficiency by optimizing model architecture and parameters. However, this strategy may overlook the quality and relevance of the data used to train AI models, resulting in models that lack resilience when confronted with new data. AI and machine learning models reflect the data on which they are trained: if that data is of insufficient quality or incomplete, the resulting model is compromised.
“Data quality can be a showstopper for AI. Many enterprise leaders who were hot on the potential of AI are realizing that their efforts will be dead in the water if the data they are using to train and populate their AI models is inadequate, inaccurate, or not timely.” (Joe McKendrick, “Taming the Data Quality Issue in AI,” Enterprise AI Sourcebook, Information Today, 2024)
Tests of Transformed Data Can Be Challenging
During data integration, multiple obstacles can arise when evaluating transformed source and target data:
Change management: Changes in source data or transformation processes can lead to discrepancies if not tracked and managed appropriately.
Complexity of data relationships: Understanding and maintaining the relationships between data entities can be difficult, especially in complex datasets.
Data quality issues: Inconsistencies, inaccuracies, or missing values in the source data can lead to unreliable transformed data.
Data transformation logic errors: Mistakes in the transformation rules or logic can result in incorrect data outputs.
Performance bottlenecks: Large volumes of data can slow down the evaluation process, making it challenging to assess data quality promptly.
Schema mismatch: Data structure or format differences between source and target systems can complicate integration.
Testing and validation: Verifying that transformed data meets business requirements and accuracy targets can be resource-intensive.
Addressing these challenges requires careful planning, robust data governance, and effective testing strategies, as illustrated by the validation sketch below.
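As a minimal illustration, this sketch assumes pandas DataFrames for source and target data and checks schema, completeness, accuracy, and quality. The column names and tolerance are hypothetical.

```python
import pandas as pd

def validate_transformation(source: pd.DataFrame, target: pd.DataFrame) -> list:
    """Return a list of validation failures; an empty list means all checks passed."""
    failures = []

    # Schema: target must contain the expected columns (checked first,
    # because the remaining checks depend on these columns existing)
    expected = {"id", "amount", "region"}
    missing = expected - set(target.columns)
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    # Completeness: no rows silently dropped or duplicated
    if len(target) != len(source):
        failures.append(f"row count changed: {len(source)} -> {len(target)}")

    # Accuracy: aggregates preserved within a small tolerance
    if abs(source["amount"].sum() - target["amount"].sum()) > 1e-6:
        failures.append("amount totals diverge between source and target")

    # Data quality: no null keys after transformation
    if target["id"].isnull().any():
        failures.append("null keys found in target")

    return failures
```

In practice, checks like these would run inside a test framework such as pytest, or as a pipeline step that halts promotion of data when failures are returned.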
When and How Data Errors Occur in Data Pipeline Workflows
The percentages in Chart 1 are estimated based on common industry insights and experience related to data pipeline workflows. Here’s how these estimates are generally derived:
- Data transformation errors (45%): Numerous reports and articles emphasize that transformation is one of the most error-prone stages due to the complexity of logic, schema changes, and data manipulation. Papers like “Big Data Testing Techniques” and industry blogs like Datafold’s article on testing data pipelines (Datafold report) highlight that transformation steps often introduce logic bugs, incorrect joins, and aggregation issues.
- Data ingestion errors (15%): Data ingestion issues arise from incorrect configuration of sources, network problems, and file formatting issues. Sources like Thoughtworks’ article on pipeline testing mention ingestion as a significant challenge, though less frequent than transformation errors.
- Schema mismatches (10%): Problems often arise when a data schema evolves or differs from expectations, as seen in the Data Pipeline Quality study (available on arXiv). These errors typically occur during schema validation or component testing stages.
- Data quality issues (20%): Many organizations experience quality issues such as missing or corrupt data. Such issues are common due to faulty input data, transmission corruption, or improper transformation handling.
- Integration errors (5%): Integration issues (e.g., between systems or APIs) are less frequent but still notable, mainly when using external systems like APIs.
Chart 1: Survey – A breakdown of data errors in data pipeline workflows by cause
Testing Data Transformation: A Distinct Testing Process
Training a machine learning model to test data transformations in a data pipeline involves understanding transformation logic, identifying potential errors, and leveraging labeled data to build a model that can detect these errors automatically. This approach improves data quality and ensures that transformed data is accurate and reliable. Data transformation testing focuses specifically on the correctness of transformations (see Table 1).
Data pipeline testing covers the broader workflow from data ingestion to output, ensuring overall pipeline quality and functionality.
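The sketch below is one hedged interpretation of the labeled-data approach described above: a scikit-learn classifier trained on synthetic, labeled per-record features of transformed data. The feature names, labels, and model choice are assumptions for illustration, not a prescribed method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic per-record features, e.g., value drift, null ratio,
# schema distance, and row-count delta (all hypothetical)
X = rng.normal(size=(1000, 4))
# Stand-in for human-provided labels: 1 = transformation error, 0 = correct
y = (X[:, 0] + X[:, 2] > 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a classifier on the labeled examples
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Flag likely transformation errors in new pipeline output
print(classification_report(y_test, model.predict(X_test)))
```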
Table 1: Examples of data transformation testing vs. data pipeline testing.
Conclusion
Part 1 of this blog explained the basic concepts of data transformation testing. The requirements, design, and development of data transformations are primary sources of data errors. This section discussed how transformation-related errors rank among other errors in data pipelines, along with several important factors to consider when planning data transformation testing (e.g., scope of testing, primary goals, specific errors to test).
Part 2 will describe verifications for complex types of data transformations, identifying the functions of a diverse array of transformations along with test cases and scenarios.
Part 3 will address several test planning challenges, including insufficient transformation requirements documentation and the need for data profiling before, during, and after testing.
Additionally, Part 3 will include a link to a table that lists over 50 data transformation problems along with their possible effects on data quality, machine learning (ML), and data analytics reporting. Tests and remediation instructions will be included for every issue.