Click to learn more about author Wayne Yaddow.
DataOps, which focuses on automated tools throughout the ETL development cycle, responds to a huge challenge for data integration and ETL projects in general. ETL projects are increasingly based on agile processes and automated testing.
ETL (i.e., extract, transform, load) projects are often devoid of automated testing. The lack of automated testing is usually due to 1) critical ETL testing functions that are not available on the market or open source, 2) the complexity of some ETL testing tools, or 3) the high cost of developing tools in-house.
Part 1 in this two-part series described what makes DataOps processes valuable for ETL projects and a driving force for ETL testing automation. Part 2 features a “roadmap” toward ETL test automation, then offers guidance to help readers determine which ETL tests should be considered for automated testing.
A Roadmap to ETL Test Automation
Automated testing can often be implemented with tools such as iCEDQ, ServiceNow, Ranorex, GitLab, Jenkins, Tosca DI, Querysurge, and Bamboo. These tools help your project team implement automated ETL testing for multiple stages/phases of the data pipeline. In addition, the use of version control systems (VCS) such as Git, CVS, or Mercurial, can aid in keeping the code associated with your data pipeline in one place. This will expedite testing, bug fixes, and implementation of new features (see Figure 1).
The decision to implement automated tools for ETL testing depends on a budget that supports additional spending to meet advanced testing requirements. It is critical to keep in mind that ETL test tools built and maintained internally are better than no test automation.
Step-by-Step Planning for ETL Project Test Automation
As with all projects, decisions made in the planning phases of a test automation project prepare the ground for success or failure. For this reason, time should be allowed to set targets, analyze current testing processes, and build the right implementation team prior to launching an ETL test automation project.
- Analyze your current ETL-related testing process – from unit tests to component tests to Data Quality tests.
- Determine the stakeholders and IT team.
- Identify and prepare test scenarios as subjects for test automation.
- Search for and choose the best commercial or open-source ETL, database management, and Data Quality automation test tools that support the technologies used in your ETL project. Prepare an in-depth evaluation of each.
- Conduct proof-of-concept exercises, preferably in conjunction with tool vendors.
- Implement the selection of automation tools.
- Set up time for training and learning curve.
- Start automation of previously documented test cases.
- Plan a review of your testing process and results – adjust as needed.
Deciding Which ETL Test Scenarios Are a Priority for Test Automation
For test automation scenarios, assess your plans and identify the best candidates for automation based on risk and value. What types of defects would cause you to stop an integration or deployment? What types of tests do you use to verify critical data and functionality? What tests cover the application domains that are historically known for failure? What tests are providing information that is not already covered by other tests in the pipeline?
Common preferences for manual ETL testing:
- Exploratory tests: These kinds of tests require the knowledge, experience, analytical/logical skills, creativity, and intuition of the tester. Human skills are needed to execute the testing process in this scenario.
- Ad-hoc testing: In this scenario, there is no specific approach. It is often an unscheduled test method where the comprehension and insight of the tester is the important factor.
Common preferences for automated ETL testing:
- Source to target data reconciliation testing (including transformation testing, regression testing, smoke testing): Here, automated testing is suitable because of frequent code changes and the need to promptly run the regression assessments.
- Repeated execution: Testing that requires repeated execution of a task is better automated.
- Load tests: Automated tests are the best choice to complete load tests in an efficient way.
- Performance testing: Similarly, testing that requires the simulation of thousands of concurrent users requires automation.
- End-to-end testing: Data testing can take time, mainly due to the multiple steps, technologies, and large volume of data involved. Each stage of the ETL tests requires various tools, strategies, and types of tests. These include individual comparisons, migrated data validations, data transformation rule validations, reconciliations – from sources to targets – Data Quality verification, and preliminary testing of client information reports.
The numerous ETL test scenarios often considered for test automation and implementation of test automation tools (commercial, open-source, and internal tools) are presented in Table 1. Using a list of test scenarios such as this one is a good start on your way to the implementation of ETL test automation.
Table 1: Test scenarios and test cases frequently considered for automated ETL testing
Conclusions
Many data integration/migration teams have found that it is possible to be successful with multiple levels and types of automated ETL testing throughout their DataOps SDLC.
Automated testing will not replace all manual unit, component, and end-to-end testing on a DataOps project. However, the emphasis on automated testing will ensure that the more expensive manual testing is focused on high-risk, high-value activities.
Creating automated ETL tests is well worth the effort, especially in data warehouse and data pipeline projects. Automated tests can be run hundreds of times at a modest overall cost with greater accuracy.
We know that testing takes time. We know that testing costs money. Where planning and implementing test automation efforts reduce time and costs, that has to be good for your organization’s bottom line.