Many of us have experienced moving to a new home or city. And in any moving process, it’s common to end up with missing valuables or broken items that leave you wondering if you should have packed better, picked a different moving company, or just thought through the integrity of your valuables before, during, and after the move. In my experience working in the data integration and replication space, many customers share similar concerns when moving their data. How can you trust the integrity of the data being moved and delivered from its “home” to your business users?
When your cloud data movement projects involve hundreds of gigabytes of data per day, latency and data integrity can appear to be at odds. However, most enterprises fueled by data-driven decisions just can’t accept this perceived trade-off. For example, in a use case of moving financial data into a cloud-based data lake like Amazon S3, low latency and high fidelity are equally critical and can’t compete for priority.
Considering that the success of data movement projects depends on the integrity of the data being delivered, let's examine why data quality can be compromised within a data pipeline, what data validation solutions exist, and what to look for when evaluating data validation options.
Data Types
The most common root cause of the databases on either end of a data movement (let's call them source and target from here on) falling out of sync is how each endpoint database handles the data. Take trailing white spaces, for example. How does each database type handle them?
When the source is an Oracle database with a VARCHAR2 column and the target is a SQL Server database with a TEXT column, trailing white spaces on the source are trimmed on the target by default. If your downstream applications depend on the character count or byte length of these values, that trailing white space, or the lack of it, will produce different output depending on which database the application points to.
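To make this concrete, here's a minimal Python sketch, with made-up values, of how a raw equality check flags a mismatch purely because of trailing white space, while a trim-aware comparison treats the values as equal:

```python
# Minimal sketch: the same logical value can compare as "different" purely
# because one endpoint preserved trailing spaces and the other trimmed them.
source_value = "ACME Corp   "   # e.g., a source column that preserves padding
target_value = "ACME Corp"      # e.g., the value trimmed on the target side

print(source_value == target_value)                     # False: raw compare flags a mismatch
print(source_value.rstrip() == target_value.rstrip())   # True: trim-aware compare treats them as equal
```

Whether your validation tool trims before comparing is a policy choice; the important thing is that it exposes the choice rather than silently reporting false mismatches (or hiding real ones).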
Another example is differences in the scale and precision supported by database drivers, which can result in different degrees of rounding between source and target column values. For customers and applications that are sensitive to rounding differences, preserving scale and precision is an important requirement for reporting accuracy.
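A similar effect shows up with numeric scale. Below is a small, illustrative Python sketch using the decimal module; the amounts and the scale of two decimal places are assumptions, not values from any specific driver:

```python
from decimal import Decimal, ROUND_HALF_UP

# Sketch: a value delivered through a driver that only preserves 2 decimal
# places no longer matches the source exactly, even though both represent
# "the same" amount to the business.
source_amount = Decimal("1234.56789")   # source column with higher precision
target_amount = Decimal("1234.57")      # target value rounded to scale 2

print(source_amount == target_amount)   # False: exact compare reports a mismatch

# A scale-aware compare quantizes both sides to the lower scale before checking.
scale = Decimal("0.01")
print(source_amount.quantize(scale, rounding=ROUND_HALF_UP) == target_amount)  # True
```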
No PK/UI
When replicating changes from a source to a target database, you need a primary key (PK) or unique index (UI) to uniquely identify the row(s) changed by each operation, or set of operations, within a transaction so that the changes can be applied consistently to the destination database.
While some replication solutions can handle replication of tables without a PK/UI, those solutions are upfront about the possibility of inconsistent updates due to the lack of a unique identifier for the row being transacted on. That means there's less certainty that the row changes were accurately propagated to the destination database.
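The sketch below, using hypothetical rows, illustrates the ambiguity: a keyed apply maps a captured change to exactly one target row, while a keyless apply matched on the row's before image can hit several identical rows:

```python
# Sketch: applying a captured UPDATE on the target without a primary key.
# With a PK, the change maps to exactly one row; without one, the "before
# image" may match several identical rows and the apply becomes ambiguous.
target_rows = [
    {"id": 1, "name": "Alice", "city": "Austin"},
    {"id": 2, "name": "Alice", "city": "Austin"},   # duplicate except for id
]

# Keyed apply: unambiguous.
keyed_matches = [r for r in target_rows if r["id"] == 1]
print(len(keyed_matches))   # 1

# Keyless apply: the before image matches both rows. Which one should change?
before_image = {"name": "Alice", "city": "Austin"}
keyless_matches = [r for r in target_rows
                   if all(r[col] == val for col, val in before_image.items())]
print(len(keyless_matches))  # 2
```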
Table Structure Differences
Table structure differences arise when schema changes are not automatically applied to the destination databases. For example, when a column is added by an application user but isn’t automatically propagated to the destination database, subsequent changes involving the newly added column are not captured as part of ongoing replication. This leaves your source and target databases out of sync. It’s the same with a dropped column; if your data replication solution can’t automatically adapt to table structure changes, your source and target will almost always be out of sync from the moment these schema changes occur.
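One way to catch this drift early is to diff the column metadata of both endpoints. The Python sketch below is illustrative only; the column names are hypothetical, and in practice you'd pull the sets from each database's catalog or information schema views:

```python
# Sketch: detecting table-structure drift by diffing column metadata pulled
# from each endpoint's catalog (the column names here are illustrative).
source_columns = {"id", "name", "city", "loyalty_tier"}   # column added on the source
target_columns = {"id", "name", "city"}

missing_on_target = source_columns - target_columns   # added on source, never propagated
extra_on_target = target_columns - source_columns     # dropped on source, still on target

if missing_on_target or extra_on_target:
    print(f"Schema drift detected: missing on target={missing_on_target}, "
          f"extra on target={extra_on_target}")
```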
Missed Transactions
Mature data replication solutions have guardrails in place to prevent missing transactions and data. Once in a while, though, user error causes an incorrect point of recovery after a failure, an incorrect reset of the checkpoint from which changes are expected to be replicated, or an inconsistent reconciliation point between the initial data load and ongoing replication. Also, certain databases, notably SQL Server, truncate their transaction logs periodically, and if your replication solution isn't current at the point of truncation, there's a high likelihood of missed transactions for log-based change data capture (CDC) solutions.
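For log-based CDC in particular, one simple safeguard is to confirm that the saved checkpoint still falls within the retained portion of the log before resuming. The sketch below is purely illustrative; the positions are made-up numbers, not real log sequence formats:

```python
# Illustrative sketch: before resuming, a log-based CDC reader should verify
# that its saved checkpoint still falls inside the retained portion of the
# transaction log; otherwise transactions have likely been truncated away.
oldest_retained_position = 10_500      # made-up positions, not real LSN values
replication_checkpoint = 10_200

if replication_checkpoint < oldest_retained_position:
    print("Checkpoint predates the retained log: transactions may have been "
          "missed; a resync or targeted validation is needed.")
else:
    print("Checkpoint is within the retained log; safe to resume.")
```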
As you dig through various data movement use cases, there are many reasons why source and target databases could fall out of sync. Regardless of the cause of the sync issue, your replication solution needs to be able to give you confidence in the quality of your data. So, let’s explore data validation options available in popular replication solutions.
Keep in mind, as a customer, your unique needs require unique solutions. I’m just giving you food for thought when looking at your replication vendors and their capabilities around data validation.
Built-in
It’s imperative that your replication solution provider offers a data validation solution that’s built into core replication. Requiring a separate purchase simply to validate data that the solution should have moved with zero issues in the first place defies realistic expectations.
Some popular data replication solutions require you to purchase a separate data validation solution that’s difficult to set up, more expensive than replication itself, and, most importantly, requires separate administration and maintenance. Solutions like this incur a high total cost of ownership, a long time to market, and friction at the time of adoption.
Zero Impact
Application databases transacting with end-users directly impact your bottom line. When a database under critical transaction loads is being replicated into your enterprise data lakes for real-time data delivery, your replication solution should be able to perform data validation without taking the database/replication offline.
This capability requires innovative features that can account for changed data that hasn’t yet been applied at the target, and that can clearly distinguish data that’s genuinely out of sync from data that’s still being transacted on. In short, your data validation solution should support “online” data validation.
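One common way to approximate online validation is a two-pass compare: rows that mismatch on the first pass are rechecked after the replication lag window, so in-flight changes aren't reported as corruption. The sketch below assumes a hypothetical keyed lookup, fetch_row, supplied by the caller:

```python
import time

# Sketch of "online" validation: rows that mismatch on a first pass are
# re-checked after the replication lag window, so changes still in flight
# are not reported as corruption. fetch_row is a hypothetical callback that
# performs a keyed lookup against "source" or "target".
def validate_online(keys, fetch_row, lag_seconds=30):
    suspects = [k for k in keys
                if fetch_row("source", k) != fetch_row("target", k)]
    if not suspects:
        return []
    time.sleep(lag_seconds)            # give replication time to apply in-flight changes
    return [k for k in suspects        # only persistent mismatches are real
            if fetch_row("source", k) != fetch_row("target", k)]
```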
Scheduling
At the heart of any data validation query is a data retrieval executed on the source and target databases, with some level of filtering, stitching, and sorting to be able to check for any data differences across endpoints.
By their nature, these activities consume resources on the source and target databases. As the user, you have to find a good time to schedule them in order to minimize the impact on database resources.
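A simple way to bound that impact is to gate validation runs to an off-peak window. The sketch below uses illustrative window boundaries; a real schedule would come from your own workload profile:

```python
from datetime import datetime, time as dtime

# Sketch: gate a resource-heavy validation pass to an off-peak window so the
# filtering/sorting work doesn't compete with production load.
OFF_PEAK_START = dtime(1, 0)    # 01:00, illustrative
OFF_PEAK_END = dtime(5, 0)      # 05:00, illustrative

def in_off_peak_window(now=None):
    now = (now or datetime.now()).time()
    return OFF_PEAK_START <= now <= OFF_PEAK_END

if in_off_peak_window():
    print("Running validation pass")    # placeholder for the actual compare job
else:
    print("Deferring validation until the off-peak window")
```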
Recovery and Restart Ability
When replicating tables in the order of terabytes and verifying data validity across WANs, it’s imperative that your data validation solution is able to recover and restart with the least amount of overhead.
For example, when large tables are being compared across a WAN for hours at a time and an unexpected network outage occurs, your replication vendor should support stateful data validation jobs so that recovery and restart are far more efficient than with stateless data validation routines.
This capability is especially valuable when network bandwidth is at a premium or connectivity is intermittent.
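A common way to get that restart ability is to compare a large table in key-ranged chunks and persist the last completed boundary, so an outage only costs the in-flight chunk. The sketch below is illustrative; compare_chunk is a hypothetical callback for the actual source/target comparison, and the state file name is made up:

```python
import json
import os

# Sketch: stateful validation compares a large table in key-ranged chunks and
# records the last completed boundary, so a restart resumes instead of
# reprocessing the whole table. compare_chunk is a hypothetical callback.
STATE_FILE = "validation_state.json"   # illustrative location for checkpoint state

def load_checkpoint():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)["last_key"]
    return 0

def save_checkpoint(last_key):
    with open(STATE_FILE, "w") as f:
        json.dump({"last_key": last_key}, f)

def validate_in_chunks(max_key, chunk_size, compare_chunk):
    start = load_checkpoint()                      # resume where we left off
    for lower in range(start, max_key, chunk_size):
        upper = min(lower + chunk_size, max_key)
        compare_chunk(lower, upper)                # compare rows with keys in [lower, upper)
        save_checkpoint(upper)                     # an outage here loses at most one chunk
```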
Options Matter
Finally, options and details matter. While data replication solutions sometimes offer data validation capabilities, they typically offer only a bulk option: a checksum on both the source and the target and a binary answer of “yes, in sync” or “no, not in sync.” When databases are in sync, this bulk option is quite handy for a quick check, but when they’re not, how do you determine which rows are out of sync, why they’re out of sync, and how to fix them?
In your data validation solution, look for both a quick bulk (brute-force) option and a slower, row-by-row surgical option. When you can compare one row at a time and clearly identify out-of-sync rows and mismatched columns, your solution provider should also be able to surgically repair those rows across heterogeneous databases.
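Conceptually, the two modes complement each other: a fast bulk checksum answers “in sync or not,” and a row-by-row diff pinpoints the keys and columns that need repair. The sketch below uses in-memory dicts keyed by primary key as stand-ins for the two endpoints:

```python
import hashlib

# Sketch: a quick bulk check first, then a surgical row-by-row diff only when
# the bulk check fails. Rows are dicts keyed by primary key; purely illustrative.
def table_checksum(rows):
    digest = hashlib.sha256()
    for key in sorted(rows):
        digest.update(repr((key, sorted(rows[key].items()))).encode())
    return digest.hexdigest()

def row_by_row_diff(source_rows, target_rows):
    for key in sorted(set(source_rows) | set(target_rows)):
        src, tgt = source_rows.get(key), target_rows.get(key)
        if src != tgt:
            cols = src or tgt
            bad_cols = [c for c in cols if (src or {}).get(c) != (tgt or {}).get(c)]
            yield key, bad_cols        # enough detail to repair the row surgically

source = {1: {"name": "Alice", "city": "Austin"}, 2: {"name": "Bob", "city": "Boise"}}
target = {1: {"name": "Alice", "city": "Austin"}, 2: {"name": "Bob", "city": "Dallas"}}

if table_checksum(source) != table_checksum(target):     # fast "in sync?" answer
    print(list(row_by_row_diff(source, target)))          # [(2, ['city'])]
```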
There are also solutions that approach data validation as more of a metadata comparison, for example, a summary report of row counts, last transaction timestamps, and the last user accessed for the source and target. While such reports are interesting and informative, they don’t save you time. When many rows are out of sync, you still have to find the proverbial needle in many haystacks.
So, no matter how robust your data replication solution is, there will be cases where data integrity during data movement can be compromised.
Data that isn’t trusted isn’t used, and data fidelity is essential to helping your business trust the data you provision for them. That means you need to evaluate replication solutions that offer a built-in, robust, comprehensive, and in-depth feature set.