Click to learn more about author Joe deBuzna.
Comparing your Data Across Multiple Source and Targets
When you have data stored in multiple applications and want to consolidate it into a central Data Warehouse or Data Lake, you need to make sure the data in the source and the target remain consistent. But while consolidating data seems like it should be easy, the process is deceptively difficult. Successfully keeping data synced requires you to consider a number of factors, including data types, character set types, and protection from human error.
Syncing Up Data Types
Data type conversions are difficult to manage. For example, converting dates from one format to another can get complicated because some technologies include a time component in a date data type, and others don’t. In addition, binary data cannot always be represented using identical data types. In order to ensure that you don’t lose any data and/or data value precision, make sure data types in your source and target are compatible.
Character Set Conversions
Character set conversions can be very complex. A dirty little secret is that some applications allow you to store incongruent character data. For example, if your database is in a Western European language using single-byte characters, it may allow you to store Chinese or Japanese double-byte characters. Although the application can convert between these character types, when you move the data to a data lake, the new tools processing the data may be unaware that the data is stored in a different format. Standardizing on an all-encompassing character set like Unicode for all data sets is a way to ensure that you can store all data irrespective of origin and character set.
Accounting for Human Error
Another issue to consider is the human factor. While the press likes to talk about the risks of hackers getting into data, the challenge is actually more mundane. Humans make mistakes. And while production databases have numerous safeguards to prevent end users from deleting data or making mistakes that can corrupt data, rarely do these safeguards entirely eliminate issues. Users can delete data without realizing it. You need to put in place protections to avoid loss of data for these accidental cases.
You have two ways to safeguard your data from human error. You can maintain copies of your data through traditional replication and through Big Data replication.
With traditional replication, you replicate all the data on the source (Point A) to the target (Point B). Whenever you do an insert or update, add a column, drop a column, or use data definition language (DDL) commands (e.g. create, modify, and remove database objects), the data is replicated over to the target. When selecting a data replication product, it is important to ask whether the tool has a data validation product that can validate the data, fix it if the data is out of sync and account for accidental data loss on the destination system.