Data integration is becoming more difficult as the volume of data available to large organizations continues to grow. Business leaders clearly understand that their data is of critical value, but the volume, velocity, and variety of data available today are daunting. Faced with these challenges, companies are looking for a scalable, high-performing data integration approach to support a modern data architecture. The problem is that just as data integration grows more complex, the number of potential solutions seems endless. From DIY products built by an army of developers to out-of-the-box solutions covering one or more use cases, it is difficult to navigate the myriad choices and the decision tree that follows.
Many questions arise in the process, such as:
- How do I keep my total cost of ownership (TCO) down as I modernize?
- Does my proposed solution offer me the functionality I need now? What about two years from now? Five years?
- Will my system offer data reliability and data quality?
- How easy will this new architecture be to manage?
- How quickly can employees be onboarded?
- Will my new approach help us arrive at mission-critical business objectives with more efficiency and speed?
- How can I simplify the complexity of my current system without sacrificing performance?
As you look for a more modern, streamlined approach to data integration, keep these 10 criteria in mind when evaluating solutions.
1. Support for multiple sources and targets: If your source or target changes as your use cases evolve over time, you should be able to build on top of your current solution using the same platform, same user interface, and same team of people with ease and scalability. Not all modern source systems can seamlessly connect with your data integration tools. Make sure that you find a solution with a large library of pre-configured sources and targets, and the capability to connect to new sources quickly.
2. Exactly-once semantics: Capturing each record exactly once, neither dropping nor duplicating it, is an often under-appreciated but very important component of a data integration solution. Exactly-once is difficult to achieve and is often overlooked by organizations that don’t need it at the moment. Say you are tracking website views arriving at 1,000,000 views per second. If you lose 1% of those views because your pipeline lacks exactly-once semantics, it may not cause a crisis. But if you are a bank tracking malicious transactions and you only catch them 99.9% of the time, you will inevitably face the consequences from unhappy customers.
When data pipelines break and you have to go back in time to work out what needs to be recaptured and from when, exactly-once semantics are an incredible asset for ensuring data accuracy and reliability.
Not every solution can guarantee exactly-once, especially end to end, so look for one that provides an exactly-once guarantee to future-proof your architecture. Implemented correctly, it ensures that your data and analytics teams are looking at reliable, accurate data and making decisions based on the full picture rather than speculation. A minimal sketch of the underlying pattern follows.
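As a rough illustration only, here is a minimal sketch of the read-process-write pattern that commercial tools implement for you, using Kafka transactions via the confluent-kafka Python client. The broker address, topic names, and group/transactional IDs are hypothetical; the point is that output records and consumer offsets commit atomically, so a crash and restart cannot produce duplicates downstream.

```python
# Minimal read-process-write loop with Kafka transactions (confluent-kafka).
# Offsets are committed in the same transaction as the output records, so a
# crash and restart never produces duplicates downstream.
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",      # hypothetical broker address
    "group.id": "payments-pipeline",
    "isolation.level": "read_committed",     # only read committed records
    "enable.auto.commit": False,             # offsets travel with the transaction
})
producer = Producer({
    "bootstrap.servers": "broker:9092",
    "transactional.id": "payments-pipeline-1",  # stable id fences zombie producers
})

producer.init_transactions()
consumer.subscribe(["raw_transactions"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    producer.produce("screened_transactions", msg.value())   # "process" step elided
    # Commit the consumer's position atomically with the produced output.
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata())
    producer.commit_transaction()
```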
3. Modern, efficient change data capture: A streaming architecture is not complete without change data capture (CDC), a methodology rather than a technology. CDC is a low-overhead, low-latency way of extracting only the changes to the data, continuously ingesting and replicating them while limiting intrusion into the source. There are many ways to implement CDC effectively depending on the use case, such as log parsing, logical decoding, and triggers, so make sure your solution can perform CDC in several of these ways across multiple sources, also known as a multi-modal CDC approach; a small log-based example follows.
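For a sense of what log-based CDC looks like at the lowest level, here is a sketch using PostgreSQL logical decoding through psycopg2. The connection string, slot name, and use of the built-in test_decoding plugin are assumptions for illustration; a multi-modal CDC product wraps this kind of mechanism (and its equivalents for other databases) behind a single interface.

```python
# Log-based CDC sketch: read row changes from PostgreSQL's write-ahead log
# via logical decoding, without touching the source tables themselves.
# Requires wal_level=logical and replication privileges on the source.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=replicator")   # hypothetical connection
conn.autocommit = True
cur = conn.cursor()

# One-time setup: a logical replication slot using the built-in test_decoding plugin.
cur.execute(
    "SELECT pg_create_logical_replication_slot('cdc_slot', 'test_decoding');"
)

# Poll the slot; each row describes an INSERT/UPDATE/DELETE captured from the log.
cur.execute(
    "SELECT lsn, data FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL);"
)
for lsn, change in cur.fetchall():
    print(lsn, change)    # a real pipeline would transform and ship these changes
```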
4. ETL designer: If your workflow requires not only simple replication but also joins, aggregations, lookups, and transformations, you should be able to build pipelines by dragging and dropping in an ETL designer, which drives scalability and flexibility. Build pipelines quickly, apply the appropriate functions, change them as needed, and easily replicate your work in other areas of your architecture. A well-built ETL designer also means faster onboarding and execution for your team; the snippet below shows the kind of logic such a designer expresses visually.
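Purely for illustration, here is the sort of join-and-aggregate step a visual designer would represent as a few connected boxes, written in Python with pandas. The file names and columns are hypothetical.

```python
# A lookup/join plus an aggregation, the building blocks of a typical ETL flow.
import pandas as pd

orders = pd.read_csv("orders.csv")        # order_id, customer_id, amount
customers = pd.read_csv("customers.csv")  # customer_id, region

# Lookup/join: enrich each order with the customer's region.
enriched = orders.merge(customers, on="customer_id", how="left")

# Aggregation: total order amount per region.
summary = enriched.groupby("region", as_index=False)["amount"].sum()
print(summary)
```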
5. Ease of use and no-code UI: You should have a very intuitive user interface (UI), a single pane of glass in which you can tackle multiple use cases. If you start today with Oracle-to-SQL replication and your next use case is DB2 to Snowflake, you should be able to use the same UI again without having to train more people. Additionally, with multiple capabilities under one platform (i.e., streaming ETL and ELT, CDC, batch ETL and ELT), you are future-proofed as new use cases arise.
6. Semi-structured data parsing for downstream application consumption: If you have JSON or XML data embedded in your database, you should be able to flatten that structure and pull out the required column values so that the data is easily consumed by downstream applications, as in the short example below.
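As a simple illustration of the flattening step itself, this sketch normalizes a nested JSON column into flat relational columns with pandas. The column and field names are hypothetical.

```python
# Flatten a nested JSON column into plain relational columns so downstream
# tools can query it directly.
import json
import pandas as pd

rows = [
    {"order_id": 1, "payload": json.dumps({"customer": {"id": 42, "tier": "gold"}, "total": 99.5})},
    {"order_id": 2, "payload": json.dumps({"customer": {"id": 7, "tier": "silver"}, "total": 12.0})},
]

df = pd.DataFrame(rows)
# Parse the JSON strings, then expand nested fields into columns
# (customer.id, customer.tier, total).
flattened = pd.json_normalize(df["payload"].apply(json.loads).tolist())
result = pd.concat([df[["order_id"]], flattened], axis=1)
print(result)
```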
7. Streaming sources, not just streaming targets: Systems such as Kafka, Kinesis, and Event Hubs should be treated as viable sources within your architecture and be easily accessible to your data integration solution. You will want the ability to take data from streaming sources and move it to your eventual targets; a brief example follows.
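To make the point concrete, here is a sketch of Kafka as a first-class source feeding a relational target in micro-batches, using the confluent-kafka client and SQLite as a stand-in warehouse. The broker, topic, table, and batch size are all assumptions.

```python
# Treat a Kafka topic as a source: consume click events and micro-batch them
# into a relational target.
import json
import sqlite3
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",   # hypothetical broker
    "group.id": "clickstream-loader",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["click_events"])

target = sqlite3.connect("warehouse.db")
target.execute("CREATE TABLE IF NOT EXISTS clicks (user_id TEXT, url TEXT, ts TEXT)")

batch = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    batch.append((event["user_id"], event["url"], event["ts"]))
    if len(batch) >= 500:                       # flush in micro-batches
        target.executemany("INSERT INTO clicks VALUES (?, ?, ?)", batch)
        target.commit()
        consumer.commit(asynchronous=False)     # commit offsets after the load
        batch.clear()
```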
8. Scalability and high availability built in: You should be able to scale linearly, simply adding new nodes to accommodate increased workloads. Critical components of the solution should not have a single point of failure; there should be multiple instances of each, so that if one goes down the system can recover and heal itself. This is vital from an enterprise operations point of view.
9. On-premises, cloud, hybrid, or SaaS deployment: The choice of deployment model should be up to you, not your vendor. Look for solutions with multiple offerings that best meet your data and your organization’s needs for data privacy, connectivity, functionality, and budget.
10. Multi-tenancy: Using the same resource pool of your cluster, you should be able to logically separate sources and targets for the workloads that require it. With sensitive data, often not every member of an organization should see that data in its full form, and the ability to create job-based data silos maintains data privacy. For example, with payment card industry (PCI) data, only those who really need to see the data should be able to do so.
Some solutions force users to spin up multiple instances to create multiple tenants, duplicating environment management and adding resources. Look for a solution that lets a system administrator create tenancies for different lines of business and users from the underlying resources of the cluster, rather than from multiple instances.
Using these 10 essential criteria as a checklist when evaluating data integration solutions will help your organization make the right choice and implement a system that lets you fully use all the data available to you and move the business forward.