While few technologies sit still, Data Architecture is especially dynamic. Open source innovators and vendors continue to create new Data Lake, streaming and Cloud options for today’s enterprise architects and CIOs. With business requirements also evolving, placing solid strategic bets has rarely been more difficult.
It’s therefore no surprise that enterprise IT organizations keep re-thinking their choice of Data Platforms and Data Integration methods. Effectively managing data today requires applying consistent principles to a fast-changing set of options.
Common Data Integration Use Cases
To understand the nature of what is changing, let’s first consider the most common Data Integration use cases, and the enterprise motivations for addressing each. Most projects entail at least two of the following four use cases:
- Data Lake Ingestion: Data Lakes based on HDFS or S3 have become a powerful complement to traditional Data Warehouses because they store and process larger volumes of a wider variety of data types.
- Cloud Migration: resource elasticity, cost savings and reduced security concerns have made the cloud a common platform for analytics.
- Database Transaction Streaming: businesses need to capture perishable data value, and streaming new and changed business records to analytics engines in real time makes this possible (a minimal sketch of such a change stream follows this list). In addition, sending incremental updates in this fashion eliminates the need for batch loads that disrupt production.
- Data Extraction and Loading from Production Sources: revenue, supply chain and other operational data from production systems such as SAP, Oracle and mainframes hold a mountain of potential analytics value. This is especially true when the data is analyzed on external platforms and mixed with external data such as clickstream or social media trends.
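To make the database streaming use case concrete, below is a minimal sketch of the kind of change event a CDC tool publishes to Kafka. It assumes the kafka-python client, and the broker, topic, table and field names are hypothetical; in practice the CDC software captures these events from the database log rather than from application code.

```python
# A minimal sketch of a CDC-style change event published to Kafka.
# Assumptions: kafka-python client; hypothetical broker, topic and field names.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

change_event = {
    "table": "ORDERS",
    "op": "UPDATE",  # INSERT, UPDATE or DELETE
    "captured_at": datetime.now(timezone.utc).isoformat(),
    "before": {"order_id": 1001, "status": "OPEN"},
    "after": {"order_id": 1001, "status": "SHIPPED"},
}

# Keying by table name keeps each table's changes ordered within a partition.
producer.send("cdc.orders", key=b"ORDERS", value=change_event)
producer.flush()
```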
Take the example of a managed health services provider that illustrates several of these use cases: Data Lake ingestion, database transaction streaming and production database extraction. Using Change Data Capture (CDC) software, this provider non-disruptively copies millions of live records each day as they are inserted or updated on its production iSeries system. It then publishes these records in real time from the DB2 iSeries database to Kafka, which feeds the company's Cloudera Kudu columnar database through Flume for high-performance reporting and analytics.
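The Kudu-facing end of such a pipeline can be sketched roughly as follows. In the provider's architecture Flume handles this hop, so the hand-written consumer below is only a stand-in to show the pattern; it assumes the kafka-python and Apache Kudu Python clients, and the topic, table and column names are hypothetical.

```python
# A simplified stand-in for the Kafka-to-Kudu hop: consume change events and
# apply them to Kudu as upserts. Hypothetical topic, table and column names.
import json

import kudu                      # pip install kudu-python
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "cdc.orders",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

client = kudu.connect(host="kudu-master", port=7051)
table = client.table("orders_reporting")  # assumes columns match the events
session = client.new_session()

for message in consumer:
    event = message.value
    if event["op"] == "DELETE":
        # Deletes only need the primary key.
        session.apply(table.new_delete({"order_id": event["before"]["order_id"]}))
    else:
        # Upserts keep the reporting table current for inserts and updates alike.
        session.apply(table.new_upsert(event["after"]))
    session.flush()
```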
This provider might also decide to run analytics on Hive – which effectively serves as a SQL Data Warehouse within its Data Lake – depending on what it learns about the analytics workload behavior. So while the company has already made several architectural changes to put its new data pipeline in place, more changes are likely.
Another example is a Fortune 500 food maker that is addressing all four of the use cases above. This company is feeding a new Hadoop Data Lake based on the Hortonworks Data Platform. It copies an average of 40 SAP record changes every five seconds, decoding that data from complex SAP pool and cluster source tables. Its CDC software injects this data stream, along with periodic metadata and DDL changes, into a Kafka message queue that feeds HDFS and HBase consumers subscribed to the relevant message topics.
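The consumer side of that hand-off might look roughly like the sketch below, which subscribes to a few topics and writes order rows into HBase. It assumes the kafka-python and happybase clients, and the topic names, HBase table and column family are hypothetical; it illustrates the pattern rather than the company's implementation.

```python
# A minimal sketch of an HBase-facing consumer subscribed to relevant topics.
# Hypothetical topic names, HBase table and column family.
import json

import happybase                 # pip install happybase (HBase Thrift client)
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "sap.orders", "sap.deliveries",   # data topics
    "sap.metadata",                   # periodic metadata and DDL changes
    bootstrap_servers="broker:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

hbase = happybase.Connection("hbase-thrift-host")
orders = hbase.table("purchase_orders")

for message in consumer:
    if message.topic == "sap.metadata":
        continue  # schema changes would be handled by a separate process
    event = message.value
    # Row key by order number; each field lands in the "d" column family.
    row_key = str(event["order_id"]).encode("utf-8")
    orders.put(row_key, {
        b"d:" + k.encode("utf-8"): str(v).encode("utf-8")
        for k, v in event.items()
    })
```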
Once the data arrives in HDFS and HBase, Spark in-memory processing helps match orders to production in real time and maintain referential integrity for purchase order tables within HBase and Hive. As a result, this company has accelerated sales and product delivery with accurate real-time operational reporting. But again, change is continuous: the company is now moving its Data Lake to an Azure cloud environment to improve efficiency and reduce cost.
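The order-to-production matching step can be pictured as a Spark Structured Streaming job along the lines of the sketch below. The topic, Hive table and column names are assumptions for illustration, and the Kafka source requires the spark-sql-kafka connector package; the company's actual jobs and schemas are not described at this level of detail.

```python
# A minimal sketch of matching streaming order changes against production data
# with Spark Structured Streaming. Hypothetical topic, table and column names.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("order-production-matching").getOrCreate()

order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("material", StringType()),
    StructField("quantity", StringType()),
])

# Streaming side: order changes arriving from the CDC feed through Kafka.
orders = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "sap.orders")
    .load()
    .select(from_json(col("value").cast("string"), order_schema).alias("o"))
    .select("o.*")
)

# Static side: production runs already landed in the Data Lake as a Hive table.
production = spark.table("production_runs")

# A stream-static join keeps orders matched to production as changes arrive.
matched = orders.join(production, on="material", how="left")

query = matched.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```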
Guiding Principles for Changing Environments
These companies, like many others, are making strategic choices about Data Platforms and Data Integration methods based on the best information available at each point in time. To navigate complex and changing options, they are adopting five consistent guiding principles.
- Test small workloads and datasets to start. You cannot know how well Kudu or Hive will support your queries until you run them. Once you do, you might decide to move to an alternative. By testing new workloads on new platforms before migrating wholesale, you can avoid sunk costs down the road.
- Maintain platform flexibility with your Data Integration solutions. The inevitable trial-and-error process creates a need for flexibility in how data is integrated. It pays to invest upfront in consistent Data Integration processes that can dynamically add or remove endpoints.
- Reduce developer dependency. Rising Data Integration demands, and the frequent need for change, run the risk of overburdening ETL programmers and making them a bottleneck. The more enterprises automate, the more they empower architects and database administrators to integrate data quickly without programmers.
- Consider multi-stage pipelines. The most effective Data Lakes are really becoming canals, with locks that transform data in sequential stages to prepare it for analytics. If need be, you can “rewind” to earlier stages to change course or correct errors (a small sketch of this staged approach follows the list).
- Keep your Data Warehouse. Many organizations are starting to treat their Data Lakes as transformation areas that siphon prepared data sets into traditional Data Warehouses for structured analysis. The Data Warehouse's ACID-compliant structures often yield the most effective analytics results.
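To illustrate the multi-stage idea, here is a minimal file-based sketch in which each stage reads only the previous stage's output, so a later stage can be rerun (“rewound”) without touching the raw data. The zone names, file layout and transforms are hypothetical.

```python
# A minimal sketch of staged ("canal lock") processing: each stage reads the
# previous stage's output and writes its own, so any stage can be rerun from
# the one before it. Hypothetical zone paths, file layout and transforms.
import json
from pathlib import Path

RAW = Path("lake/raw")          # records as landed, never modified
REFINED = Path("lake/refined")  # cleansed records
CURATED = Path("lake/curated")  # analytics-ready aggregates

def refine() -> None:
    """Drop incomplete records; raw files stay untouched."""
    REFINED.mkdir(parents=True, exist_ok=True)
    for src in RAW.glob("*.json"):
        records = json.loads(src.read_text())
        cleaned = [r for r in records if r.get("order_id") is not None]
        (REFINED / src.name).write_text(json.dumps(cleaned))

def curate() -> None:
    """Aggregate refined records into per-material totals."""
    CURATED.mkdir(parents=True, exist_ok=True)
    for src in REFINED.glob("*.json"):
        totals = {}
        for r in json.loads(src.read_text()):
            totals[r["material"]] = totals.get(r["material"], 0) + int(r["quantity"])
        (CURATED / src.name).write_text(json.dumps(totals))

# To "rewind", rerun only from the stage whose logic or data needs to change.
refine()
curate()
```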
There is no decoder ring for architecting your data environment. But enterprises that follow these five principles are best positioned to cost-effectively achieve their desired analytics and business outcomes.