An emerging discipline called DataOps takes a page out of the DevOps playbook. DevOps is all about accelerating software delivery; DataOps is about accelerating data flow. And just as DevOps replaces the waterfall method of delivering software, DataOps replaces the waterfall approach to delivering data to the data consumer.
DataOps was first introduced in Gartner’s 2018 Hype Cycle for Data Management. Gartner describes it as:
“A collaborative Data Management practice focused on improving the communication, integration and automation of data flows between data managers and consumers across an organization. The goal of DataOps is to create predictable delivery and change management of data, data models and related artifacts. DataOps uses technology to automate data delivery with the appropriate levels of security, quality and metadata to improve the use and value of data in a dynamic environment.”
The concept is still evolving, Gartner says, but it really is “a new way of working and collaborating,” according to a blog by Gartner research VP Nick Heudecker. He explained further that there are no standards or frameworks for DataOps yet, and that best practices and next practices have yet to emerge to flesh out the concept.
Currently there is a less than one percent adoption rate for DataOps as a practice, noted Dan Potter, data expert and VP of Product Marketing at Attunity, a data integration and big data management company. “DataOps is not a technology but it requires the right technology, people, and processes to accelerate time to delivery and agility to respond,” he said.
When it comes to people, DataOps extends to anyone who handles data, connecting across silos and roles such as data engineers, data scientists, analysts, and business users.
As it relates to data processes, Eckerson Group said that building, changing, testing, deploying, running, and tracking new and modified functionality all are part of data pipeline management.
“It also needs to manage all the artifacts — code, data, metadata, scripts, metrics, dimensions, hierarchies, etc., that these processes generate,” Eckerson said. “And it needs to coordinate the data technologies and provision and monitor development, test, and production processes.”
The Integration Core of DataOps
The five key technology attributes that lie behind DataOps requirements, as Potter sees it, are all connected to modern integration. They are:
- Continuous
- Universal
- Automation
- Agility
- Trust
“There are increasing bottlenecks between data and having to shape and format it so that those analyzing it can make decisions,” Potter said.
Addressing those bottlenecks requires continuous integration, making the move from batch orientation to real-time data flow for shorter cycles. It encompasses the overall task spectrum of supporting real-time streams of data and metadata, capturing changes from transaction logs via agents with minimal impact on source systems, optimized for every source and target.
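To picture what that looks like in practice, here is a minimal sketch of log-based change data capture in Python. It is an illustration under simplifying assumptions, not Attunity's engine: the `read_transaction_log` generator and the `orders` table are hypothetical stand-ins for a real CDC agent that tails a database's transaction log so source tables are never queried directly.

```python
from dataclasses import dataclass
from typing import Dict, Iterator


@dataclass
class ChangeEvent:
    """One row-level change captured from a source transaction log."""
    op: str          # "insert", "update", or "delete"
    table: str
    key: int
    row: Dict        # full row image for inserts/updates, empty for deletes


def read_transaction_log() -> Iterator[ChangeEvent]:
    """Hypothetical stand-in for a log-based CDC agent.

    A real agent tails the database's redo/transaction log, so changes are
    captured with minimal impact on the source system.
    """
    yield ChangeEvent("insert", "orders", 1, {"id": 1, "amount": 120.0})
    yield ChangeEvent("update", "orders", 1, {"id": 1, "amount": 95.0})
    yield ChangeEvent("delete", "orders", 1, {})


def apply_to_target(target: Dict[int, Dict], event: ChangeEvent) -> None:
    """Apply each change to the target as it arrives: continuous, not batch."""
    if event.op == "delete":
        target.pop(event.key, None)
    else:
        target[event.key] = event.row


if __name__ == "__main__":
    warehouse_orders: Dict[int, Dict] = {}
    for event in read_transaction_log():
        apply_to_target(warehouse_orders, event)
        print(event.op, warehouse_orders)
```

The point of the pattern is that the target is updated event by event as transactions commit, rather than waiting for a nightly batch extract.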
“Data needs to flow and to be integrated where and when it’s needed,” with the help of technology that supports a wide variety of source systems and targets — that is, it’s universal, he explained.
Shorter iterative cycles occur when automation removes as much scripting as possible, so that multistage data processing can be set up and adapted without coding.
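One common way to eliminate that scripting is to make the pipeline itself declarative: stages are named in configuration and a generic runner executes them. The sketch below is a generic illustration of that idea with assumed stage names, not any vendor's implementation.

```python
from typing import Callable, Dict, List

# A registry of reusable stage implementations; in a real tool these would be
# built-in operators rather than hand-written scripts.
STAGES: Dict[str, Callable[[List[dict]], List[dict]]] = {
    "drop_nulls": lambda rows: [r for r in rows if all(v is not None for v in r.values())],
    "uppercase_country": lambda rows: [{**r, "country": r["country"].upper()} for r in rows],
}

# The pipeline itself is configuration, not code: reordering or extending it
# requires no scripting.
PIPELINE = ["drop_nulls", "uppercase_country"]


def run_pipeline(rows: List[dict], pipeline: List[str]) -> List[dict]:
    """Execute each configured stage in order."""
    for stage_name in pipeline:
        rows = STAGES[stage_name](rows)
    return rows


if __name__ == "__main__":
    raw = [{"id": 1, "country": "us"}, {"id": 2, "country": None}]
    print(run_pipeline(raw, PIPELINE))  # [{'id': 1, 'country': 'US'}]
```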
Agility relates to users being sure that they can move quickly to do the things they want to do. “So, as architectural stacks change, users can move quickly to use them, too,” Potter said. From on-premises to cloud, and data lakes joined with cloud data warehouses, architectures in motion are future-proofed.
To add new sources, a model-driven approach that accommodates easy input and the ability to apply changes is important. As source systems change, those changes are streamed in real time to data warehouses, data marts, and so on. “Metadata becomes real essential to the last mile,” Potter said. The trust component refers to the data consumer being able to find the right source of information, curated by IT or other users, and to be aware of the lineage of that data.
That is, they don’t want to lose the link between data provenance and the changes data goes through. It’s about making sure that users know where the data came from, how it was transformed, and its movement along the way.
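In code, that trust requirement amounts to carrying a provenance trail along with the data, so every transformation records where the data came from and what was done to it. The following is a minimal sketch using a made-up `Dataset` structure, not any particular catalog or vendor's metadata model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class Dataset:
    name: str
    rows: List[dict]
    lineage: List[str] = field(default_factory=list)   # provenance trail


def transform(ds: Dataset, new_name: str, fn: Callable[[dict], dict], description: str) -> Dataset:
    """Apply a transformation and append a lineage record describing it."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return Dataset(
        name=new_name,
        rows=[fn(r) for r in ds.rows],
        lineage=ds.lineage + [f"{stamp} {ds.name} -> {new_name}: {description}"],
    )


if __name__ == "__main__":
    source = Dataset("crm.contacts", [{"email": " A@B.COM "}], ["extracted from crm.contacts"])
    cleaned = transform(
        source,
        "contacts_clean",
        lambda r: {"email": r["email"].strip().lower()},
        "trimmed and lower-cased email",
    )
    for entry in cleaned.lineage:
        print(entry)   # who, what, and when for each step in the data's journey
```

A consumer reading `cleaned.lineage` can see where the data originated and how it was shaped, which is exactly the link between provenance and change that Potter describes.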
Taking Continuous Integration to the Max
Continuous integration has been at the heart of Attunity’s technology for the last decade, said Potter. Every time a transaction is updated in a system — ERP, for instance — the change can be moved into a data warehouse or lake. In March, the company announced Attunity Compose for Snowflake, which the company says automates the data warehouse lifecycle, combining real-time data integration with data warehouse automation.
Metadata has a critical role in all this, Potter said. Companies need to profile what the source is, detect when its metadata has changed to accommodate things like a table structure change, and propagate the updated source system metadata to the target systems created from it.
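That schema-change handling can be pictured as a comparison between the last-known source schema and the current one, with any difference forwarded to the target before new rows arrive. The snippet below is a simplified, hypothetical sketch of the comparison step; a real replication tool would also generate and run the corresponding DDL on the target.

```python
from typing import Dict


def diff_schema(known: Dict[str, str], current: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Compare column -> type maps and report added, removed, and retyped columns."""
    return {
        "added":   {c: t for c, t in current.items() if c not in known},
        "removed": {c: t for c, t in known.items() if c not in current},
        "retyped": {c: current[c] for c in known if c in current and known[c] != current[c]},
    }


if __name__ == "__main__":
    known_schema   = {"id": "int", "amount": "decimal(10,2)"}
    current_schema = {"id": "int", "amount": "decimal(12,2)", "currency": "char(3)"}

    changes = diff_schema(known_schema, current_schema)
    print(changes)
    # A real pipeline would now alter the target table and update its metadata
    # catalog before streaming the new rows through.
```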
And in recent years, Attunity has added a focus on applying automation on top of that: “We can land data into data lakes in real time, but now we can automatically structure it to be used for data analysis, so you don’t have to have ETL scripts to build in.”
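One small piece of that structuring step can be pictured as flattening the raw, nested records landed in the lake into a tabular, analytics-ready shape without a hand-written ETL script. The function below is a generic illustration of the idea with made-up field names, not the product's generated code.

```python
from typing import Dict, List


def flatten(record: dict, prefix: str = "") -> Dict[str, object]:
    """Turn nested JSON into flat column/value pairs (customer.address.city -> value)."""
    flat: Dict[str, object] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    landed = [{"id": 1, "customer": {"name": "Acme", "address": {"city": "Austin"}}}]
    table: List[Dict[str, object]] = [flatten(r) for r in landed]
    print(table)  # [{'id': 1, 'customer.name': 'Acme', 'customer.address.city': 'Austin'}]
```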
Generally, any DataOps use case demands automation from end to end.
“Organizations want to have the benefit of technology like Azure by automating what in the past has been a manual and script-intensive process of moving, merging, structuring, and provisioning data,” Potter said.
In the past, this has not been fast enough — or accurate enough — to keep up with business demands. “Users might have no confidence because they may be missing some data,” he said. “They aren’t realizing the benefits and promises of cloud data lakes and data warehouses.”
Gartner has noted, Potter said, that 90 percent of those who invested in first generation data lakes failed to realize expected benefits. “If data is not ready for analytics, the opportunity cost is huge.”
Potter cited a customer use case: a Fortune 500 insurance provider and Attunity customer that saw big benefits when it moved from hand-coding its traditional ETL scripts to an automated approach. The customer experienced an 80 percent reduction in implementation costs, realized a 95 percent faster time to market, and quantified a 500 percent increase in agility, Potter said.
Making changes quickly is important because change is inevitable, Potter noted. Data must be distributed to business users, augmented, and then delivered again. An automated pipeline is the route that gets it done.