You’re going to hear a lot more about DataOps in the coming months and the next couple of years. That’s the word from DataKitchen co-founder Eric Estabrooks.
You know something is gaining market traction when Gartner includes it in a hype cycle report. The research firm did just that for DataOps in its Hype Cycle for Data Management. As much as anything, Gartner taking notice is influencing executives to start asking if their business is doing DataOps, or why it’s not. “You don’t want to be caught flat-footed by that,” Estabrooks said during his presentation titled DataOps and Modern Data Architecture at the DATAVERSITY® Enterprise Data World Conference.
So, what should you know about DataOps, and why is it essential? What are its seven principles? How can you start implementing it in your business?
Estabrooks explained that the process-oriented methodology reduces cycle times for getting high-quality data into the hands of users for analytics. “It’s really just about getting things out faster to your customers because at the end of the day, data is not useful until someone’s drawing some value from it,” he said.
DataOps meshes ideas from Agile development, DevOps, and lean manufacturing to push data down the pipeline. Just as DevOps has transformed software development in recent years and “lean” has been transforming manufacturing for decades, DataOps is going to transform how data analytics works. People are going to start expecting insights faster, he said, and their expectations will be met:
“This is where the business asks for something — you want to be able to tell them, ‘Yeah, I’ll get that to you later today or we’ll put it in this week’s sprint.’ You can deliver value almost immediately.”
The Principles Behind the Process
Implementing DataOps is not so much about steps to take as principles to follow. “If you do everything, everything’s just better,” he said. The DataOps Manifesto put together by DataKitchen actually includes 17 principles, but the seven he explained during the presentation are a good starting point.
It starts with orchestrating two different journeys, making them happen simultaneously and seamlessly, to enable a quick turnaround in delivering high-quality data to analytics. “That lets you start to keep up with the business,” according to Estabrooks.
The first journey takes data to value, getting it ready for analytics in production-line fashion; the second builds on that to get a new idea into production quickly.
“If you’re not thinking about automation, reproducibility, central control of things, you lose the ability to get things from raw data into someone’s hands where they’re using it as quickly as possible.”
Estabrooks remarked that it’s important to think in terms of software development principles to achieve this. Is the code modular and testable? Is the code base shared? Is it under version control?
“New data sets come in, hit the production line, and value comes out,” he said, “but then you set up a system where your innovation, your ideas come; you can experiment and iterate very quickly; and then when that’s fully baked, you can push that up into production as well.”
And you can do it without compromising Data Quality or breaking anything.
If you can orchestrate these two journeys, you start to enable a very quick turnaround of high-quality data that you can deliver to analytics, he noted.
Now, the innovation pipeline takes up where the value pipeline — which is all about Data Quality (what it looks like, if it’s doing what it’s supposed to be doing) — leaves off. With the innovation pipeline, you’re starting to think about things more akin to regression testing. “I implement some new code that’s doing a transformation or maybe a calculated measure or a fact table. Do I have good enough tests where I know people working around me on different parts of the code can get through — that I can make changes without breaking anything else anyone has done,” Estabrooks further explained.
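A minimal sketch of what such a regression test could look like (the transformation, column names, and values here are illustrative, not from the presentation):

```python
import pandas as pd

def add_revenue_per_unit(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation: derive a calculated measure from two columns."""
    out = df.copy()
    out["revenue_per_unit"] = out["revenue"] / out["units"]
    return out

def test_add_revenue_per_unit():
    # Regression guard: the new measure is correct and the existing columns are
    # untouched, so colleagues working on other parts of the code are not broken.
    df = pd.DataFrame({"revenue": [100.0, 250.0], "units": [4, 5]})
    result = add_revenue_per_unit(df)
    assert list(result["revenue_per_unit"]) == [25.0, 50.0]
    assert {"revenue", "units"}.issubset(result.columns)
```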
Manual testing of data processes need not apply. Failure is too likely when people are involved, since they can make mistakes like skipping a step on a checklist.
“If the automated process tells you that everything’s correct and the best information you have about what makes that data good is built into your process, again, you can move fast without worrying about breaking things because your tests and your process are going to tell you that things are in good shape,” he said.
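Checks like these can be plain assertions that run automatically on every load. A minimal sketch, assuming a pandas DataFrame of incoming orders with hypothetical column names:

```python
import pandas as pd

def validate_orders(df: pd.DataFrame) -> None:
    """Automated quality gate: stop the run loudly rather than ship bad data."""
    assert not df.empty, "no rows arrived; the upstream feed may be broken"
    assert df["order_id"].is_unique, "duplicate order IDs detected"
    assert df["order_date"].notna().all(), "null order dates detected"
    assert (df["amount"] >= 0).all(), "negative order amounts detected"

# Run the gate as an ordinary pipeline step before publishing to analytics.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_date": pd.to_datetime(["2019-04-01", "2019-04-01", "2019-04-02"]),
    "amount": [19.99, 5.00, 42.50],
})
validate_orders(orders)
```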
Continuing with the Principles
The third principle is version control, which prevents team members from overwriting each other’s changes, retains a file version history, identifies the set of files that comprise each build version, tracks files associated with different stages of the delivery pipeline, and lets individuals share code in a central repository for collaboration. “If you’re not using version control, start using it right away,” he said.
Branching and merging come next — if you’re writing your code in a way that supports modularity, it helps you test it better. When working on a feature, you do your code in your own branch, and that work is promoted to production when automated tests are passed. “If there’s an automated process on that master branch, it just gets picked up at the next build,” he said. “You can do it a couple of times a day. Again, with data, depending on what you’re doing, you’re not switching it out. You might have daily, hourly, monthly builds, whatever the frequency of the incoming data is, but knowing that you’re pushing stuff up and it’s just going to work lets you sleep at night.”
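One way to picture that kind of gate is a small driver script that runs the test suite and only kicks off the next build when everything passes; the test directory and pipeline script below are placeholders, not anything DataKitchen prescribes:

```python
import subprocess
import sys

def build_if_tests_pass() -> None:
    """Run the automated test suite; trigger the next pipeline build only if it is green."""
    tests = subprocess.run(["pytest", "tests/"], capture_output=True, text=True)
    if tests.returncode != 0:
        print(tests.stdout)
        sys.exit("Tests failed; skipping this build so nothing broken reaches production.")
    # Tests passed: kick off the regular build at whatever cadence the data arrives
    # (hourly, daily, monthly), confident it will just work.
    subprocess.run([sys.executable, "run_pipeline.py"], check=True)

if __name__ == "__main__":
    build_if_tests_pass()
```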
This all plays into another principle that’s key to DataOps: the creation of ephemeral environments. Everybody should have their own environment. If developers working on a new feature in their own branch of the code need an environment, one should be spun up on demand, either as a full snapshot of the data or a subset that helps speed development. In any case, this should be something a developer can do at the drop of a hat.
“Depending on your data sets, and if you’re using a cloud or hybrid architecture, spinning up a new cluster that’s got 50 gig of data is trivial; it takes like five minutes or less than that,” he said. “When you start to get into multiple terabytes and larger, then you’ve got to start to think a little bit more, be a little more thoughtful about what your test data sets look like.”
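A rough sketch of what spinning up a sandbox seeded with a test data subset could look like, assuming a SQL source; the connection strings, table name, and sample size are illustrative only:

```python
import pandas as pd
from sqlalchemy import create_engine

def spin_up_sandbox(source_url: str, table: str, sample_rows: int = 50_000) -> str:
    """Seed a throwaway environment with a bounded sample of production data."""
    source = create_engine(source_url)
    sandbox = create_engine("sqlite:///dev_sandbox.db")  # stand-in for a per-developer schema

    # A limited sample keeps spin-up to minutes even when the full data set is terabytes.
    sample = pd.read_sql_query(f"SELECT * FROM {table} LIMIT {sample_rows}", source)
    sample.to_sql(table, sandbox, if_exists="replace", index=False)
    return "sqlite:///dev_sandbox.db"

# Example: grab a fresh sandbox before starting work on a feature branch.
# sandbox_url = spin_up_sandbox("postgresql://warehouse/analytics", "orders")
```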
Stay away from monolithic code, he added. Instead, you want the ability to reuse and containerize all the code that comprises the data analytics pipeline. Code reuse can significantly boost coding velocity, and containerization makes reuse much simpler. Parameterizing processing keeps the pipeline flexible enough to incorporate different run-time conditions as necessary. “You give yourself permission to refactor,” he said.
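To make parameterization concrete, a pipeline step might take all of its run-time conditions as arguments so the same containerized code serves every environment; the parameter names here are hypothetical:

```python
import argparse

def run_step(source_path: str, target_table: str, run_date: str, environment: str) -> None:
    """A reusable step whose run-time conditions all arrive as parameters."""
    print(f"[{environment}] loading {source_path} into {target_table} for {run_date}")
    # ...the actual load/transform logic would live here...

if __name__ == "__main__":
    # The same containerized step serves dev, test, and production runs;
    # only the parameters change between invocations.
    parser = argparse.ArgumentParser(description="Parameterized pipeline step")
    parser.add_argument("--source-path", required=True)
    parser.add_argument("--target-table", required=True)
    parser.add_argument("--run-date", required=True)
    parser.add_argument("--environment", default="dev")
    args = parser.parse_args()
    run_step(args.source_path, args.target_table, args.run_date, args.environment)
```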
Another thing, as the DataOps Cookbook explains, is to work without fear or heroism. With DataOps and the optimization of workflows, quality is assured and that means engineers, scientists, and analysts all can relax.
“You need to think about all those principles and if you like them and you believe in what they can do, make sure that you pick tools that will help you do that, whether it’s one or multiple,” he advised. “Just think of the kind of ecosystem you’re building together.”