Containers are thriving in the IT community thanks to their value in packaging entire runtime environments and in supporting a microservices approach to building applications. In a survey conducted last year by 451 Research, 71 percent of 200 large enterprise IT leaders said they are using Kubernetes to manage their container infrastructure. That usage is driven primarily by Hybrid Cloud/Cross-Cloud integration and efficiency.
But there are opportunities to improve the de facto container orchestration system. BlueData saw that there were some gaps to be filled in deploying and managing complex distributed stateful applications and for optimizing Kubernetes for Big Data Analytics, Data Science, AI, Machine Learning, and Deep Learning. Most applications in these areas are typically stateful, consist of a multitude of co-operating services, and generally are not implemented in a Cloud-native architecture like Kubernetes.
BlueData – which offers a purpose-built software platform based on Docker Containers to manage the lifecycle of Big Data, AI, Machine Learning, and Deep Learning apps – came up with a way to bring such enterprise-level capabilities and support for distributed stateful applications to the Kubernetes open source community. It recently introduced BlueK8s.
“We now see that Kubernetes seems to be mature, and we see customers starting to deploy and use Kubernetes clusters in their data centers,” says Tom Phelan, co-founder and Chief Architect at BlueData. The company has rarely seen this kind of sea change in the industry; it may be comparable to the early 2000s with VMware and the emergence of virtualization, he says. Now, in 2018, Kubernetes essentially creates a thinner, lower-cost, and higher-performance solution than VMs, and clusters can indeed be a replacement for them.
With companies such as Red Hat and Pivotal throwing their weight behind Kubernetes deployments by offering commercial support for them in the enterprise, the risk for IT decision makers has been reduced. It became clear that Kubernetes wasn’t just a fun technology to play with anymore. “These giants have a serious push behind this, so we thought, what can we do to let complex stateful applications for Hadoop, Spark, Cassandra, and TensorFlow run effectively on a container orchestrator solution that historically has targeted stateless microservices apps?” Phelan says.
The BlueK8s initiative is designed to help customers invested in Kubernetes deployments who want to run Big Data and other stateful apps on the same pod infrastructure they have been using for stateless apps. But that means running apps with a different architecture: ones that are stateful rather than cloud-native or microservices-based. The Kubernetes community had introduced concepts like StatefulSets for creating stateful apps. “But they weren’t sufficient for running something like Hadoop or TensorFlow,” Phelan says.
BlueData joined the Cloud Native Computing Foundation (CNCF) that manages the Kubernetes open source project with the intention of “leading the charge to get Big Data apps working well on Kubernetes,” he says.
BlueK8s will consist of a series of open source projects to be launched over the next year that will make it easier to reliably run and secure distributed stateful workloads for Big Data, Machine Learning, and more on Kubernetes. The first offspring of the project is Kubernetes Director (KubeDirector for short), a custom controller which simplifies and streamlines the packaging, deployment, and management of complex distributed stateful applications for Big Data and AI use cases.
Simplicity at the Center
Today, to build and manage a cluster for a stateful operation, “I have to write application-specific Kubernetes Operator code,” Phelan says. That’s a fairly complex task requiring experts familiar with the internals of Hadoop or Spark as well as Kubernetes.
BlueData wanted to share with the Kubernetes community the fruits of its own experience running data applications in containers for many years, drawing on the expertise and intellectual property it has built. KubeDirector is built on the Kubernetes custom resource definition (CRD) framework, and it will enable data scientists to run Big Data apps on Kubernetes without having to write Kubernetes Operator code.
“It’s just specifying a simple YAML markup file with some basic information about an application like Spark,” Phelan says. “Kubernetes Director takes the YAML file, applies magic, and then deploys on Kubernetes in distributed environments.”
The company lists several KubeDirector capabilities for managing distributed data pipelines that consist of multiple applications such as Spark, Kafka, Hadoop, Cassandra, and TensorFlow. Those capabilities include the following:
- It leverages the native Kubernetes API extensions, design philosophy, and authentication.
- There is a minimal learning curve for those developers familiar with Kubernetes.
- It is not necessary to decompose an existing application to fit microservices patterns.
- It provides native support for preserving application configuration and state.
- It utilizes an application-agnostic deployment pattern, minimizing the time to onboard stateful applications to Kubernetes.
- It is application-neutral, supporting many applications simultaneously via application-specific instructions specified in YAML format configuration files.
BlueData expects that developers in the Kubernetes community will contribute additional YAML files to run more types of stateful applications.
For instance, “We hope that some business intelligence and business analytics vendors might write YAML files so that customers can deploy those kinds of proprietary BI and Machine Learning tools in open source methodologies,” Phelan says. “Their open source customers are asking them how they will run on Kubernetes in the data center. We hope they will say that they have a fast on-ramp with Kubernetes Director to get on the customers’ sites and deploy Kubernetes clusters tomorrow – not next year.”