One of the recent stories within the Big Data ecosystem is that Cisco is joining the AI hardware fray with a new deep learning server powered by eight GPUs. Cisco is promising support within its AI push for Kubeflow, “which is an open source tool that makes TensorFlow compatible with the Kubernetes container orchestration engine,” said James Kobielus, the Lead Analyst at Wikibon, in a recent DATAVERSITY® interview.
TensorFlow is an open source software library for numerical computation. Its flexible architecture is designed for easy deployment across a diverse array of platforms (CPUs, GPUs, TPUs) and a range of devices, from desktop computers and clusters of servers to mobile and edge devices. Originally developed by the Google Brain team (part of Google’s AI organization), TensorFlow pairs a flexible numerical computation core with strong support for machine learning and deep learning.
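To make that concrete, here is a minimal sketch of the kind of numerical computation TensorFlow performs. It assumes TensorFlow 2.x is installed, and the values are illustrative only:

```python
# Minimal sketch of TensorFlow as a numerical-computation library
# (assumes TensorFlow 2.x: pip install tensorflow).
import tensorflow as tf

# Define two constant tensors and multiply them; TensorFlow can place
# this computation on a CPU, GPU, or TPU without code changes.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0], [1.0]])
product = tf.matmul(a, b)

# Automatic differentiation, the core mechanism behind training
# deep learning models.
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2

print(product.numpy())              # [[3.], [7.]]
print(tape.gradient(y, x).numpy())  # dy/dx at x=3 -> 6.0
```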
James Kobielus believes containerization is initiating a new era in software. Containerization is remaking the landscape of nearly every IT software platform, and is impacting artificial intelligence (AI) and machine learning (ML). For example, Cisco recently announced it is working to improve the containerization of TensorFlow stacks. Kobielus said:
“When I talk about highly complex AI, I’m talking about something like TensorFlow. When you build a deep learning model in TensorFlow, let’s say the model is going to support an autonomous vehicle application, for example. There will be models, deep learning models inside the vehicle itself, of course, to be able to process sensor data to do object recognition and so forth. There will be deep learning models running within area-wide controllers that are controlling many vehicles, maybe for traffic congestion within a given zone.”
According to Kobielus, Apache Spark often runs in conjunction with the Hadoop Distributed File System (HDFS) as its persistence, or storage, layer. Spark is one of the premier machine learning development environments and takes an in-memory orientation. Increasingly, it is being used for real-time ETL and data preparation in hybrid deployments with TensorFlow, and it, too, is being containerized.
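As a rough illustration of that pattern, the sketch below uses PySpark for in-memory data preparation against HDFS. The file paths and column names are hypothetical placeholders, not taken from the interview:

```python
# Sketch of Spark used for ETL/data preparation ahead of model training
# (assumes pyspark is installed; paths and column names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tf-data-prep").getOrCreate()

# Read raw records from HDFS, the typical persistence layer under Spark.
raw = spark.read.csv("hdfs:///data/sensor_events.csv",
                     header=True, inferSchema=True)

# In-memory transformations: drop bad rows, derive a normalized feature.
prepared = (raw
            .dropna()
            .withColumn("speed_norm", F.col("speed") / F.lit(120.0)))

# Persist the cleaned features where a TensorFlow job can pick them up.
prepared.write.mode("overwrite").parquet("hdfs:///data/prepared_features")
```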
Kubeflow
Software containers allow organizations to easily move workloads across different types of environments. Essentially, Kubeflow is a Kubernetes-based framework and toolset used to build and train machine learning models, which can be containerized from the beginning. Some of the dominant themes in container research are Kubernetes orchestration, machine learning, and deep learning.
The containerization of the entire DevOps workflow for all application development is fast becoming the norm. This is especially true in the development of AI applications, said Kobielus. “Kubeflow enables DevOps to manage those applications within container-orchestrated environments from end to end.” Kubeflow is becoming a critical glue within the DevOps industry, including the AI DevOps space, and supports the containerization of AI. Azure’s new machine learning service supports container-based model management and DevOps, as does Apache Spark.
Kubeflow makes scaling machine learning models and deploying them to production as simple as possible, he said. Because machine learning researchers use diverse tools, a primary goal is to tailor the stack to the user’s requirements and provide an easy-to-use machine learning stack anywhere Kubernetes is already running.
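As a hypothetical sketch of what that looks like in practice, the snippet below submits a containerized TensorFlow training job to Kubeflow’s training operator as a TFJob custom resource, using the Kubernetes Python client. The image name, namespace, and script path are placeholders, and the exact resource shape may vary across Kubeflow versions:

```python
# Hypothetical sketch: submitting a containerized TensorFlow training job
# as a Kubeflow TFJob custom resource (assumes the kubernetes package and
# a cluster running Kubeflow's training operator; the image, namespace,
# and script path are placeholders).
from kubernetes import client, config

config.load_kube_config()  # use local kubeconfig credentials

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist-train", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,  # scale out training by adding workers
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "tensorflow",
                            "image": "example.com/mnist-train:latest",
                            "command": ["python", "/app/train.py"],
                        }]
                    }
                },
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="tfjobs", body=tfjob)
```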
Machine Learning
Machine learning has evolved into a form of data analysis used for identifying patterns and predicting probabilities, and it continues to exist as a subdivision of AI research. By training on data with “known” answers, a model can learn to predict responses to situations it has never seen. Machine learning has had fair success in solving well-specified tasks, and it is estimated that AI and ML will be the lead catalysts driving cloud computing. To work effectively, ML needs to learn efficiently and to integrate with cloud technologies, including containerization.
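A toy illustration of that idea, with made-up numbers: fit a model to examples with known answers, then ask it about an input it has never seen.

```python
# Toy supervised learning: learn from examples with "known" answers,
# then predict the response to an unseen input (values are made up).
import numpy as np

# Training inputs paired with known answers (roughly y = 2x + 1).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Least-squares fit of a line to the labeled examples.
slope, intercept = np.polyfit(x, y, deg=1)

# Predict the answer for an input the model was never shown.
print(slope * 10.0 + intercept)  # close to 21
```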
With this in mind, Google developed Kubeflow, a portable, composable, and scalable machine learning stack built atop Kubernetes. Kubeflow offers an open source platform for moving ML models by attaching them to containers and performing the computations alongside the data, rather than within a superimposed layer. Kubeflow helps resolve the basic problem of implementing ML stacks: building production-grade machine learning solutions requires a variety of data types and tools, and stacks have often been assembled from different toolings, which makes the algorithms complicated and the results inconsistent.
Deep Learning at the Edge
Deep learning is a subdivision of machine learning in which deep neural networks “learn from experience” and understand the world as a hierarchy of concepts. This hierarchy lets a computer handle complicated concepts by building them out of simpler ones, as the sketch after the quote below illustrates. Real-world organizations have combined machine learning and open source platform technologies in ways that the original developers of these separate open source projects never anticipated. Kobielus stated:
“I think that deep learning and AI will be a powerful and absolutely essential piece to bring the Cloud Native Computing revolution all the way out to every device. We are seeing across the board in the mobile computing space that intelligent devices, autonomous devices, are where AI will live in everybody’s environment, on everybody’s machines.”
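Here is a minimal sketch of that conceptual hierarchy, using TensorFlow’s Keras API: early layers pick out simple features and later layers compose them into more abstract ones. The input shape and layer sizes are illustrative assumptions, not a production architecture.

```python
# Sketch of the hierarchy described above: each layer builds more abstract
# features on top of the simpler ones below it (assumes TensorFlow 2.x;
# input shape and layer sizes are illustrative only).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu",
                           input_shape=(64, 64, 3)),   # edges, textures
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # parts, shapes
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),      # object-level concepts
    tf.keras.layers.Dense(10, activation="softmax"),   # final recognition
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```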
Such innovations are already happening with face recognition, speech recognition, and much more. But they need to happen in a standardized way, enabled through a standardized cloud-to-edge deployment environment that is containerized and uses Kubernetes. He continued:
“As a developer, I see that the key is the ability to containerize those models which are performing different tasks, and to enable those models to be hooked together in terms of an orchestration that enables them to play together as components within a distributed application environment. Also, that enables those models to be monitored and managed on a real-time basis, often through a streaming backplane.”
Eclipse and the Cloud Native Computing Foundation (CNCF) have recently announced they are collaborating on building out the stack of containerized open source code, and the tools needed to deploy deep learning and machine learning containers into edge devices. The Eclipse Foundation provides a business-friendly environment for open source software, innovation, and collaboration.
Several months ago, the Eclipse Foundation initiated a project called Ditto, contributed by Bosch. The project’s focus was on using digital twin technology to enable the development of AI designed to run in a containerized fashion on edge devices.
Data Curation
Data curation is about managing and maintaining data and metadata assets. During the interview, Kobielus said:
“I like to use the word curation. The industry curates the stack at its several levels. The community curates by deciding what gets accepted as a project, what gets submitted to a work group to build it out, and then what ultimately rises from sandbox, to incubating, to graduated in this community curation. There’s vendor curation, meaning each of the vendors, cloud curation, and server curation.”
Kobielus sees this type of curation as a necessary component of this new era. Some things will get accepted very broadly, or universally, and will take on a life of their own. Others will fall by the wayside, as happened in the early days of Hadoop, he said:
“I remember there were a few of the pieces of Hadoop like, for example, the Mahout Machine Learning library. I know that’s got some adoption out there, but it hasn’t achieved anywhere near the level of adoption the Spark library has.”
He doesn’t think data scientists, who are the core developers of AI, have fully caught on to the fact that they need to become more knowledgeable about containers and Kubernetes, “because it’ll be in their tools and it’ll be the target environment for deployment of their models. So DATAVERSITY’s readers need to be aware of that.” Data scientists, AI developers, data architects, and everyone else in the industry need to understand how and why these new technologies are now core components of their data stacks, he said in closing. Everyone involved needs to understand this or they will get left behind as this new age of data moves forward.