The tail end of 2017 saw the official release of Hadoop 3.0 from the Apache Software Foundation. With it came HDFS erasure coding, which enables a significant reduction in storage overhead and cost. Opting for erasure coding instead of three-way replication – in which three copies of each block of data are spread across different physical devices to avoid a single point of failure, incurring 200 percent storage overhead – delivers the same degree of fault tolerance while consuming far less space.
That matters now that the growing size of data sets for Big Data Analytics makes the disk penalty increasingly prohibitive. “Most data sets now are well over 10 petabytes,” says Tom Phelan, Co-founder and Chief Architect at Big Data solutions vendor BlueData Software. “Paying a two-times cost just to prevent data loss is too much overhead.” With erasure coding, the storage space overhead is closer to just 30 percent, he says.
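For readers who want to see where figures like that come from, here is a minimal sketch of the arithmetic, assuming the RS-6-3-1024k and RS-10-4-1024k Reed-Solomon policies that ship with Hadoop 3.0 (the exact overhead depends on which policy a cluster enables; the 10 PB figure is simply the data set size quoted above):

```java
// Storage footprint for the same logical data set under 3x replication
// vs. Reed-Solomon erasure coding. RS(6,3) and RS(10,4) correspond to the
// RS-6-3-1024k and RS-10-4-1024k policies bundled with Hadoop 3.0; the
// 10 PB data set size is the figure quoted in the article.
public class StorageOverheadSketch {

    // Raw bytes stored per logical byte: 3.0 for 3x replication,
    // (k + m) / k for an RS(k, m) scheme.
    static double footprint(double logicalPb, double multiplier) {
        return logicalPb * multiplier;
    }

    public static void main(String[] args) {
        double logicalPb = 10.0;

        double replication = 3.0;        // 200% overhead
        double rs63 = (6 + 3) / 6.0;     // 1.5x -> 50% overhead
        double rs104 = (10 + 4) / 10.0;  // 1.4x -> 40% overhead

        System.out.printf("3x replication : %.1f PB raw (%.0f%% overhead)%n",
                footprint(logicalPb, replication), (replication - 1) * 100);
        System.out.printf("RS(6,3)        : %.1f PB raw (%.0f%% overhead)%n",
                footprint(logicalPb, rs63), (rs63 - 1) * 100);
        System.out.printf("RS(10,4)       : %.1f PB raw (%.0f%% overhead)%n",
                footprint(logicalPb, rs104), (rs104 - 1) * 100);
    }
}
```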
But this enormous advantage doesn’t stand alone.
“For 15 years or so Hadoop deployments have been built around the assumption that to get to high performance in Big Data analysis the data needs to reside on the same physical host as the CPUs doing the processing,” Phelan explains.
Co-locating compute and storage was necessary to minimize network bandwidth use and read latency. The side effect was that compute and storage resources could not be scaled independently: the tight coupling made scaling expensive, hard to manage, and brittle.
Erasure coding changes that dynamic: because data and parity blocks are striped across multiple nodes, every HDFS read and write goes over the network. “That’s a huge paradigm shift for Big Data Analytics,” he says. “It has broken the concept of compute and storage co-location by its very nature.” Scaling compute and storage independently can drive higher utilization and cost efficiency for Big Data workloads, and it eliminates the cost and risk of data duplication when analytical processing is separated from data storage.
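To make that shift concrete, here is a back-of-the-envelope sketch of how a read that once stayed on a single host becomes a striped read across the network; the 256 MB read size and the RS(6,3) layout are illustrative assumptions, not measurements from any particular cluster:

```java
// Rough comparison of where read traffic goes under co-located replication
// vs. erasure-coded striping. Figures are illustrative assumptions.
public class ReadPathSketch {
    public static void main(String[] args) {
        long readBytes = 256L * 1024 * 1024; // a 256 MB read (assumed)

        // With 3x replication and co-located compute, the scheduler can
        // usually place the task on a node holding a replica, so the bulk
        // of the read is served from local disk.
        long replicatedRemoteBytes = 0;

        // With RS(6,3) striping, the file's cells are spread across 6 data
        // blocks (plus 3 parity blocks) on different nodes, so essentially
        // the whole read crosses the network.
        int dataUnits = 6;
        long perNodeBytes = readBytes / dataUnits;

        System.out.printf("Replicated, data-local read: ~%d MB over the network%n",
                replicatedRemoteBytes / (1024 * 1024));
        System.out.printf("RS(6,3) striped read       : ~%d MB total, ~%d MB from each of %d nodes%n",
                readBytes / (1024 * 1024), perNodeBytes / (1024 * 1024), dataUnits);
    }
}
```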
Erasure coding, in fact, became practical because of the evolution of network technology: network and CPU performance both increased, and did so faster than storage controller performance. “We’re not limited by one-gigabit networks now,” he says. “Every data center has 20, 30, 40, even several hundred gigabits per second hardware.”
Opening Whole New Worlds
Erasure coding is built on the assumption that the network is no longer the bottleneck, he says – a huge change that opens many possibilities. People have taken Hadoop to the public Cloud for years, for example, but it has tended to be for special cases: non-time-critical jobs might be sent to run on Amazon services in the hope that its internal network was fast enough for them to complete in a reasonable amount of time. “But always it was not really a best practice to the hard-core Hadoop guys,” Phelan says.
Erasure coding does carry a CPU penalty: some server cycles go to computing the parity blocks added to the data so that failed blocks can be recovered. So, with Hadoop 3.0, erasure coding may at first be reserved for less frequently accessed data tiers rather than replacing three-way replication outright. “Hot data that you are processing would still be three-way replicating,” he says. But that issue is already being addressed by vendors such as Intel, whose ISA-L acceleration library more or less eliminates that CPU overhead, Phelan says.
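Those parity computations are where the extra CPU cycles go. Below is a minimal sketch of the underlying idea, using simple XOR parity (one parity unit protecting two data units, as in HDFS's XOR-2-1 policy); Hadoop's Reed-Solomon policies generalize the same idea with Galois-field arithmetic, which is what libraries like ISA-L accelerate:

```java
import java.util.Arrays;

// Minimal illustration of why erasure coding costs CPU: parity has to be
// computed on write and used to rebuild data after a failure. This uses
// simple XOR parity; Hadoop's RS(6,3) scheme generalizes the idea with
// Galois-field math, which acceleration libraries such as Intel's ISA-L
// speed up.
public class XorParitySketch {

    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] ^ b[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data1 = "hot analytics data".getBytes();
        byte[] data2 = "cold archive data!".getBytes(); // same length for simplicity

        // On write: spend CPU cycles computing the parity unit.
        byte[] parity = xor(data1, data2);

        // On failure: pretend data1's block is lost and rebuild it from the
        // surviving data unit and the parity unit.
        byte[] rebuilt = xor(parity, data2);

        System.out.println("Recovered: " + new String(rebuilt));
        System.out.println("Matches original: " + Arrays.equals(rebuilt, data1));
    }
}
```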
“It’s fairly certain that as Hadoop 3.1 and 3.2 roll out, and as these acceleration libraries become commonplace, that the best practice will be that erasure coding storage is recommended not only for cold storage but for warm and hot data, too,” he says.
At that point, organizations will likely want to move their data from three-way replicated HDFS into erasure-coded HDFS storage to get its full benefits in performance and flexibility.
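A minimal sketch of what that step could look like with the Hadoop 3.x client API appears below; the directory path and policy choice are hypothetical, the policy may first need to be enabled by a cluster administrator, and existing replicated files only pick up the policy when they are rewritten into the erasure-coded directory:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch of tagging an HDFS directory with an erasure coding policy using
// the Hadoop 3.x client API. Paths and policy are illustrative assumptions;
// new files written under the directory are erasure coded, while existing
// replicated files must be copied in to be converted.
public class ErasureCodingPolicySketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        if (!(fs instanceof DistributedFileSystem)) {
            throw new IllegalStateException("Expected an HDFS filesystem");
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        Path warmData = new Path("/data/warm"); // hypothetical directory
        dfs.mkdirs(warmData);

        // Store new files under /data/warm with RS(6,3) instead of 3 replicas.
        dfs.setErasureCodingPolicy(warmData, "RS-6-3-1024k");

        System.out.println("Policy on " + warmData + ": "
                + dfs.getErasureCodingPolicy(warmData).getName());
    }
}
```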
Consider one of the new Big Data processing scenarios this enables: a user with a hundred-host cluster finds it needs more compute resources to run a job faster. It’s merely a matter of adding ten or so more physical boxes that don’t even need local storage.
“They can just be rack-mounted servers to dramatically increase performance,” he says. “They don’t have to copy the data or portions of the data to each of those new hosts. So, it’s a more flexible, agile, and cheaper way of using the Hadoop clusters. There are real world positive benefits to this.”
Taking the Blue Data Road
BlueData, he notes, saw this coming some years ago: its Big Data-as-a-Service platform (for on-premises, Cloud, or hybrid deployments) was built on the idea that you don’t need to co-locate compute and storage resources. “We are just happy to see the Hadoop community embracing the statement we have been making,” he says. BlueData’s own code handles I/O acceleration across the network so that users can run time-critical workloads, taking advantage of compute and storage separation while achieving the same or even better performance than they’d get in a bare metal server environment.
The company’s purpose-built software platform allows users to:
- Spin up instant multi-node clusters for any Big Data or Data Science tools, using Docker containers
- Enable compute and storage separation for Big Data environments
- Deliver true multi-tenancy across shared infrastructure, whether on-premises or in the Cloud
- Provide secure access to various datasets on-demand without data duplication
- Deploy unmodified and highly available Big Data clusters, then scale as needed
- Ensure performance and security comparable to co-located bare metal clusters
Phelan adds:
“The introduction of erasure coding into HDFS means we get more performance now as we don’t worry about three-way replication and hotspots in data, so we can let our customers utilize lower cost of hardware and still give them all the benefits of accelerated delivery across the network,” he says.
The company will be releasing a new version of its software that conforms to the updated client-side HDFS APIs that accompany the new way of storing data.
“This is a step in the evolution of Hadoop and we are looking forward to more enhancements in the performance of this distributed file system,” Phelan says.
Photo Credit: Titima Ongkantong/Shutterstock.com