Click to learn more about authors Oksana Sokolovsky and Rohit Mahajan.
With the Big Data market and its seminal technology, Hadoop, each about a decade old, enterprise customers need answers to important questions about the health of the Big Data ecosystem and its state of adoption. The ecosystem is maturing. What does this mean for Hadoop, Big Data and Open Source Analytics in general? How does it impact enterprise companies, their data platform strategies and their resulting Data Governance requirements?
Major Progress, and a Few Steps Back
Adoption of Hadoop as a platform has solidified. Hadoop’s storage system (HDFS) as well as Hadoop-based open source projects like Hive, HBase and Atlas have found their way into the enterprise computing mainstream. There’s a growing appreciation in the enterprise for data lake technology. Enterprise customers now acknowledge the value in maintaining data in raw form, on low-cost storage media or services, and querying it on an exploratory basis.
There have been setbacks for the Big Data world that can’t be ignored, though. To start with, there’s been fragmentation in the Hadoop ecosystem that has created confusion, and adoption risk, for customers. Another factor has been the growth in popularity of Apache Spark, alongside Hadoop.
Going forward, customers will have their data in a variety of repositories: Hadoop, Spark, data warehouse platforms, and various operational databases. Physically, the data will be stored in a combination of HDFS, cloud object storage and database-specific storage layers.
The Governance Challenge
As a result of this flux, customers’ data will be itinerant and dispersed. Even as the Big Data ecosystem has gained critical mass, complexity remains, and there’s more to come.
New platforms will create new homes for data, and this creates the danger of new data silos, something enterprise customers can’t afford. Customers need to manage their data proactively; new regulations preclude a lackadaisical approach to Data Management.
Enterprise customers will need to look for Data Governance or data catalog platforms that can bridge across the data landscape, including classic platforms and leading-edge technologies. The governance platform vendor will also need to add support for data technologies that have yet to emerge, once they attract a critical mass of adoption.
Innovation in tech is exciting, but it brings complexity along for the ride. In an era of significant data protection regulation, this creates serious risk that customers must manage. A savvy choice of the data catalog/governance platform is essential to addressing that risk. It’s the only way to fully leverage data assets for digital transformation – safely, ethically and profitably.
Good Things to Come
Recently, Cloudera and Hortonworks, the two Big Data companies that can each trace their lineage to the original Hadoop team, finalized their merger. This is good news on the fragmentation front: the two companies will go through the process of merging (or purging) their respective ecosystem projects and technologies as well. This will create some much-needed market clarity and should boost customer confidence. It’s a pro-growth move.
The open source nature of Hadoop and other components in the ecosystem has likely posed challenges to both companies. While there’s nothing slouchy about open source software, the “commodity” perception it creates for itself can make it hard for companies to build a business model around it. But the new combined Cloudera-Hortonworks entity will be the only major vendor supporting a more standard distribution, and that could remove the commodity stigma.
Caution is Still Prudent
Even as the Cloudera-Hortonworks merger brings about simplification, new complexity will emerge. While the Big Data ecosystem has established itself as mainstream, there’s still a lot of churn left in the platform, since new engines and repositories keep presenting themselves. Conversely, companies will merge, get acquired or simply exit the market, causing certain technologies to become deprecated and de-supported, yet persist as data repositories for their erstwhile customers.
Data Governance platforms will continue to be critical to mitigating the risk brought about by this complexity. A good Data Governance implementation anticipates complexity in the data landscape and offers an abstraction layer over it, for data engineers, data scientists and especially for analysts and business users. Solid Data Governance lets you navigate the data complexity storm and, where data is concerned, it’s essential to success, regulatory compliance and innovation.
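To make the idea of an abstraction layer concrete, here is a minimal sketch in Python of a catalog that records datasets living on different platforms and lets a user find them by classification tag rather than by physical location. It is only an illustration under stated assumptions: the class names (DatasetEntry, DataCatalog), the platform labels, the paths and the tags are all hypothetical and do not describe any particular governance product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog record (hypothetical): where a dataset lives and how it is tagged."""
    name: str
    platform: str   # illustrative labels such as "hdfs", "object_storage", "warehouse"
    location: str   # platform-specific path or table name (made up for this sketch)
    tags: frozenset = field(default_factory=frozenset)


class DataCatalog:
    """Hypothetical in-memory catalog that spans several storage platforms."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        # A later registration under the same name replaces the earlier one.
        self._entries[entry.name] = entry

    def find_by_tag(self, tag: str) -> List[DatasetEntry]:
        # Users search by classification, not by where the data physically sits.
        matches = (e for e in self._entries.values() if tag in e.tags)
        return sorted(matches, key=lambda e: e.name)

    def platforms(self) -> List[str]:
        # Distinct platforms currently known to the catalog, in sorted order.
        return sorted({e.platform for e in self._entries.values()})


catalog = DataCatalog()
catalog.register(DatasetEntry("clickstream_raw", "hdfs", "/data/raw/clickstream", frozenset({"raw"})))
catalog.register(DatasetEntry("customers", "warehouse", "sales.customers", frozenset({"pii"})))
catalog.register(DatasetEntry("support_tickets", "object_storage", "tickets/", frozenset({"pii", "raw"})))

# Names of every dataset tagged as personal data, regardless of platform.
pii_datasets = [e.name for e in catalog.find_by_tag("pii")]
```

In this sketch, an analyst asking for everything tagged "pii" gets results from the warehouse and from object storage without needing to know either platform exists. Supporting a new repository means registering entries with a new platform label, which leaves the lookup interface unchanged.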