What’s going on in the data world right now, and how will it impact the market in 2018? There are the obvious, banner headlines, of course: AI is everywhere and will change everything; Enterprises continue to move their infrastructure – and data – to the Cloud; GDPR will make data protection every company’s priority. But you knew all that. And maybe you were a bit skeptical of the grandiose claims anyway.
What substantive changes are really taking place? What do you need to be aware of as you set your architectural and procurement strategy and make decisions in those areas? We set out to identify seven impactful changes taking place in the analytics arena right now, and we present them to you here.
- Hadoop is Fundamental
Yes, those Big Data project failure rates have been high. Yes, Spark has in some ways displaced Hadoop, and increasing numbers of customers are running the former independently of the latter. So the industry blames Hadoop…and stops uttering its name. Hadoop must be dead, right?
Wrong! Everyone’s talking about Data Lakes now and, much of the time, that’s just code for Hadoop. And while, yes, many organizations are implementing their data lakes in cloud storage layers, they’re often using Hadoop ecosystem technologies to analyze that data. Beyond that, consider that cloud storage layers can be made to emulate HDFS, Hadoop’s file system, and you start to realize that when you ponder cloud Data Lakes and Hadoop Data Lakes, there’s a distinction without much difference.
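To see how thin that distinction is, consider that pointing a Hadoop-ecosystem engine like Spark at cloud storage is often just a change of URI scheme. Here's a minimal PySpark sketch; the paths and bucket name are hypothetical, and it assumes the s3a connector (hadoop-aws) is on the classpath:

```python
from pyspark.sql import SparkSession

# Spark resolves storage through Hadoop's FileSystem API, so HDFS and
# cloud object stores look, to analytics code, like interchangeable URIs.
spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Reading from an on-premises, HDFS-backed Data Lake (path hypothetical)...
on_prem = spark.read.parquet("hdfs:///lake/clickstream/2018/")

# ...versus a cloud storage Data Lake: same API, different URI scheme.
# The bucket name is invented for this example.
in_cloud = spark.read.parquet("s3a://my-data-lake/clickstream/2018/")

# Downstream analysis is identical either way.
in_cloud.groupBy("page").count().show()
```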
The good news is that this year, Hadoop’s going to do what it always should have: see adoption by the Enterprise, without great fanfare. Hadoop will become one data tool among many, and will be used when it makes tactical sense. It’s the combination of data technologies, including Hadoop, Spark, Business Intelligence (BI) and Data Warehouses, that makes the current analytics market so exciting.
- Bye-Bye, Enterprise Stack BI
Earlier this year, MicroStrategy, the Enterprise BI pure play, announced its concession to the companies that compete with it on the front-end by introducing connectors to their products. MicroStrategy is doubling down on its belief that its back-end OLAP platform, and associated data governance capabilities, are where it can best monetize. The company also seems to have decided that competing on the visualization and dashboard side is difficult and, even to the extent that it can be successful, provides diminishing returns.
Will the back-end be enough to sustain Enterprise revenue and support growth? We’ll have to see. But one thing’s for sure: The monolithic Enterprise BI stack has become disaggregated, and old dogs will need to learn new tricks.
- Data Hierarchies
Maybe you’re familiar with the concept of data hierarchy, in terms of data storage and its correlation with frequency of access. “Hot” data – that which is used most often – is sometimes routed to very fast storage like solid state drives, or even CPU memory cache. Colder data is often routed to older – but cheaper – spinning hard disk drives.
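As a toy illustration of that routing logic, here is a sketch in Python; the thresholds and tier names are invented for the example, not drawn from any particular product:

```python
from datetime import datetime, timedelta

def storage_tier(last_accessed: datetime, now: datetime) -> str:
    """Route data to a storage tier by how recently it was accessed.
    Thresholds and tier names are illustrative only."""
    age = now - last_accessed
    if age < timedelta(days=7):
        return "ssd"        # "hot" data: fast solid state storage
    elif age < timedelta(days=90):
        return "hdd"        # "warm" data: cheaper spinning disks
    else:
        return "archive"    # "cold" data: cheapest, slowest tier

now = datetime(2018, 1, 15)
print(storage_tier(datetime(2018, 1, 14), now))  # -> ssd
print(storage_tier(datetime(2017, 6, 1), now))   # -> archive
```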
With the storage hierarchy well-established, we’ll start to see recognition for other hierarchies this year. For example, analytics involves work with everything from experimental data sets that may be relevant to particular teams or business units, to highly structured, vetted and consensus-driven data that is useful to the entire Enterprise. In the middle are structured data sets that – possibly due to size, or level of cleanliness – are seen as somewhat less than production-level.
Experimental data sets sit best in a Data Lake. Highly vetted data sets are most logically kept in a data warehouse. And the mid-level data sets will likely live in Hadoop or Cloud storage, but will often be queried from relational databases, using SQL-on-Hadoop bridges like IBM Big SQL, Microsoft PolyBase, and Oracle Big Data SQL.
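None of those bridges is driven from Python directly, but Spark SQL makes a reasonable stand-in to sketch the pattern they all enable: expose raw files in the lake as a table, then reach them with plain SQL. The path, table and column names below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-bridge-demo").getOrCreate()

# Mid-tier data lives as raw files in Hadoop or cloud storage
# (path is hypothetical)...
events = spark.read.json("hdfs:///lake/raw/events/")

# ...but is registered as a table, so analysts can query it with SQL,
# much as PolyBase or Big SQL expose lake data from a relational engine.
events.createOrReplaceTempView("events")

spark.sql("""
    SELECT user_id, COUNT(*) AS sessions
    FROM events
    GROUP BY user_id
    ORDER BY sessions DESC
    LIMIT 10
""").show()
```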
Another hierarchy might stratify data according to whether it will be used in the design of machine learning models or just for straight analysis. And another might be defined by the trustworthiness of the data source.
Hierarchies will be important because there’s also a hierarchy of tools and technologies: BI and Big Data Analytics tools on the query side, and transactional databases, NoSQL databases, Data Warehouses and Data Lakes on the repository side. Eventually, the hierarchies might simplify and the technologies might consolidate. But with so many technology choices right now, we’ll need hierarchies in the data to dictate our best practices in toolchain deployment.
- Visualization Commoditization
MicroStrategy’s announcement of connectors to Tableau, Qlik and Power BI is more than a concession to competitors. It’s a de facto acceptance that those three self-service BI tools are now, essentially, the standard! Together, those companies have erected a barrier to entry for anyone else hoping to do well in the visualization space.
They have also commoditized the whole area. Between Tableau Public, Qlik Sense Cloud Basic and Power BI Desktop (and the free tier of the Power BI Cloud service), there’s a long tail of entry-level analytics that can be done for free. Add in tools like plot.ly, the D3 ecosystem and open source geospatial/mapping platforms, and you’ll find your analytics capabilities are more limited by available time than they are by money.
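For a small taste of that free tier, here is a minimal chart built with plotly’s open source Python library in offline mode; the sales figures are made up for the example:

```python
import plotly.graph_objs as go
from plotly.offline import plot

# Made-up quarterly figures, purely for illustration.
bars = go.Bar(x=["Q1", "Q2", "Q3", "Q4"], y=[120, 135, 150, 180])
layout = go.Layout(title="Quarterly sales (illustrative data)")

# Writes an interactive HTML chart locally: no license fee, no cloud
# service, just `pip install plotly`.
plot(go.Figure(data=[bars], layout=layout), filename="sales.html")
```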
Users are taking good data viz capabilities for granted now. They are still impressed by them, but not bowled over. Good viz isn’t so much a competitive edge anymore. Rather, bad viz is a competitive liability.
- We’re from the Governance and We’re Here to Help
Data Governance technologies, which once inspired little more than emphatic yawns, are finally starting to get some respect. Data regulations – the European Union’s General Data Protection Regulation (GDPR) being a case in point – have finally made lack of governance enough of a pain point that demand for effective governance tools will be significant this year.
Even if regulatory compliance is the catalyst, though, there will be other drivers behind the governance push. One of the biggest is data catalogs, and their ability to make data sets in the Data Lake more organized, and more discoverable. Data discovery tools, which look through your databases and Data Lakes and report on relationships and data flows within and between them, help in this regard, too. These tools, in turn, make the Data Lake itself more usable and the investment in it more effective. As companies seek better returns this year on investments made in prior years, data catalogs and discovery tools will increase in popularity, further driving the governance wave.
- Cloud Data Lakes = Cloud Data Lock-In
We’ve already talked about the trend of Cloud storage-based Data Lakes. But the reality is this isn’t just an interesting Cloud use case that emerged organically. In fact, it’s a central selling point and selling strategy for the major Cloud vendors.
The more of your data that you land and maintain in a given Cloud platform’s storage layer, the more work you will do on that data on that same Cloud platform. Data preparation. Analytics. Predictive modeling and model training (on high-end, GPU-accelerated virtual machines). The Cloud battle is the data storage battle, and the winner may have you quite locked in.
- Contain(er)ed Revolution
Everyone knows that Docker-based container technology, in the data center and software development worlds, is changing everything. The disruption is loud and proud.
But did you know similar things are happening in the data and analytics world as well? It’s harder to tell, because the shift is much less conspicuous. But it’s real, nonetheless.
- MapR has already re-oriented its Converged Data Platform around containers with its PACC (Persistent Application Client Container).
- The Cloud providers are also taking advantage of container technology to make the provisioning of nodes faster and to facilitate greater resource sharing – allowing ephemeral clusters to seem persistent.
- Hadoop itself, which recently hit its 3.0 release, will soon support the ability for code deployed to YARN to run in the context of a Docker container, thus allowing Hadoop job code dependencies to differ from what may be installed on each node in the cluster (see the sketch after this list).
- What’s becoming clear is that each software vendor whose products depend on particular versions of other software is recognizing that containers eliminate version conflict problems – for them and for their customers.
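To make the Docker-on-YARN point concrete, here is a sketch of submitting a Spark job whose dependencies live in a container image rather than on the cluster’s nodes. The image name and job script are hypothetical, and the configuration assumes a Hadoop 3.x cluster where the administrator has enabled YARN’s Docker runtime:

```python
import subprocess

# Hypothetical image packaging the job's exact library versions,
# independent of what is installed on each cluster node.
IMAGE = "registry.example.com/analytics/etl-job:1.0"

# YARN selects its Docker runtime per-container via environment
# variables, set here for both the application master and executors.
subprocess.run([
    "spark-submit", "--master", "yarn",
    "--conf", "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker",
    "--conf", "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=" + IMAGE,
    "--conf", "spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker",
    "--conf", "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=" + IMAGE,
    "etl_job.py",  # hypothetical job script
], check=True)
```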
Go Forth, and Make Decisions
Big trends are fun to identify and predict. But it’s the specific, focused changes taking place in the industry, and the strategies being pursued by vendors and customers, that can inform your own plans. They provide fodder for your decisions: what you’ll do this year, what you won’t, and what outcomes you can reasonably expect. In an innovative hot spot like analytics, you constantly need to plan your initiatives and place your big bets, but you need to hedge them too. We hope and believe this list of seven trends can help you do both.