It sounds appealing – easily store all of your data in a single location, where all of your users and applications can access it and put it to use. It’s no wonder that interest in data lakes rose rapidly at the same time hype around “Big Data” was exploding: a data lake seemed like the natural place to store big data, especially when you don’t know exactly what you will be doing with it. But this approach has fallen drastically short of expectations, and I predict that 2019 will mark the start of a new trend: draining the data lake.
In many instances, data lakes have become derided as "data swamps," murky pools of unknown content in which it is difficult or impossible to really see and understand what's inside. Late last year, even Gartner felt compelled to publish guidance for businesses trying to salvage failing data lake projects.
So, why are data lakes failing? There are a number of reasons:
Too Complex
Data lake projects turned out to be far too complex for most mere mortals to take on. They were often built around Hadoop, a complex system in its own right. Add in the additional technologies needed to handle data governance, security, data transformation, and more, and it all becomes too much to deal with.
Yet replacing Hadoop with something simpler and more elastic for storing data, such as cloud storage, also failed to provide the answer, because teams then faced the added challenge of building or integrating the tools and services required to make that storage useful. In short, while the concept of a data lake is simple, the actual execution is anything but.
Resource Drain
The time and effort put into maintaining data lake platforms – and trying to make the data in the lake useful – ended up sucking resources away from other important projects. Accumulating large quantities of data in a complex system such as Hadoop meant that you not only became more dependent on skilled data engineers who could extract something useful from that data, but also needed skilled people to operate and maintain the system itself. With demand for these specially trained workers rising faster than the supply, finding and retaining good talent is already hard and will only become more difficult.
Data Privacy and Compliance Risk
The parade of stories of data misuse or compromise is never-ending. Add to that the critical need to meet data regulations such as GDPR and California's new data privacy law, and some are questioning the decision to build a data lake in the first place. After all, the founding premise of the data lake – "we'll keep as much data as possible for as long as possible in case we might ever need it" – has become a major risk, and even a liability, for organizations as they face growing regulatory pressure to reduce retention of personal data, along with the reputational and legal exposure that comes with data breaches and misuse.
Sluggish Speed
Data lakes were designed for a batch-oriented world, where it’s okay if insight takes time, not today’s world of constantly arriving data that needs immediate analysis and reaction. The data lake itself made sense when technology limitations meant that data had to be stored and processed in batches, a scenario that is increasingly less relevant as advances in fast-data technology have allowed companies to process and act on data as it arrives.
Thus, in a very real sense, the data lake has become a giant drag on the ability to act on and react to data quickly. The effort required to route all of your data into a data lake, only to then have to explore and transform it before you can get any value out of it, adds multiple layers and delays to the process.
Wrong Fit
Because data lakes are best suited to scenarios where you don't yet know what you want to do with your data (i.e., you store it until you do), they have not worked well in cases where you have a clear idea of what you want. Teams in those situations end up creating a separate pipeline that bypasses the data lake, which defeats – or at the very least complicates – the whole premise of the data lake as a centralized repository of all data.
Draining the Data Lake Can Help
For these reasons, enterprises have begun questioning the centrality of the data lake in their data architectures. Some have already made that decision and are starting to drain the lake of data sources that are not well suited to it – in other words, sources with a clearly understood use case, or where there is value in acting on the data quickly.
That's not to say that data lakes will no longer have a purpose, just that their usage will be more selective rather than all-encompassing. For instance, data lakes will likely continue to be used as sandboxes for data science – places where data is kept so that data scientists can run experiments and explore it. However, data will no longer flow indiscriminately into the data lake for indefinite retention; rather, the lake will primarily be used for temporary storage of data from new sources whose use cases and applications are not yet known (hence the exploring and experimenting). Once use cases for that data are identified, data pipelines to support them will be created outside the data lake for simpler, more efficient data movement. An added bonus: this approach reduces the difficulty of compliance as well as the risk of presenting hackers with one huge "pot of data gold" to target.
Draining the data lake also places greater emphasis on the data pipelines, where all of the current action takes place, as opposed to the historical storage role of a data lake. Emerging approaches to data distribution and data pipelines will become central to how data is collected and delivered to users and applications for deployed use cases. Data-driven technologies such as messaging and stream processing (and especially services based on them) will be at the core of this approach, as they allow event-driven data pipelines that can collect, process, and deliver data as soon as it arrives, in the form each application needs.
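To make that concrete, here is a minimal sketch of such an event-driven pipeline. The article does not prescribe a particular technology, so this example assumes a Kafka cluster and the kafka-python client; the topic names and the transform are hypothetical placeholders. Each event is consumed the moment it arrives, reshaped, and delivered straight to the topic an application reads from, with no stop in a data lake.

```python
# Illustrative sketch only: assumes a Kafka broker at localhost:9092 and the
# kafka-python package. Topic names ("orders.raw", "orders.enriched") and the
# enrich() transform are hypothetical.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders.raw",                               # hypothetical source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def enrich(event: dict) -> dict:
    """Placeholder transform: shape the raw event into what the consuming app needs."""
    return {
        "order_id": event["id"],
        "amount": event["amount"],
        "currency": event.get("currency", "USD"),
    }

# Blocks and processes each event as it arrives, delivering it immediately.
for message in consumer:
    producer.send("orders.enriched", enrich(message.value))
```

The same pattern applies with any messaging or stream-processing service; the point is that data moves continuously from source to application in the shape each application needs, rather than accumulating in a lake to be batch-processed later.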
Just as manufacturing was transformed by just-in-time supply chains that eliminated the expense and complexity of warehousing large inventories of parts and raw materials, data-driven pipelines and designs are going to transform how data is connected from sources to users and applications, forcing the data lake to slowly shrivel.