Learn more about Thomas Hazel.
This column will not be the proverbial “Pros and Cons” article, weighing the good against the bad. One can find such content year after year and month after month, all of it outlining the obvious advantages and disadvantages of one thing over another. This is particularly prevalent in the world of technology, probably because features and functions are so easy to compare and contrast.
Such comparisons are typically used to determine which path to choose and are often derived from others’ legwork, such as a Pilot or Proof-of-Concept (PoC). Traditionally, each Pro and Con is distilled into a one- or two-word statement based on surface-level analysis. In the case of complex systems, such as big data management, this can lead to limited or inadequate descriptions (big data is literally and figuratively complicated). Basic analysis is a good starting point for initial research, and thus a Pro and Con review would seem to apply well. But does it apply well enough for an Information Technology (IT) professional to bet the future “yet to come” on it?
This column will provide warehousing versus data lake distinctions, but more importantly it will outline their intrinsic drivers and hopefully begin to help exorcise the “ghost of data yet to come.” But first, it should be noted that the going perception is that warehousing is antiquated and unable to handle today’s and tomorrow’s big data demands, and that the prevailing winds all blow towards data lakes as the inevitable choice. Who will be the “winner” is complicated, to say the least, as mentioned in my previous column Big Data Doctrine – Warehousing vs. Data Lakes.
The Data Warehousing establishment will say data lakes are just a fad, a fling you had in high school that will never amount to anything. On the other side is the Data Lake proletariat, whose members see themselves as “information liberators,” freeing your data from today’s oppressive practices. Neither viewpoint is accurate, but some parallels can be drawn.
Data warehousing has been around for thirty years, providing business intelligence for structured operational data. The source of this data is, for all intents and purposes, Relational Database Management Systems (RDBMS). Warehousing is generally costly and proprietary, and many of the RDBMS providers also offer such solutions. Data lakes, on the contrary, are relatively new (around five years old) and are routinely based on commodity hardware and open-source projects (e.g. Hadoop). Typically, the source of this data is, well… anything and everything, but primarily semi-structured and then unstructured. So without further ado, the differences between “Warehouse vs. Data Lakes”:
The above table may seem small, and without a doubt there are more exhaustive lists, but these differences are the major distinctions one should take away. It should be noted that no category is exclusive to either choice; there are always outliers, but for the sake of our “this versus that” argument, it will most certainly suffice.
Data Type
The first distinction is the type of data each solution manages, which is generally the key factor in choosing one over the other. In its simplest form, one would choose warehousing if all source data is structured and data lakes if the source is anything but. Yet in many IT departments, a primary mission is to decrease information silos and complexity, not increase them. And when it comes down to utilizing big data (a.k.a. the big promise), information unification is the final objective.
Data Model
Another common perception is that warehouses are well engineered, whereas data lakes are accidental or at best incidental. This viewpoint is also not accurate, though it does have some parallels in reality. By definition, the warehouse process model is schema on write, which means source data needs to fit into a predefined structure. In other words, planning is a major part of executing a warehousing solution, as it is for other structured data sources such as SQL (relational) databases.
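To make schema on write concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a warehouse. The orders table, its columns, and the load_orders helper are hypothetical names chosen for illustration, not part of any particular warehousing product. The point is only that the structure exists before the data does, so a row that does not fit is rejected at write time.

```python
import sqlite3

# Hypothetical warehouse table: the structure is fixed before any data arrives.
SCHEMA = """
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT NOT NULL,
    amount   REAL NOT NULL CHECK (typeof(amount) = 'real')
)
"""

def load_orders(rows):
    """Try to write each row; return (accepted, rejected) lists."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    accepted, rejected = [], []
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
                row,
            )
            accepted.append(row)
        except sqlite3.IntegrityError:
            # The row does not fit the predefined structure, so the write fails.
            rejected.append(row)
    conn.close()
    return accepted, rejected

rows = [
    (1, "acme", 19.99),
    (2, None, 5.00),        # missing customer violates NOT NULL
    (3, "globex", "n/a"),   # non-numeric amount violates the CHECK
]
accepted, rejected = load_orders(rows)
```

Only the first row lands in the table. The other two would need to be cleaned or remodeled upstream, which is exactly the planning cost described above.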
But like relational databases, warehousing solutions take time to design and to evolve. As a result, alternative data sources such as NoSQL (non-relational) databases, paired with data lakes, have been created to overcome such issues. This information alliance encourages a schema on read process model. This model allows data to be stored in real time in its native format and structured dynamically at retrieval, ultimately meeting today’s on-demand (i.e. new economy) requirements.
Traditionalists might not want to hear this, but well-planned (i.e. waterfall) thinking is being pushed aside. It is being pushed aside because new economy companies are now generating more diverse data than ever before, and in this data lies the actual business value: intelligence and insights. For instance, ninety percent of the world’s data has been created over the last two years alone, and the majority is not classic operational data[1].
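Schema on read can be sketched just as briefly. In the Python example below, the raw events, field names, and read_with_schema helper are all hypothetical and chosen for illustration; a real data lake would hold files in a distributed store rather than a list in memory. What matters is that records are kept exactly as they arrived, and a structure is applied only when a question is asked.

```python
import json

# Raw events stored exactly as they arrived; the fields vary per record.
RAW_EVENTS = [
    '{"user": "ann", "action": "click", "page": "/home"}',
    '{"user": "bob", "action": "purchase", "amount": 42.5}',
    '{"device": "sensor-7", "temp_c": 21.3}',
]

def read_with_schema(raw_lines, fields):
    """Project each raw record onto the fields this particular query wants."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields a record lacks come back as None; nothing was rejected at write time.
        yield {name: record.get(name) for name in fields}

# One possible "schema", decided at query time: who spent how much?
purchases = [
    row for row in read_with_schema(RAW_EVENTS, ["user", "amount"])
    if row["amount"] is not None
]
```

A different question tomorrow (say, device temperatures) needs only a different field list, not a redesign of how the data was stored.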
Data Discipline
Warehousing is normally geared towards Data Analytics, seeking predictive intelligence for optimizing business operations. In other words, people looking for information around the “known unknowns.” On the other side, data lake projects are tailored towards Data Science, seeking predictive insights for framing business strategies. That is, people looking for information around the “unknown unknowns.”
This segregation of intelligence for warehousing and insights for data lakes might be a bit too restrictive; each solution most certainly has some aspects of the other. However, this forced separation is really meant to outline the actual distinctions, their origins and intrinsic drivers.
Intrinsic Drivers
The shift from manufacturing to a service economy to the new economy has not only changed what we do, but how we do it. Through each economic phase, technology has become ever more critical to business success. Technology normally evolves to meet the demands of business (evolutionary), but sometimes businesses evolve based on technology innovation (revolutionary).
In the case of our “Warehouse vs. Data Lakes” argument, I think initially the former was true; data lakes evolved to meet the demands of business. Data lakes (in partnership with other new technologies) were a byproduct of the on-demand economy. However, these technologies have done something not many have done before: fostered a new industry. It is an industry founded on a new currency: real-time information based on predictive intelligence and insight. This technology yin and yang is self-feeding. In other words, one begets another which begets another, growing and morphing exponentially.
Where this “Big Bang of Information” explosion will take us is hard to foresee. However, in my next column (Data Lakes: Evolutionary or Revolutionary), I’ll explore this new universe and hopefully provide predictions into the future of big data management.
[1] https://en.wikipedia.org/wiki/International_Data_Corporation