Cloud data lakes are evolving rapidly. The innovative and deeply cost-effective approaches to data storage, processing, and consumption available today are ones that few could have envisioned just a few years ago. Enterprises now have the opportunity to leverage all sorts of best-of-breed processing engines, tools, and technologies to explore and analyze their data without having to move it into far more expensive data warehouses.
Data is a fundamental element of today's business strategy. Whether the goal is to gain more insight into customer behavior, discover new market opportunities, or improve operations to increase productivity and reduce costs, industry leaders want to keep moving their businesses toward more efficient ways of leveraging data.
Naturally, before an enterprise can gain value from its data, that data needs to be stored. For many years, enterprises relied on on-premises data warehouses. These traditional approaches often required extensive upfront investment: enterprises were forced to buy all the hardware they needed in advance, often resulting in expensive over-provisioning. And on-premises data warehouses, while performant, were also expensive to scale because of the tight architectural coupling of storage and compute. If an enterprise needed only additional storage capacity, it was forced to buy compute as well, and vice versa.
In traditional data warehousing, users extract data from many different databases, and that data is then normalized and prepared for query and analysis. In such configurations, data can be pulled from different sources by different business units within the enterprise, and it must then be funneled into a central repository through ETL (extract, transform, load) processes that turn it into a usable form. Along the way, data is sent to temporary databases to be blended with other data and converted into proprietary formats readable by data consumption tools. As you can imagine, this process is complex and expensive.
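To make the cost concrete, here is a minimal sketch of a single such ETL step in Python. The connection strings, table names, and columns are all hypothetical, purely for illustration:

```python
# A minimal ETL sketch: extract from one source database, transform,
# and load into a warehouse staging table. All names are illustrative.
import pandas as pd
import sqlalchemy as sa

source = sa.create_engine("postgresql://user:pass@crm-db/sales")    # one of many sources
warehouse = sa.create_engine("postgresql://user:pass@dw-host/edw")  # central repository

# Extract: pull raw rows from a business unit's database.
orders = pd.read_sql(
    "SELECT order_id, customer_id, amount, created_at FROM orders", source
)

# Transform: normalize into the shape the warehouse schema expects.
orders["created_at"] = pd.to_datetime(orders["created_at"]).dt.date
daily = orders.groupby(["customer_id", "created_at"], as_index=False)["amount"].sum()

# Load: write into a staging table. A second copy of the data now exists,
# and every new request or schema change means rerunning pipelines like this.
daily.to_sql("stg_daily_orders", warehouse, if_exists="replace", index=False)
```

Multiply this by every source system and every downstream request, and the pipeline sprawl described above follows quickly.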
In an effort to speed up performance, gain better scaling efficiency, and reduce the complexity of traditional approaches, we started to see the emergence of cloud-based data warehouses. These solutions are a definite improvement over traditional on-premises alternatives. They make it easy to start with small data volumes and grow, and they let users scale storage and compute independently within the structure of a data warehouse. However, they are still just another data warehouse solution with a closed, proprietary architecture. And while cloud data warehouses are inexpensive at small scale, many enterprises find that costs grow much faster than expected as data and usage scale, particularly given the exploding volume, variety, and velocity of data today.
Savvy enterprises quickly came to realize that the entire data warehouse detour could be avoided altogether: they could go straight to a modern, open, cloud data lake environment, one that not only gives them the freedom and flexibility to control and use their data but is also far more cost-efficient at any scale. So what does this mean? It is a big architectural shift, and it is creating significant value for data-driven enterprises because it allows data to be stored, processed, and consumed in a much more cost-efficient manner.
Efficiency and Cost Savings
When adding a new data source or system to a traditional data warehouse, the complexity and cost of the change grow in proportion to the size of the existing warehouse. In the past, the assumption was that if we needed more performance, we just needed to add more resources to our data infrastructure. That approach worked, after a fashion, for a while, but it quickly translated into large data processing bills and made the ever-growing cost of data warehousing almost impossible to afford.
Nowadays, we are seeing innovations that allow you to experience cost savings of up to 75 percent by running query engines directly on top of your data that not only accelerate your query results but also use cloud resources elastically. This means you pay only for what you use and avoid charges for both over-provisioning and idle compute.
Data Movement — A Relic from the Past
One of the most common pitfalls of traditional and cloud data warehouses alike is having to implement brittle, complex ETL processes that copy data to satisfy different requests, slowing the performance of the end-to-end data and analytics pipeline.
Data warehousing sits at the center of a complex web of data pipelines built from extraction, transformation, blending, and integration processes that can easily become endless. Because of this complexity, every new data request means generating and ingesting multiple copies of data, and if a change is requested at consumption time, the cycle repeats. These traditional approaches are highly vulnerable to change: adding a new source of data is unnecessarily complicated and unjustifiably expensive.
Cloud data lake environments solve this problem by letting you keep data in its original storage location (e.g., in S3 or ADLS), helping you save money and increase efficiency. They give you direct access to your data in one central place, making it easy to find while eliminating the need to maintain copies of data in separate warehouses and data marts.
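As a concrete illustration, here is a minimal sketch of querying Parquet files in place with DuckDB, assuming a hypothetical S3 bucket and that S3 credentials are already configured in the environment (bucket name, paths, and columns are illustrative):

```python
# A minimal sketch of querying data in place in a cloud data lake.
# Nothing is copied or ingested; the Parquet files stay in S3.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # extension for reading directly from S3
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-east-1'")  # credentials assumed configured

# Query the files where they live; any engine that reads Parquet
# could run against the same objects.
top_customers = con.execute("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM read_parquet('s3://example-lake/orders/*.parquet')
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""").fetchdf()
print(top_customers)
```

The same objects remain available, unmodified, to every other tool in the stack.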
Openness and Flexibility
Data comes in a wide variety of shapes and sizes. A major advantage of open cloud data lake environments is their flexibility in working with different data formats. With such an open approach, data can be stored in formats like JSON, ORC, or Parquet. The benefit of open file formats is that they minimize the obstacles to reusing data throughout the pipeline, unlike proprietary formats, where data can only be read by certain services or software solutions that can be prohibitively expensive and risk becoming obsolete.
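To make the point tangible, here is a minimal sketch that writes the same records to two open formats using PyArrow; any engine that understands Parquet or JSON can read the results back (file names and fields are illustrative):

```python
# A minimal sketch: the same records in two open, vendor-neutral formats.
import json
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "customer_id": [101, 102, 103],
    "amount": [49.99, 120.00, 15.50],
})

# Parquet: columnar, compressed, with a self-describing schema.
pq.write_table(table, "orders.parquet")

# JSON (newline-delimited): row-oriented and human-readable.
with open("orders.jsonl", "w") as f:
    for row in table.to_pylist():
        f.write(json.dumps(row) + "\n")

# Reading back requires no proprietary service, just a format-aware library.
print(pq.read_table("orders.parquet").to_pydict())
```

Because the formats are open, swapping one processing engine for another does not strand the data.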
Wrapping it Up!
Data warehouses still have a fundamental underlying problem. Regardless of whether they are deployed on-premises or in the cloud, they don't address the challenge of keeping up with the staggering speed and wide variety of data being created today. Their closed architectures restrict flexibility and freedom, become expensive as usage scales, and don't provide open access to data for best-of-breed processing. While data warehouses still have their place and are well optimized for certain use cases, enterprises are better served by avoiding the data warehouse detour and instead keeping the vast majority of their data in an open, cost-effective cloud data lake.
The elimination of data movement, together with the openness, flexibility, efficiency, and cost savings provided by cloud data lake environments, offers a kind of freedom that is extremely powerful. Savvy enterprise leaders are moving to these new technologies quickly to skip the complexity and overwhelming price tags associated with data warehouse-centric architectures.