Fundamentals of the Data Lakehouse

Over the last few years, a new concept in Data Architecture has emerged: the “data lakehouse.” The data lakehouse combines the best characteristics of data warehouses (large amounts of coordinated data) and data lakes (massive amounts of uncoordinated data), while providing improved controls and tools. Key technology advancements supporting the development of data lakehouses include:

Metadata layers for working with data lakes
New query engine designs for SQL searches on data lakes
Optimized access for research and machine learning tools
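
To make the first item concrete, here is a minimal sketch of what a metadata layer over data lake files can look like. It is a toy under stated assumptions, not the design of any particular product: the class name TinyTable, the _log directory, and the file naming are all hypothetical. The idea shown is that data files only become part of the table once an entry in an append-only log names them, so readers never see half-written data and can read the table as of an earlier version.

```python
import json
import os
import tempfile


class TinyTable:
    """Toy table: data files plus an append-only commit log (hypothetical design)."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Commit entries are named 00000000.json, 00000001.json, ...
        return sorted(f for f in os.listdir(self.log_dir) if f.endswith(".json"))

    def commit(self, rows):
        """Write a data file, then publish it by adding one log entry."""
        version = len(self._versions())
        data_name = f"part-{version:08d}.json"
        with open(os.path.join(self.root, data_name), "w") as fh:
            json.dump(rows, fh)
        # The data file stays invisible to readers until the log entry exists.
        tmp = os.path.join(self.log_dir, f".{version:08d}.tmp")
        with open(tmp, "w") as fh:
            json.dump({"add": data_name}, fh)
        os.replace(tmp, os.path.join(self.log_dir, f"{version:08d}.json"))
        return version

    def read(self, as_of=None):
        """Return rows from committed files only, optionally as of a version."""
        rows = []
        for name in self._versions():
            if as_of is not None and int(name[:8]) > as_of:
                break
            with open(os.path.join(self.log_dir, name)) as fh:
                entry = json.load(fh)
            with open(os.path.join(self.root, entry["add"])) as fh:
                rows.extend(json.load(fh))
        return rows
```

A file dropped into the directory without a log entry is simply ignored by read(), which is the property that lets a metadata layer offer warehouse-style consistency on top of plain files. This sketch assumes a single writer; real systems add conflict detection for concurrent commits.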

Historically, researchers have wanted to combine the efficiency offered by data warehouses with the broad range of information supported by data lakes. Merging data warehouses with data lakes to create a lakehouse results in a single system that allows researchers to move more quickly and efficiently, without the need to access multiple systems. Data lakehouses support both SQL systems and unstructured data, and can work with business intelligence tools. Modern businesses find the resulting range of data applications, including real-time monitoring, SQL analytics, and machine learning, to be quite useful.

Data warehouses were developed in the late 1980s and are a great place to store “structured data.” They are relational databases designed for queries and analysis, and normally contain historical data derived from transactional data. Data lakes, on the other hand, are nonrelational, centralized, consolidated storage spaces for raw data of all kinds: structured, semi-structured, and unstructured. Data lakes do not need a predefined schema, but as a result, their query responses are not as reliable, and they do not support ACID transactions (atomicity, consistency, isolation, durability). Data lakes emerged in the early 2010s as a way to save data that might have value, and quickly became popular for big data research.
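
The schema difference described above can be illustrated with a small sketch. It assumes an in-memory SQLite table as a stand-in for a warehouse and a plain Python list of JSON lines as a stand-in for a lake; the table name orders and all function names are hypothetical. The warehouse rejects malformed records when they are written, while the lake accepts everything and leaves interpretation to whoever reads it.

```python
import json
import sqlite3


def make_warehouse():
    """Schema on write: the table definition states what a valid row is."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY,"
        " amount REAL NOT NULL"
        " CHECK (typeof(amount) IN ('real', 'integer'))"
        ")"
    )
    return conn


def warehouse_write(conn, record):
    """Return True if the row was stored, False if the schema rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
                (record.get("order_id"), record.get("amount")),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def lake_write(lake, record):
    """No schema: the raw record is stored as-is, nothing is validated."""
    lake.append(json.dumps(record))


def lake_read_amounts(lake):
    """Schema on read: interpret raw records at query time, count the bad ones."""
    total, skipped = 0.0, 0
    for line in lake:
        amount = json.loads(line).get("amount")
        if isinstance(amount, (int, float)) and not isinstance(amount, bool):
            total += amount
        else:
            skipped += 1
    return total, skipped
```

With three records where only one has a numeric amount, the warehouse stores one row and refuses two, while the lake stores all three and the reader has to decide what to do with the two it cannot interpret. That is the reliability trade-off in miniature.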

The original use of the term “data lakehouse” is attributed to a business called Jellyvision (a Snowflake customer). Snowflake picked up on the name and promoted it in 2017 to describe its efforts to combine structured data processing with schemaless systems. AWS then started using the term to describe its “lake house architecture” data and analytics services. One of the key strengths of the data lakehouse is its structured transactional layer, which was developed by Databricks in 2019.

Early efforts to develop data lakehouses were clumsy, limited, and not terribly impressive. This is why some researchers have expressed a low opinion of the concept and have questioned the value of lakehouses. (It should be noted that early efforts and experiments often meet with criticism and naysayers, but worthwhile models typically improve with time and effort.)

Data Lakehouse Problems

Currently, there are situations when lakehouses are simply not as efficient as data warehouses, which have had years of investments, as well as real-world deployments, built into them. Additionally, researchers may prefer certain tools (IDEs, business intelligence tools), which will need to be incorporated into a new data lakehouse. Lakehouses are still in the early stages of their evolution, and these are the two basic problems they have:

The technology is still underdeveloped: More mature versions of lakehouses will include significant amounts of machine learning.

Monolithic structures: Data lakehouses control merged data lakes and data warehouses, forming massive, monolithic structures. These oversized lakehouses can become inflexible and difficult to work with.

Co-founder and CTO of Immuta, Steve Touw, describing their lakehouse platform, said,

“As organizations around the world increasingly embrace lakehouse architectures in the cloud, they are dealing with inconsistent access control policies for data security and privacy across different technologies. Faced with these new challenges, there is a critical need to provide consistent and stable cloud data access control. Our latest release offers data engineering and operations teams a single, universal access control platform to simplify and scale analytics access without compromising security or privacy control.”

Data Lakehouse Benefits

Gaining business intelligence by processing unstructured data, including video, audio, text, and images, has become a necessity for businesses. Because data warehouses are not designed for unstructured data, a number of organizations have chosen to simultaneously manage multiple systems (several data warehouses, a data lake, other specialized systems). While this tactic does resolve a number of problems, it is clumsy, inefficient, and wastes money. Also, maintaining a variety of systems can slow efforts to gain useful and timely business intelligence.

The data lakehouse is designed to reconcile structured data, stored in columns and rows, with the unstructured data typically thrown into data lakes. Ori Rafael, CEO and cofounder of Upsolver, said:

“With a lakehouse you’re getting the cost advantages of a data lake, but you’re managing to use the engines you’re already using today, providing easy access. A lakehouse is the data lake without all the limitations and the difficulty to access the data.”

Generally speaking, a single data lakehouse has several advantages over a multiple-solution system, including:

Tools have direct access to data for purposes of analysis
Administrating becomes easier and more efficient
There is less confusion about the schema and Data Governance
Less time is spent moving data around
Redundancy is reduced
Stagnation is avoided in data lakes, which can quickly become data swamps if left untended
Real-time, end-to-end streaming is supported, and can be used to refine, access, and analyze data types including video, audio, images, and text
Supports diverse workloads, including machine learning and analytics

Snowflake

Snowflake is a flexible lakehouse platform that allows traditional business intelligence tools to be used, and also supports newer, more advanced technologies, such as artificial intelligence, machine learning, and data science. The platform combines data warehouses, data lakes, and subject-specific data marts to provide accurate information, which can, in turn, support a variety of projects. The Snowflake lakehouse is an integrated platform capable of performing many functions, including:

Apps development
Rapid data access
Analytics
Data engineering
Creating AI and machine learning models

Databricks

The Databricks Lakehouse Platform provides the Data Management and performance normally offered by data warehouses, but at the low cost of data lakes. Its unified platform simplifies the architecture by eliminating data silos. Databricks developed the structured transactional layer in 2019, which provides governance, quality, structure, and performance. The Databricks lakehouse supports:

Data engineering
Business intelligence and SQL analytics
Machine learning
Real-time data applications

Amazon Redshift

The Amazon Redshift lakehouse platform supports research across data warehouses, data lakes, and operational databases. With this architecture, data can be stored in open file formats in an Amazon S3 data lake. This arrangement makes data easily accessible to machine learning and analytics tools, rather than shifting it to a silo. The Amazon Redshift lake house architecture supports:

Easy data lake queries using open formats
Familiar SQL statements that can combine and process data taken from all data stores
Executing searches on live data in the operational database without data loading and ETL pipelines
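
The idea of one SQL statement spanning several stores can be sketched in a few lines. This is a toy under loud assumptions and does not use any Amazon Redshift API: SQLite stands in for the warehouse, a CSV file stands in for an open-format file in the lake, and a temporary table stands in for an external table (a real engine would scan the file in place rather than copy rows). The names customers, lake_orders, attach_lake_file, and revenue_by_region are hypothetical.

```python
import csv
import io
import sqlite3


def attach_lake_file(conn, lake_file):
    """Expose an open-format file to SQL (temp table as an external-table stand-in)."""
    conn.execute("CREATE TEMP TABLE lake_orders (customer_id INTEGER, amount REAL)")
    rows = [
        (int(r["customer_id"]), float(r["amount"]))
        for r in csv.DictReader(lake_file)
    ]
    conn.executemany("INSERT INTO lake_orders VALUES (?, ?)", rows)


def revenue_by_region(conn):
    """One ordinary SQL join across warehouse data and lake data."""
    sql = (
        "SELECT c.region, SUM(o.amount) "
        "FROM customers AS c "
        "JOIN lake_orders AS o ON o.customer_id = c.customer_id "
        "GROUP BY c.region ORDER BY c.region"
    )
    return conn.execute(sql).fetchall()
```

The point of the sketch is the shape of the query: the analyst writes a familiar join and does not need to know which side lives in the warehouse and which side lives in files.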

The Future of Data Lakehouses

Data lakehouse architecture offers the ability to manage data in an open environment, while blending a variety of data formats from all parts of a business. While reviews of its earliest versions may communicate doubts about its efficiency, it seems to be gaining popularity as a more efficient way to store and process large volumes of unstructured, structured, and semi-structured data. There are clear performance and efficiency advantages in using data lakehouses, and, predictably, these will continue to evolve as the system advances, and new apps and tools are developed.

Juan Harrington at Omnitech recently wrote:

“The Lakehouse is a new architectural approach to solving some of today’s problems of analytics and machine learning at a large scale. Although it is still in its infancy, the Lakehouse will continue to evolve and mature.”

