Click to learn more about author Joan Fabregat-Serra.
It has been ten years since Pentaho Chief Technology Officer James Dixon coined the term “data lake.” Since then, both the term and the technology behind it have thrived.
While data warehouse (DWH) systems have existed longer and are more widely recognized, the data industry has embraced the more recent repository, the data lake, especially after the growth of big data, the shift towards cloud storage, and the adoption of artificial intelligence (AI) technologies.
One can argue that the advantages of data lakes include:
- Faster Access: Data lakes are readily accessible to users, which allows them to run real-time analytics.
- Adaptability: Data lakes can store small-scale or gigantic volumes of data (even petabytes).
- Flexibility: Data lakes are capable of working with various data types and data sources.
- Cost-Effectiveness: Cloud data lakes are more affordable than on-premises data lakes.
The appeal and novel capabilities of data lakes posed a huge threat to traditional data warehousing systems. The main drawbacks of DWHs are the high costs of rigid internal structures that cannot adapt to an evolving data environment, and the time it takes to design and build out complex data stores.
Nonetheless, DWH solutions have adapted and stayed competitive by offering cost-effective cloud storage options and by making their interfaces and features simpler and easier to understand. Moreover, demand for DWHs remains high, with benefits that include:
- Efficiency: DWH data is structured and can be retrieved within milliseconds.
- Trending Analysis: Because DWH is designed for query and analysis, it contains historical data that allows users to answer a set of predefined questions over time.
- Governance: Many DWH systems follow a methodology (such as Kimball or Inmon) based on internal data standards and policies, which helps data users agree on rules, standards, and interpretations.
While it is true that the data lake paradigm suits AI needs well when facing big data problems, many analytical or business users are better served with structured data. Therefore, hybrid solutions combining both structured and semi-structured data systems are increasing in popularity.
Nowadays, DWHs and data lakes are well-recognized storage repositories in the data industry. Depending on the business use, data lakes and DWHs can serve different purposes and offer different advantages.
However, both storage systems still have one common unsolved issue: Data Quality. The famous 80/20 Data Science dilemma, where 80 percent of time is spent cleaning data and 20 percent analyzing it, still holds true regardless of your data storage choice.
The main difference regarding Data Quality is when cleansing happens: in a data lake, data is cleansed after it is loaded, while in a DWH, Data Quality processes run before the load. The work moves to a different point in the pipeline rather than going away, so the time spent improving Data Quality is similar in both scenarios.
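To make the timing difference concrete, here is a minimal Python sketch. It is an illustration only: the `is_valid` rule, the record fields (`id`, `amount`), and the in-memory lists standing in for a warehouse and a lake are all assumptions, not part of any particular product.

```python
from typing import Any

Record = dict[str, Any]


def is_valid(record: Record) -> bool:
    """Hypothetical quality rule: id is present and amount is a non-negative number."""
    amount = record.get("amount")
    return (
        record.get("id") is not None
        and isinstance(amount, (int, float))
        and not isinstance(amount, bool)
        and amount >= 0
    )


def load_into_warehouse(records: list[Record]) -> tuple[list[Record], list[Record]]:
    """DWH style: validate before the load, so only clean rows are stored."""
    loaded = [r for r in records if is_valid(r)]
    rejected = [r for r in records if not is_valid(r)]
    return loaded, rejected


def load_into_lake(records: list[Record]) -> list[Record]:
    """Lake style: store every record as delivered, without validation."""
    return list(records)


def read_from_lake(lake: list[Record]) -> list[Record]:
    """Lake style: cleansing happens after the load, when the data is read."""
    return [r for r in lake if is_valid(r)]


raw = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},   # missing id
    {"id": 3, "amount": -2.0},     # negative amount
    {"id": 4, "amount": 7},
]

warehouse, rejected = load_into_warehouse(raw)
lake = load_into_lake(raw)
clean_view = read_from_lake(lake)
```

In both paths the same rule runs over the same four records; only the point at which it runs changes, which is why the overall cleaning effort ends up comparable.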
Platforms have been created to address this common Data Quality problem that consumes large amounts of engineering hours across data teams.
Data Quality is very important for both types of data storage systems:
- Data Quality in Data Lakes: This can be achieved by applying quality rules based on GDPR or other data-related laws to prevent “dirty” data values from feeding into AI models, or by enforcing data delivery SLAs from data providers.
- Data Quality in DWH: To make DWH integrations faster, it is crucial to be able to add layers of quality in a matter of minutes, speeding up integration processes and Data Quality insights.
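As a rough sketch of what such rules might look like in practice, the Python below defines a small set of named checks and reports the pass rate of each over a batch. Everything here is hypothetical: the `Rule` class, the `consent` and `delivered_at` fields, and the 06:00 UTC delivery deadline are assumptions chosen for illustration, and a real compliance or SLA rule would be considerably more involved.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable

Record = dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """A named quality check (hypothetical structure, for illustration)."""
    name: str
    check: Callable[[Record], bool]


def has_consent(record: Record) -> bool:
    """Assumed privacy-style rule: the record carries an explicit consent flag."""
    return record.get("consent") is True


def make_sla_rule(deadline: datetime) -> Rule:
    """Assumed delivery SLA: the record must arrive on or before the deadline."""
    def on_time(record: Record) -> bool:
        delivered = record.get("delivered_at")
        return isinstance(delivered, datetime) and delivered <= deadline
    return Rule("delivered_on_time", on_time)


def quality_report(records: list[Record], rules: list[Rule]) -> dict[str, float]:
    """Return the fraction of records passing each rule."""
    if not records:
        # An empty batch vacuously passes every rule.
        return {rule.name: 1.0 for rule in rules}
    return {
        rule.name: sum(rule.check(r) for r in records) / len(records)
        for rule in rules
    }


utc = timezone.utc
deadline = datetime(2024, 1, 1, 6, 0, tzinfo=utc)
rules = [Rule("consent_recorded", has_consent), make_sla_rule(deadline)]

batch = [
    {"consent": True, "delivered_at": datetime(2024, 1, 1, 5, 0, tzinfo=utc)},
    {"consent": False, "delivered_at": datetime(2024, 1, 1, 5, 30, tzinfo=utc)},
    {"consent": True, "delivered_at": datetime(2024, 1, 1, 7, 0, tzinfo=utc)},
    {"consent": True, "delivered_at": datetime(2024, 1, 1, 4, 0, tzinfo=utc)},
]

report = quality_report(batch, rules)
```

Because the rules are plain functions over records, the same list could in principle be run against a batch read from a data lake or one staged for a DWH load, and adding a new layer of quality amounts to appending one more `Rule`.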
Many organizations are adopting a hybrid storage solution, which makes it more important than ever to have a consistent view of Data Quality across all storage systems. Thus, implementing Data Quality tools that can work in hybrid scenarios is crucial to optimize data systems, empower data teams and business units, and hopefully reverse the 80/20 rule to 80 percent analyzing and 20 percent (or less) cleaning.