Data is a collection of facts from which inferences can be drawn. It is the basis from which factual information is derived and meaningful results are delivered to end users. Data is the cornerstone of contemporary society and touches many facets of people's lives: facts, numbers, statistics, and other pieces of information are gathered and examined to gain knowledge and make informed decisions. Data is crucial in various industries, including business, healthcare, education, and government.
The era of Data Science continues to reshape how people live and work. According to Statista, global data creation increased dramatically between 2010 and 2022, and its forecasts project that the volume of data created worldwide will exceed 180 zettabytes by 2025.
The world of data involves several distinct concepts, such as data provisioning, data warehouses, and data lakes. In this article, we will examine their theoretical and practical implications.
Data Warehouse and Data Lake
A warehouse is a repository where data is stored. A data warehouse is a specific kind of data management system created to facilitate business intelligence tasks, primarily analytics. These systems are focused on enabling queries and analysis and typically store significant amounts of historical data. The data stored in a data warehouse is usually gathered from diverse sources, such as application logs and transactional systems.
A data lake is a centralized repository that lets you store all your structured and unstructured data at any scale.
The difference between a data lake and a data warehouse is that a data lake stores an organization's data in an unrefined or unstructured form, where it can be retained indefinitely for present or future use. In contrast, a data warehouse holds data that has been refined and structured, ready for strategic analysis based on pre-determined business requirements.
Data scientists and engineers usually use unstructured data from a data lake in its raw format to obtain fresh and distinct business insights. In contrast, managers and business end users usually access data from a data warehouse, which has already been structured to provide answers to pre-determined queries for analysis and to gain insights into business KPIs (key performance indicators).
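To make the contrast concrete, here is a minimal sketch in Python, assuming a hypothetical local setup: a data scientist imposes structure on raw lake data at query time (schema-on-read), while a business user queries a warehouse table whose structure was fixed in advance (schema-on-write). The event strings, table, and column names are illustrative, not drawn from any particular system.

```python
import json
import sqlite3

# Lake-style access: raw, schema-on-read. The analyst imposes structure
# at query time. These JSON strings stand in for raw files in a data lake.
raw_events = ['{"user": "a", "amount": 12.5}', '{"user": "b", "amount": 3.0}']
total = sum(json.loads(e)["amount"] for e in raw_events)
print("insight derived from raw data:", total)

# Warehouse-style access: schema-on-write. Structure was imposed before
# the business user ever runs a query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (user TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("a", 12.5), ("b", 3.0)])
print("KPI from warehouse:",
      conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0])
```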
Data Modeling
Data modeling involves developing a conceptual representation of data entities and their interrelationships. This process typically comprises various stages, such as collecting requirements, conceptualizing, designing logically and physically, and implementing. At each stage, data modelers collaborate with stakeholders to comprehend data requirements, identify entities, establish connections between data entities, and create a model that precisely represents the data and that software developers and database administrators can build upon.
Levels of Abstraction in Data Modeling
Data abstraction condenses a body of data into a simplified representation. There are three levels of abstraction in data modeling:
- Conceptual level
- Logical level
- Physical level
Conceptual Level: The conceptual level of data modeling is the highest level of abstraction. At this level, the focus is on understanding the business requirements and how stakeholders will use data.
Logical Level: The logical level of data modeling focuses on transforming the conceptual data model into a more detailed representation that can be implemented in a database management system (DBMS).
Physical Level: The physical level of data modeling is the lowest level of abstraction. At this level, the focus is on physically implementing the logical data model using a specific DBMS. The physical data model defines the database schema, including tables, columns, data types, indexes, and other physical storage details.
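To ground the physical level, here is a minimal sketch, assuming SQLite and an illustrative customer/orders model; the tables, columns, and index are hypothetical choices, not a prescribed schema.

```python
import sqlite3

# A minimal physical implementation of a hypothetical "customer orders"
# logical model. Entities become tables; attributes become typed columns.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT UNIQUE
)
""")
cur.execute("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total       REAL,
    created_at  TEXT
)
""")

# Physical storage details such as indexes exist only at this level;
# the conceptual and logical models know nothing about them.
cur.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
conn.commit()
```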
Major Data Zones
In a technical sense, ingestion and curation correspond to two of the major data zones in a data lake. Data zones operate as "stage gates," each gate serving a specific purpose. A defining characteristic of these gates is that they do not overlap: each consumption pattern is served from its own zone rather than all patterns coinciding throughout the process.
In a data lake architecture, a landing zone and a processing zone are two distinct areas used for multiple functions.
- Landing Zone: A landing zone is the initial storage area in a data lake where raw data is ingested and stored. It is the first stop for data extracted from various sources, and that data is often unstructured or semi-structured. The purpose of the landing zone is to store data as quickly as possible without imposing any structure, formatting, or data quality checks.
- Processing Zone: A processing zone is an area in a data lake where data is processed, transformed, and refined into a more structured format that end-users or downstream systems can analyze. This zone is where data is cleaned, standardized, and enriched with additional metadata or context before being made available.
Overall, the landing zone is used for rapidly ingesting raw data, while the processing zone is used for refining and preparing data for downstream consumption.
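Here is a minimal sketch of the two zones, assuming a hypothetical local directory layout (`lake/landing`, `lake/processing`); real data lakes typically use object storage such as S3 rather than local folders, and the event fields are illustrative.

```python
import json
from pathlib import Path

LANDING = Path("lake/landing")        # illustrative zone locations
PROCESSING = Path("lake/processing")
LANDING.mkdir(parents=True, exist_ok=True)
PROCESSING.mkdir(parents=True, exist_ok=True)

# Landing zone: raw events are written as-is, no structure imposed.
(LANDING / "events.jsonl").write_text(
    '{"user": "a", "amount": "12.5"}\n{"user": "b"}\n'
)

# Processing zone: the same data, cleaned and given a fixed schema.
cleaned = []
for line in (LANDING / "events.jsonl").read_text().splitlines():
    record = json.loads(line)
    cleaned.append({
        "user": record.get("user"),
        "amount": float(record.get("amount", 0.0)),  # normalize types
    })
(PROCESSING / "events.json").write_text(json.dumps(cleaned, indent=2))
```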
Data Provisioning: Ingest and Curate
Data provisioning is the process of moving or obtaining data from a source system to a target system without going through the data warehouse.
Recent technological advancements in data provisioning combine data-management best practices with feasibility and productivity. This ensures that high-utility data reaches the right people at the right time while remaining compliant with legal and other obligations.
Ingestion and curation are two vital elements that contribute to data provisioning. Ingest, in a literal sense, means "consume," and curation relates to organizing, administering, and maintaining.
When it comes to data provisioning, data ingestion is the act of bringing large and varied data files from multiple sources into a single cloud-based storage location, such as a data warehouse, data mart, or database, for analysis.
Data curation, on the other hand, is the process of creating, organizing, and managing data sets so that people who are searching for information may access and use them.
Data must be collected, indexed, and cataloged for users within an organization, a group, or the wider public. Data can be ingested and curated to support corporate decisions, academic requirements, scientific research, and other demands.
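As a toy illustration of indexing and cataloging, the sketch below builds a small in-memory catalog of data sets with searchable metadata; the field names are illustrative assumptions, not a standard.

```python
# A toy data catalog: each entry carries the metadata that makes a
# curated data set findable. Field names are illustrative.
catalog = [
    {"name": "sales_2023", "tags": ["finance", "quarterly"], "owner": "bi-team"},
    {"name": "sensor_raw", "tags": ["iot", "landing"], "owner": "data-eng"},
]

def search(entries, tag):
    """Return the name of every data set indexed under the given tag."""
    return [entry["name"] for entry in entries if tag in entry["tags"]]

print(search(catalog, "iot"))  # ['sensor_raw']
```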
Data Lake Processing Framework: Ingest, Curate, and Consume
The data lake processing framework is a standardized structure describing how a data lake takes in data and brings it to a mature state, then publishes it so that applications can use it.
In a data lake processing framework, “ingest” and “curate” are two critical stages that collect and prepare raw data for analysis.
Ingest:
- During this stage, data is typically collected from different sources such as databases, file systems, streaming data sources, social media, IoT devices, etc., and loaded into the data lake.
- There are two types of ingestion processes: batch and real-time. In batch ingestion, data is collected at regular intervals and loaded into the data lake in bulk.
- In real-time ingestion, data is collected continuously and loaded into the data lake as it arrives. The main goal of the ingestion process is to ensure that all the data is collected and stored in a scalable way; ingestion also involves validation and verification to preserve data integrity. A minimal sketch of both modes follows this list.
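The sketch below illustrates the two ingestion modes, assuming a hypothetical local-filesystem lake under `lake/raw`; in practice the sink would be object storage and the stream would come from a broker such as Kafka. All names, paths, and the validation rule are illustrative.

```python
import json
import time
from pathlib import Path

LAKE = Path("lake/raw")  # illustrative stand-in for object storage
LAKE.mkdir(parents=True, exist_ok=True)

def ingest_batch(source_files):
    """Batch mode: copy whatever accumulated since the last scheduled run."""
    for src in source_files:
        (LAKE / src.name).write_bytes(src.read_bytes())

def ingest_stream(event_source):
    """Real-time mode: append each event to the lake as soon as it arrives."""
    with open(LAKE / "events.jsonl", "a", encoding="utf-8") as sink:
        for event in event_source:
            if "sensor" not in event:   # basic validation on the way in
                continue
            sink.write(json.dumps(event) + "\n")

# Simulated event source standing in for a Kafka topic or an IoT feed.
events = ({"sensor": i, "ts": time.time()} for i in range(3))
ingest_stream(events)
```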
Curate:
- Once the data is ingested into the data lake, it must be curated or prepared for analysis. Curation involves several activities, including cleaning and transforming the data.
- Cleaning involves removing any irrelevant or duplicate data, correcting inconsistencies, and identifying missing data. Transforming data entails putting it into a common format or structure so that it can be quickly queried and analyzed.
- The curation process also involves applying security and governance policies to the data to ensure it is protected and compliant with regulatory requirements.
- All in all, the ingest and curate stages are essential components of the data lake processing framework: they govern how data is collected, stored, and prepared without compromising scalability. A small curation sketch follows this list.
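As a small illustration of cleaning and transformation, the pandas sketch below deduplicates records, standardizes a key column, and drops rows missing that key; the column names and the lineage tag are illustrative assumptions, not a fixed curation recipe.

```python
import pandas as pd

# Raw ingested records: duplicates, inconsistent casing, a missing value.
raw = pd.DataFrame({
    "country": ["US", "us", "DE", "DE", None],
    "amount":  [10.0, 10.0, 5.0, 5.0, 7.5],
})

curated = (
    raw
    .assign(country=raw["country"].str.upper())  # standardize values
    .drop_duplicates()                           # remove duplicate rows
    .dropna(subset=["country"])                  # drop rows missing the key
    .reset_index(drop=True)
)

# Attach simple lineage metadata as a nod to governance (illustrative).
curated.attrs["source"] = "lake/landing/events.jsonl"
print(curated)
```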
Consume:
- The processed data can be consumed by various applications, tools, or users. This can include generating reports, creating visualizations, feeding machine learning models, or integrating with business intelligence tools (a minimal example follows this list).
- In a broader sense, consumption also covers any way end users experience, engage with, or interact with the published content, media, or information.
- The processed data can also be stored in different formats or pushed to downstream systems for further analysis or consumption.
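A minimal consumption sketch, assuming curated output with illustrative column names; in practice the frame would be read from the curated zone (for example with `pd.read_parquet`) rather than constructed inline.

```python
import pandas as pd

# Stand-in for data loaded from the curated zone, e.g.
# pd.read_parquet("lake/curated/orders.parquet") in a real pipeline.
curated = pd.DataFrame({
    "country": ["US", "DE", "US"],
    "amount":  [10.0, 5.0, 7.5],
})

# A simple KPI-style report: total amount per country.
report = (
    curated.groupby("country", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total_amount"})
)
print(report)
```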
Data lake processing frameworks often provide features for data governance and security. This includes managing access controls, enforcing data privacy regulations, auditing data access and usage, and ensuring data quality and lineage.
Such frameworks are designed to consume and process data stored in a data lake efficiently. They offer tools and features for processing, converting, and examining significant amounts of structured and unstructured data.
Conclusion
Data provisioning, ingestion, and curation are critical phases in the data management process that ensure high-quality data is available for analysis and decision-making. Data provisioning involves identifying and accessing data sources; data ingestion involves collecting and processing data from those sources; and data curation ensures that data is properly organized and cleaned. By following best practices across all three, organizations can ensure that their data is reliable and accurate and that it effectively supports their business objectives.
References
Batch Processing vs. Real-Time Data Streams. (n.d.). Confluent.
Data ingestion. (2021). Cognizant Glossary.
Data lake architecture: Zones explained. (2023). Capital One.
Pratt, M. K. (2019). Data Curation. TechTarget.
Volume of data/information created, captured, copied and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025. (2022). Statista.
What is a data lake? (2022). Amazon Web Services.
What is Data Modelling? Overview, Basic Concepts, and Types in Detail. (2023). Simplilearn.