Extract, Transform, and Load (ETL) describes the process of integrating raw data from various data sources into a repository such as a data warehouse, with the main purpose of maintaining Data Quality and trust. ETL requires three operations, as described below by Paul Varley:
Extract: “Getting a copy of data from a source, which could be an application, database, or text file.”
Transform: “Translates the source data to match the format of the target system. This includes changing data types, combining, or splitting fields, and applying more complex formulas.”
Load: “Completes the process by putting the transformed data in the target system.”
While ETL technologies have been popular since the 1970s, ETL as a strategy plays an important role in modern times with new technologies.
Beware of introducing unintended semantic drifts when employing ETL. As data is transformed, the meaning and the context of that data changes. This could cause issues in understanding the original purpose of the data or prevent staying current with new technologies or contexts.
Other Definitions of ETL Include:
“A process of connecting to data sources, integrating data from various data sources, improving Data Quality, aggregating it and then storing it in staging data sources or data marts or data warehouses for consumption of various business applications including BI, analytics, and reporting.” (Kartik Patel)
A means to “load data from different systems into a data warehouse for reporting and data analytics.” (Paul Varley)
“A technique in managing the movement and consolidation of data within and between applications and organizations.” (Data Management Body of Knowledge)
“The process of making inaccessible data available by extracting data from multiple sources and making it usable for cleansing, transformation, and finally, business insight.” (Talend)
A service that “simplifies the process for loading data.” (TechRepublic)
“A type of data integration that refers to three steps used to blend data from multiple sources.” (SAS)
Businesses Use ETL to:
Provide quality data sets for machine learning
Read (extract) data from a database, pull (extract) that data out, write (transform) the data into a target database. Especially for data warehouse applications
Merge data systems from different departments or companies in a single, reliable repository (e.g. for Enterprise Information Management)
Move data behind a firewall in batches, which may be useful for edge computing and the Internet of Things (IoT).