Creating a collaborative, data-driven culture is one of the most important goals of many modern organizations. However, access to data and data processing tools often remains restricted to a select few technical users or the upper echelons of management.
A data-driven culture cannot exist without the democratization of data. Data democratization certainly does not mean unrestricted access to all organizational data. The aim is that the data an employee needs to make an effective business decision should be available to them in a usable format, fast enough to act on, and without requiring them to be a technical data expert to make sense of it.
Can Data Be Democratized by Ensuring Access to It?
On average, data scientists spend almost 80% of their time on data preparation. Is that a good use of their time?
Despite spending most of their time on data preparation, most data scientists also report that these tasks are the least enjoyable part of their job. This observation comes from one of the most technically advanced communities of data users. Understandably, people want insights from the data, but having to spend so much time on data preparation itself hinders the propagation of a data-driven culture.
Can Centralized Data Engineering Services Help in Data Democratization?
To overcome this issue, organizations have traditionally tasked a centralized data engineering team with creating an enterprise-level data warehouse or data lake. Analysts can then plug into this central data store to get insights. However, this model of data delivery is currently under strain. Let me paint a picture for you:
Imagine that an analyst needs one additional metric to complete their analysis, but that metric requires additional data to be processed into the data warehouse.
To get this serviced by the centralized data engineering team, they must wait their turn in the queue, even if it is a minor change they could have written themselves.
There are three main issues with the centralized data engineering services:
- The tailor-made design and delivery model is not scalable.
- When a centralized team must write business logic within data pipelines, the ownership of the business logic gets diluted.
- Because data pipelines must both extract data from the source and transform it according to business rules, they are highly susceptible to changes at either the source or the target end. This tight coupling between source and target pushes up the pipelines’ maintenance cost.
How Can the Data Engineering Teams Achieve Scale?
Interestingly, the solution to the scalability problem of the tailor-made delivery model comes from the same industry the tailoring analogy borrows from, fashion: readymade garments.
Fashion companies produce readymade garments in a range of standard fits and sizes. If a customer still needs minor customization, she can get it done at a fraction of the cost and time of a bespoke garment made from scratch.
Building on a similar analogy, data engineering teams can achieve scale by centrally managing the heavy technical lifting while establishing the technology, process, and people practices that let business users make minor customizations themselves in a self-service model. Let us look at the components that can be delivered centrally as a managed service:
- Platform-as-a-Service (PaaS/SaaS) – The infrastructure for creating data pipelines and performing analytics is centrally managed, ideally including the tools and applications as well. The platform should be highly scalable and preferably cloud-based.
- Data-as-a-Service (DaaS) – Data procurement, data quality, and performance are managed centrally by building a central data lakehouse as a product.
- Data Governance – Data governance takes a much more central role in the decentralized delivery model because every governance lapse is amplified by the number of users who can directly manipulate the data. The data engineering team needs a renewed focus on quality, access, lineage, cataloging, and so on before access to the data can be opened up to the whole organization.
The only component we plan to decentralize is the business-logic portion of the data pipelines. This domain-specific business logic should be handed over to the technical users closest to the domain, such as data analysts and data scientists.
One way to promote a data-driven culture is a hybrid data delivery model. It furthers data democratization by offering self-service over the business logic while the heavy lifting of data preparation remains centrally managed.
The modern data stack is the driving force behind this hybrid data-delivery model. The most significant change in the modern stack, and the essential message of this blog post, is the separation of the data pipelines’ replication and transformation layers. Automated E/L tools are a class of tools with source-aware connectors for the various popular SaaS applications. They understand the source’s metadata and replicate data incrementally and in near-real-time, all out of the box and without requiring any logic to be written. These features make replication pipelines quick, scalable, and low-maintenance.
Automated E/L tools replicate the source data to a centralized storage layer, which could be a data warehouse, a data lake, or a data lakehouse. This data is now available to the analyst community, and this is where the central data team’s managed services end.
Alongside automated E/L tools, DBT is the second crucial technology enabling the hybrid data delivery model. Data analysts can use its self-service transformation capabilities to apply business logic to a copy of the centrally available data, and DBT writes the transformed, augmented data back to the central data store.
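As a rough sketch of what this self-service looks like, assume the E/L tool has already landed a raw `orders` table in the warehouse. The source name, columns, and business rule below are all illustrative, and the source would need to be declared in the project’s sources YAML:

```sql
-- models/marts/fct_orders.sql
-- Hypothetical DBT model: the business rule lives here, owned by analysts,
-- while the query itself runs inside the warehouse.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_date,
    -- business rule: net revenue after discounts and refunds
    gross_amount - discount_amount - coalesce(refund_amount, 0) as net_revenue
from {{ source('raw_shop', 'orders') }}
where order_status != 'cancelled'
```

Running `dbt run` compiles this model into plain SQL, executes it in the warehouse, and materializes the result back into the central store for downstream consumers.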
Data science and reporting teams can now leverage this golden data, or it can be written back to the transactional apps using reverse ETL tools.
Why Is the Separation of EL and T Such a Paradigm Shift in Data Engineering?
- Decoupling – Decoupling the upstream and downstream ends means changes on one side no longer ripple through the entire pipeline, which brings maintenance costs down significantly.
- Unboxing the black box – Decentralizing the business logic improves both the speed and the governance of that logic. Analysts can now self-service the business logic while also having visibility into why the data looks the way it does.
- Intelligent integration – As we standardize the EL part of the pipelines to do just replication, we also enable this section to become much more intelligent and automated compared to the custom-made pipelines.
What Are the Most Significant Features of the Automated EL Tools?
- Hundreds of native source connectors – These tools come packaged with hundreds of highly source-aware connectors that understand which data objects to expect in the source, along with its metadata and data model. This intelligence saves considerable developer effort while improving the quality of the pipeline.
- Automated metadata sync – The tool’s awareness of metadata helps it sync not only the data but also the metadata. For example, if the source introduces a new column or changes a column name or data type, no developer involvement is needed to replicate these changes in the target; the tool detects the metadata change and starts syncing the new columns and data automatically. This frees up developer time and makes the changes available to the end-user in minutes rather than weeks.
- Incremental data refresh – The tool automatically takes care of incremental refreshes without needing embedded logic for change data capture (CDC); a sketch of the hand-written load this replaces follows this list. This keeps the pipelines lightweight, reducing network costs while improving refresh times.
- Near-real-time sync – With the help of the above features, these tools can replicate data in near-real-time. Better refresh speeds improve analytics and enhance the user’s trust in data.
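For a sense of what the incremental-refresh feature replaces, here is a rough sketch of the kind of hand-written merge a developer would otherwise have to maintain for every table. All table and column names are illustrative, and MERGE syntax varies slightly across warehouses:

```sql
-- Illustrative hand-rolled incremental load that automated E/L tools make unnecessary.
-- Assumes the source extract exposes an updated_at column for change detection.
merge into warehouse.orders as tgt
using (
    select *
    from staging.orders_extract
    where updated_at > (
        select coalesce(max(updated_at), timestamp '1900-01-01 00:00:00')
        from warehouse.orders
    )
) as src
on tgt.order_id = src.order_id
when matched then update set
    order_status = src.order_status,
    order_amount = src.order_amount,
    updated_at   = src.updated_at
when not matched then insert (order_id, order_status, order_amount, updated_at)
    values (src.order_id, src.order_status, src.order_amount, src.updated_at);
```

Multiply this by hundreds of tables, plus the schema-change handling it does not even cover, and the value of having the tool do it automatically becomes clear.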
How Can DBT Help in Decentralizing Data Transformation?
The second crucial leg of the EL-T enablement is the decentralization of the transformation portion of the data pipelines. Transformations can be decentralized now because of the advent of the open-source tool called Data Build Tool or, more popularly, DBT. DBT can be thought of as SQL on steroids: it can do everything SQL can do, and then some. It has the following distinct features:
- SQL-like coding language – SQL can be considered the English of the data world. Since DBT uses a SQL-based language for coding, it considerably reduces the learning curve for new users.
- Reuse the processing power of your warehouse – DBT acts only as an abstraction layer over the data warehouse. It provides a window for developers to write their code while the actual data processing happens in the warehouse itself. This optimizes resources and cost, unlike traditional ETL tools, which need separate processing resources.
- Software engineering best practices – DBT brings the best practices of software engineering to data engineering. Collaborative coding practices such as inline documentation, version control, macros, and CI/CD are available in DBT, so SQL code can be modularized, reused, and kept easy to maintain.
- Automated online catalog and lineage – DBT can generate a lineage graph and data catalog automatically, without a third-party tool or manual effort (see the sketch after this list). This feature becomes vital for data governance as we pursue widespread data democratization.
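The last feature deserves a concrete illustration. Continuing the hypothetical model from the earlier sketch, a downstream model references it with `ref()` rather than a hard-coded table name:

```sql
-- models/marts/fct_daily_revenue.sql
-- Hypothetical downstream model: the ref() call is what lets DBT infer
-- dependencies and draw the lineage graph without any manual effort.
select
    order_date,
    sum(net_revenue) as daily_revenue
from {{ ref('fct_orders') }}
group by order_date
```

`dbt docs generate` then builds a browsable catalog and lineage graph from these references and from the descriptions in the project’s YAML files, and `dbt docs serve` hosts it for the team.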
Balancing Act
While the modern data stack provides many benefits over the traditional architecture, the whole ecosystem is still in its nascent stage.
Hence, I suggest performing a balancing act here: be cautious before introducing these technologies into mission-critical applications. My advice is to follow a crawl-walk-run strategy as we move towards modernizing the tech stack.
What is more crucial to imbibe is the concept itself: decoupling the EL and T components of the data pipelines. For example, data engineers can use their existing ETL tools for just the EL part of the pipelines while opening up the transformation component to end-users through database-native technologies such as materialized views, or by exposing the logic in the reporting tool itself, as sketched below. Data teams can perform this decoupling with their existing tools, without worrying about the latest tool on the market.
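As a minimal sketch of that decoupling using nothing but warehouse-native features, the existing ETL tool would only land raw data, while the business rule lives in a view the analysts own. Object names are illustrative, and materialized-view support and syntax vary by database:

```sql
-- The existing ETL tool only replicates raw data into warehouse.orders;
-- the business logic sits in a warehouse-native view that analysts can read and change.
create materialized view analytics.active_customer_revenue as
select
    customer_id,
    sum(gross_amount - discount_amount) as total_revenue
from warehouse.orders
where order_status != 'cancelled'
group by customer_id;
```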
Conclusion
If there can be only one takeaway from this blog, it ought to be this: Separate EL and T components of the data pipelines to achieve scale in data engineering, leading to far better data democratization in the organization.
The modern data stack makes this end-user enablement a reality. Data engineering teams should embrace these technologies and delivery models to help establish a data-driven culture in their organizations.