One of the most fascinating things I’ve found at my current organization is undoubtedly the declarative approach. It is something so simple and so powerful. In my first business intelligence endeavors, there were data normalization issues; in my Data Governance period, Data Quality and proactive Metadata Management were the critical points. But today’s organizations are so complex that they need much more than this.
This article is a result of my desire to get things right. The best thing about my job is the learning that different clients and projects bring me every day. But let’s go in order: What is a declarative approach? Going back to my old programming notes, the declarative approach is defined in this way: “Programs can be classified according to the paradigm of the language used to produce them. The main paradigms are imperative, declarative, and object-oriented.”
Without going any further, SQL is declarative. Launching a query means I want a specific result – I don’t care how the engine obtains it internally, since programs written in a declarative language specify the properties the output should satisfy without specifying any implementation details.
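To make the contrast concrete, here is a minimal sketch in Python (the sales table and its columns are invented for illustration): the imperative version spells out every step of the computation, while the SQL version only declares the result we want and leaves the “how” to the engine.

```python
import sqlite3

# Imperative style: spell out *how* to compute the result, step by step.
def total_per_region(rows):
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals

# Declarative style (SQL): state *what* result is wanted; the engine
# chooses the access path, the grouping strategy, and the execution order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EMEA", 100.0), ("EMEA", 50.0), ("APAC", 75.0)])
print(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())
```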
The most typical example is that of paella. If I go to eat paella at a restaurant, I simply ask for “a paella.” I do not tell the waiter that I want them to fry chicken and rabbit with a little garlic, oil, and rosemary, then add the vegetables, the tomato, and the bell pepper, pour in the broth, and finally add the rice …
We are not experts at everything (although my paella, thanks to my years living in Valencia, is not bad at all for an Italian), and when we don’t know how to do something, we delegate the job to someone who can do it faster and better (especially if it’s paella). The declarative approach does exactly this: I connect all the metadata with their respective tables; the engine reads the transformation processes and returns the expected result. Adding a “D” to “ELT” makes a lot of sense.
A Lot of Data …
But it’s data – lots of data – structured and unstructured, in many external sources, historical or recent. Data that we have to govern and whose quality we have to guarantee. Data that we have to rectify, reconcile, and keep historized. Data that we have to “document” because the regulator requires it of us: we must have full traceability, and the process has to be repeatable. Until yesterday, all of this was very complex; this data had to be cleaned up, adjusted, extracted, and copied, and different approaches emerged to do it. The oldest and most widely used was ETL.
Extract-Transform-Load (ETL)
The world is full of articles about the inefficiency of ETL processes, but ETL is not dead. In a recent Gartner webinar on data integration, this was made clear. But you don’t have to be Gartner to understand that ETL processes are still everywhere. Just to introduce the topic, ETL stands for Extract-Transform-Load, and the term describes the process of moving data around and manipulating it. ETL tools typically can connect to multiple data platforms and apply transformations to data in memory; afterward, presumably, the result is written somewhere. ETL logic can also be written in most modern programming languages, but many organizations find this option less appealing due to the overhead of writing code from scratch and the unknown complexity of supporting it afterward.
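As a point of reference, here is a minimal hand-rolled ETL pass in Python (the source file, table, and column names are invented for illustration): the data is transformed in memory before it ever reaches the target.

```python
import csv
import sqlite3

def extract(path):
    # Extract: read the raw rows from a source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean and reshape the data in memory, before loading.
    return [(r["customer_id"].strip(), float(r["amount"]))
            for r in rows if r.get("amount")]

def load(rows, conn):
    # Load: write the already-transformed result to the target.
    conn.execute("CREATE TABLE IF NOT EXISTS orders (customer_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# "orders.csv" and "warehouse.db" stand in for a real source and target.
conn = sqlite3.connect("warehouse.db")
load(transform(extract("orders.csv")), conn)
conn.commit()
```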
There came a time when we realized that there was no point in repeating these loading processes, and new technologies and approaches provided us with CDC (Change Data Capture).
Change Data Capture (CDC)
CDC is one of the ETL models for copying data. It is used to detect any change in a record – whether any of its values change or the record is deleted. In the past, this model was implemented with ETL tools by comparing the last copy of the data with the source or by checking the update timestamp of the source record. Performance was poor and there was a real possibility of missing some of the updates. CDC tools have changed this drastically: they use transaction logs to track changes, so no change goes unnoticed, and they don’t even affect the performance of the source database.
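The old, timestamp-based approach is easy to sketch in a few lines of Python (table and column names invented for illustration), and the sketch also shows why it is fragile: anything that doesn’t move the watermark – a hard delete, or an intermediate version overwritten before the next sync – simply goes unseen, which is exactly what log-based CDC fixes.

```python
from datetime import datetime, timezone

# Timestamp-based change detection: only rows updated after the last
# successful sync (the "watermark") are picked up and replicated.
last_sync = datetime(2024, 1, 1, tzinfo=timezone.utc)

source_rows = [
    {"id": 1, "value": "a", "updated_at": datetime(2023, 12, 30, tzinfo=timezone.utc)},
    {"id": 2, "value": "b", "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
]

changed = [r for r in source_rows if r["updated_at"] > last_sync]
print(changed)  # only row 2 is replicated to the target

# Deleted rows leave no trace here, and intermediate versions are lost;
# reading the transaction log captures both without touching the source tables.
```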
There are two different methods of detecting and collecting changes: data version control, which evaluates the columns that identify the rows that have changed (e.g., last-update date/time columns, version number columns, status indicator columns), and log reading, which reads the records that document the changes and allows replication to secondary systems. CDC has given us many improvements, but then the cloud came along and changed everything. Storage and high-availability computing created a new scenario.
ELT, if We Load First …
The ETL approach was once necessary because of the high cost of on-premises processing and storage. With the rapid growth of cloud-based data warehouses and the plummeting costs of cloud-based processing and storage, there’s little reason to keep transforming before loading into the final destination. In fact, flipping the last two steps allows analysts to do a better job on their own.
In short, analysts can now load data before transforming it – they don’t have to determine in advance exactly what insights they want to generate in order to decide on the exact schema they need – and that’s a big advantage.
Instead, data from the underlying source is replicated directly into a data store, which becomes a “single source of truth.” Analysts can then perform transformations on the data as needed, and they can always go back to the original data, which has not undergone transformations that might have compromised its integrity. This makes the business intelligence process incomparably more flexible and secure.
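Here is a minimal sketch of the same idea in ELT form (again with invented table names): the raw rows land untouched, and the transformation lives downstream, where it can be redefined at any time without touching the original data.

```python
import sqlite3

wh = sqlite3.connect(":memory:")  # stands in for a cloud data warehouse

# 1. Extract + Load: raw source rows are replicated as-is, preserving
#    the single source of truth.
wh.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, ts TEXT)")
wh.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", [
    ("u1", "login", "2024-01-01"),
    ("u1", "purchase", "2024-01-02"),
    ("u2", "login", "2024-01-02"),
])

# 2. Transform: applied afterward, as a view the analyst can change or
#    drop without ever altering the raw data.
wh.execute("""
    CREATE VIEW daily_logins AS
    SELECT ts, COUNT(*) AS logins
    FROM raw_events
    WHERE event = 'login'
    GROUP BY ts
""")
print(wh.execute("SELECT * FROM daily_logins").fetchall())
```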
The Declarative Principle
Every day, in many of our activities and needs, we behave and interact with others and with the context around us according to a declarative principle: We communicate what we need and receive the result of what was requested, ordered, or commissioned. In other words, we do not specify how a certain task should be carried out, and we do not teach others how to do their work; we merely state what should be done and the objectives to be pursued and achieved. Finally, we evaluate the results, measuring the effectiveness of the actions undertaken.
The underlying concept is simple: Why do I have to take care of how the data is extracted? Why do I have to know beforehand with what degree of parallelism and with which resources to do it? With our company’s platform, the engine orchestrates and synchronizes the data control and processing plane with intelligent algorithms, enabling Data Management professionals to work in a self-adaptive, metadata-driven environment.
First of all, each dataset used in processing is virtually re-exposed as if it were a table or a set of tables whose content cannot be modified after its creation, ensuring system consistency, efficiency, and control. It is the platform’s task to implicitly perform all the transformations necessary to ensure that a set of data, available in any format, is properly mapped. The whole processing flow is decomposed into a set of “data engines,” or functions, able to receive as input one or n virtual tables, apply the appropriate transformations, and produce as output one or more virtual tables, according to this model.
The functions can be of many types (e.g., Query, Script, Rule, R, Python, Masking, Profiling) and perform the processing as configured by those who define the solution, encapsulating the logic and the various models.
The configuration of the “data engines” implicitly defines a directed graph of execution dependencies in which the system is able to autonomously organize the sequence of steps required to produce each output; the designer is not required to describe the algorithm.
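As a toy illustration of that principle (this is not the platform’s actual API – the engine names and tables are invented), each step only declares which virtual tables it consumes and which it produces, and the execution order is derived from the dependency graph rather than hand-coded:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each "data engine" declares its inputs and its output; nobody writes
# the overall algorithm.
engines = {
    # output table: (function over available tables, input tables)
    "clean_orders": (lambda t: f"clean({t['raw_orders']})", ["raw_orders"]),
    "clean_users":  (lambda t: f"clean({t['raw_users']})",  ["raw_users"]),
    "order_report": (lambda t: f"join({t['clean_orders']}, {t['clean_users']})",
                     ["clean_orders", "clean_users"]),
}

tables = {"raw_orders": "raw_orders", "raw_users": "raw_users"}  # source data

# The dependency graph decides the sequence of steps needed for each output.
graph = {out: set(deps) for out, (_, deps) in engines.items()}
for name in TopologicalSorter(graph).static_order():
    if name in engines:
        fn, _ = engines[name]
        tables[name] = fn(tables)

print(tables["order_report"])  # join(clean(raw_orders), clean(raw_users))
```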
Each object in the system can be executed individually and/or combined with others to facilitate testability, maintainability, refactoring, and incremental development.
Processing follows a declarative scheme that specifies the expected output data sets. In very complex cases, the engine is also able to actively modify the graph during the run, based on dynamic properties that may be influenced by the previous steps – or even to generate and execute at runtime the data engines required for a specific processing.
During a run, all data processing operations are performed at a low level on the server, in a “set-oriented” mode that works on arbitrarily complex sets of data at once rather than “one line at a time.” This allows the server to minimize burdensome data movement when performing complex processing and to produce, in a single pass, intermediate data that will be reused many times, without the need to explicitly store it in supporting files or tables.
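As a rough analogy in Python (a toy, not how the server is implemented): the set-oriented version builds an intermediate structure once and reuses it for the whole set, instead of recomputing it record by record.

```python
customers = [{"id": 1, "segment": "retail"}, {"id": 2, "segment": "wholesale"}]
orders = [{"customer_id": 1, "amount": 10}, {"customer_id": 2, "amount": 20}]

# Row-at-a-time: the customer lookup is recomputed for every single order.
def enrich_row_by_row(orders, customers):
    out = []
    for o in orders:
        segment = next(c["segment"] for c in customers if c["id"] == o["customer_id"])
        out.append({**o, "segment": segment})
    return out

# Set-oriented: the intermediate result is built once for the whole set and
# reused, the way an engine keeps intermediate data in memory instead of
# re-deriving or re-storing it.
def enrich_set_oriented(orders, customers):
    segment_by_id = {c["id"]: c["segment"] for c in customers}  # built once
    return [{**o, "segment": segment_by_id[o["customer_id"]]} for o in orders]

print(enrich_set_oriented(orders, customers))
```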
A declarative approach definitely allows you to focus on what you want to achieve – and that’s what matters.