Data pipelines are like insurance. You only know they exist when something goes wrong. ETL processes are constantly toiling away behind the scenes, doing the heavy lifting that connects real-world data sources with the warehouses and lakes that make that data useful.
Products like dbt and Fivetran demonstrate the repeatability and scriptability of data pipelines. They are the ultimate input-to-output headless IT middleware: like a router forwarding packets, but moving records between systems instead.
What could GenAI add to this tidy bit of infrastructure? It turns out, a lot. Combine the scriptability of data pipelines with a language model’s ability to generate code and you get dynamic, self-updating ETL processes. Using the language model’s ability to understand and correct errors, those ETLs can be self-healing: disruptions like a schema change, a numeric overflow, or a full disk, any of which would previously have crippled the pipeline, can now be detected and repaired.
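To make that concrete, here is a minimal sketch of what a self-healing step might look like: wrap the transform, and when it throws, hand the traceback and the current source to a model, then log its proposed fix for human review. The `transform_batch` function, the model name, and the prompt are illustrative assumptions, not a reference to any specific product.

```python
import inspect
import traceback
from openai import OpenAI  # assumes an OpenAI-compatible endpoint and an API key in the environment

client = OpenAI()

def transform_batch(rows):
    # Hypothetical transform step: expects a "qty" field that parses as an int.
    return [{**row, "qty": int(row["qty"])} for row in rows]

def run_with_self_healing(rows):
    try:
        return transform_batch(rows)
    except Exception:
        failure = traceback.format_exc()
        prompt = (
            "This ETL transform failed. Diagnose the root cause and propose a "
            "corrected version of the function.\n\n"
            f"Traceback:\n{failure}\n\n"
            f"Current code:\n{inspect.getsource(transform_batch)}"
        )
        suggestion = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        print("Proposed fix (for human review, never auto-applied):\n", suggestion)
        raise  # the batch still fails loudly; a reviewed patch heals the next run
```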
One level higher in the stack, it’s not uncommon for errors to be introduced during harmonization when the input contract changes, for example when a brand is renamed or a new market is added. Traditional FLITE (for-loop-if-then-else) code would either fail to notice the issue (kicking the problem downstream) or error out. When a language model is monitoring the ETL process, this sort of logical change can be detected and semantically corrected. When an issue does make it through the pipeline, it is easy to update the prompts that manage the process. Data updates are the new “model drift.”
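One way such monitoring might work, sketched below with made-up brand names and an assumed OpenAI-compatible client: diff the incoming categorical values against the expected contract and ask the model whether an unknown value is a rename, a new addition, or bad data.

```python
from openai import OpenAI

client = OpenAI()

EXPECTED_BRANDS = {"Acme", "Globex", "Initech"}           # the current input contract
incoming_brands = {"Acme", "Globex Holdings", "Initech"}  # values seen in today's feed

unknown = incoming_brands - EXPECTED_BRANDS
if unknown:
    prompt = (
        f"Known brands: {sorted(EXPECTED_BRANDS)}\n"
        f"Unrecognized values in today's feed: {sorted(unknown)}\n"
        "For each unrecognized value, say whether it is most likely a rename of a "
        "known brand (and which one), a newly added brand, or bad data. "
        "Answer as a JSON object keyed by value."
    )
    verdicts = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # A human (or a downstream validation step) reviews before any mapping is applied.
    print(verdicts)
```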
In this sense, language models looking at data can provide analyst-level services like data cleanup and harmonization. It’s a simple task for a language model to convert location names from one language to another. Or adjust pack sizes labeled in many different units down to harmonized grams. Or extract specific attributes from unstructured text. These all used to be human jobs, and there’s still a HITL (human in the loop) monitoring role. But the model never gets tired, it speaks every SQL dialect and writes well-optimized queries, and it always follows orders.
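For the pack-size case, a sketch of the idea (the labels, model name, and prompt are all placeholders): ask the model to normalize free-form pack labels into grams, and have a human spot-check a sample before the values land.

```python
import json
from openai import OpenAI

client = OpenAI()

pack_labels = ["2 x 250 ml", "1kg bag", "12 oz", "500g", "six-pack of 330ml cans"]

prompt = (
    "Convert each pack-size label to an approximate total weight in grams "
    "(assume water-like density for liquids). Reply with a JSON object that "
    "maps each original label to a number of grams.\n" + json.dumps(pack_labels)
)
reply = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

# A production pipeline would validate this more defensively; here a human spot-checks.
grams_by_label = json.loads(reply)
print(grams_by_label)
```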
For the most part, at least. Large language models are still learning, so guardrails are necessary. For instance, a model can’t be trusted to convert text to SQL unless a human has reviewed the SQL before it runs; the tech is just not reliable enough yet. But we can curate foreseen patterns of ETL problems (we have a long history of examples) and show the model what to do in each case. Here we are using the model to understand and reason, not to take dangerous real-world actions like dropping tables on its own.
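A guardrail can be as blunt as refusing to execute anything the model produces unless it is read-only and a reviewer has signed off. The check below is a deliberately conservative sketch, not a complete SQL validator.

```python
import re

# Statements the model is never allowed to execute on its own.
FORBIDDEN = re.compile(r"\b(drop|delete|truncate|alter|grant|update|insert)\b", re.IGNORECASE)

def safe_to_execute(generated_sql: str, human_approved: bool) -> bool:
    """Allow only read-only, human-approved SQL from the model."""
    is_read_only = generated_sql.strip().lower().startswith(("select", "with"))
    has_forbidden = bool(FORBIDDEN.search(generated_sql))
    return is_read_only and not has_forbidden and human_approved

print(safe_to_execute("SELECT * FROM payroll WHERE num_dependents > 20", human_approved=True))  # True
print(safe_to_execute("DROP TABLE payroll", human_approved=True))                               # False
```

Anything that fails the check goes back to a human, which keeps the model in the understand-and-reason role described above.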
One thing a model can do on its own is create a key that joins tables. Harmonizing well-known things like dates and addresses is easy for FLITE code, but it takes a language model to figure out which location or brand should be joined to other tables based on human comments, OCR scans, or recorded voices. Language models are eating software, and folding sentiment analysis and named entity extraction right into ETL is just one example.
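Here is a rough sketch of that join-key idea, with a made-up canonical brand list and comment text: the model reads the messy text and returns the canonical key that the rest of the warehouse already joins on.

```python
from openai import OpenAI

client = OpenAI()

CANONICAL_BRANDS = ["Acme", "Globex", "Initech"]  # keys used elsewhere in the warehouse

def brand_key_from_text(free_text: str) -> str:
    """Ask the model which canonical brand a messy comment or OCR scan refers to."""
    prompt = (
        f"Canonical brand keys: {CANONICAL_BRANDS}\n"
        f"Text: {free_text!r}\n"
        "Which canonical brand does the text refer to? "
        "Reply with exactly one key from the list, or UNKNOWN."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    return reply if reply in CANONICAL_BRANDS else "UNKNOWN"

print(brand_key_from_text("cust says the globex-brand dispenser leaks after 2 wks"))
```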
The data itself does not have to be sent to an expensive, remote, slow language model. Semantic matching (better known as “embedding” among those who train GenAI foundation models) offers an inexpensive, fast approach for high-throughput data applications. Semantic matching works especially well for data because so much context (database, schema, type, etc.) is known about each element. The number 405 in a database is indistinguishable from other valid numeric values, but if the table is “payroll” and the column is “num_dependents,” a language model can flag the oddity.
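Below is a sketch of context-aware semantic matching with a small local embedding model (sentence-transformers is one common choice, used here as an assumption): the table and column names are embedded alongside the value, so the match uses the same context a human analyst would.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small local model; no data leaves the machine

CANONICAL = ["Acme", "Globex", "Initech"]

def with_context(value: str, table: str, column: str) -> str:
    # Embedding the context alongside the value is what makes the match "semantic".
    return f"table: {table}, column: {column}, value: {value}"

incoming = with_context("Globex Holdings PLC", table="sales", column="brand")
candidates = [with_context(c, table="sales", column="brand") for c in CANONICAL]

scores = util.cos_sim(model.encode(incoming), model.encode(candidates))[0]
best = CANONICAL[int(scores.argmax())]
print(best, float(scores.max()))
```

A low best score is itself useful: it is the signal to route the record to the human in the loop.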
But of course, if a language model is involved in any way, there will be concerns about where and how data is being sent and used. The good news is that there are now several proven options for air-gapped instances of very competent models (Mistral, Llama, etc.) that can be deployed to toil away in a secure location, providing value but never revealing their secrets. Even private instances of the largest models, like GPT-4V, can be provisioned as securely as the cloud databases that ETL already targets.
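Many self-hosted serving layers expose an OpenAI-compatible API, so the same pipeline code can often be pointed at an in-house model just by changing the endpoint. The URL and model name below are placeholders for whatever sits behind your firewall.

```python
from openai import OpenAI

# Same client, different endpoint: an air-gapped deployment inside the network.
client = OpenAI(
    base_url="http://llm.internal.example:8000/v1",  # placeholder self-hosted endpoint
    api_key="not-needed-for-local",                  # many local servers ignore the key
)

reply = client.chat.completions.create(
    model="llama-3-8b-instruct",  # whatever model is served locally
    messages=[{"role": "user", "content": "Is 405 a plausible value for num_dependents in a payroll table?"}],
)
print(reply.choices[0].message.content)
```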
A ripe area for improvement in ETL is aggregation. Until now we have had coarse choices: either stream the detailed data into a warehouse or apply simplistic aggregation functions (sum, average, max, etc.) to grouped batches before storing. Language models allow us to treat event detection as an aggregate. Monitoring satellite images as they stream into a database, we could instruct the model to save only the images that contain cyclones. In a security feed, we could save human conversations or loud noises but drop the rest. Save an event when a time-series signal becomes volatile. Keep five minutes before and after large sudden declines. These are all examples of event-based aggregation that a model can implement by generating verifiable signal-processing code on the fly.
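The “five minutes before and after a sudden decline” case is exactly the kind of small, verifiable snippet a model might generate and a human might review. Here is one possible version against pandas, with the threshold and window size as assumed parameters.

```python
import pandas as pd

def keep_windows_around_declines(signal: pd.Series, drop_threshold: float,
                                 window: str = "5min") -> pd.Series:
    """Retain only samples within +/- window of any sudden large decline."""
    drops = signal.diff() < -drop_threshold           # True where the signal falls sharply
    keep = pd.Series(False, index=signal.index)
    for ts in drops[drops].index:                     # timestamp of each detected decline
        keep |= (signal.index >= ts - pd.Timedelta(window)) & \
                (signal.index <= ts + pd.Timedelta(window))
    return signal[keep]                               # everything else is dropped before storage

# Example: one reading per second, with a sharp drop halfway through the hour.
idx = pd.date_range("2024-01-01", periods=3600, freq="s")
values = pd.Series(100.0, index=idx)
values.iloc[1800:] = 60.0                             # a sudden 40-point decline
kept = keep_windows_around_declines(values, drop_threshold=20)
print(len(kept))                                      # ~601 samples around the event survive
```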
Continuing advancements in generative AI are streamlining the way we manage data pipelines. By combining the scriptability of data pipelines with the dynamic capabilities of LLMs, organizations can achieve self-updating, self-healing ETL processes that adapt to changing data landscapes with unprecedented agility. And the ability of language models to provide analyst-level services, from data cleanup to semantic matching, promises to transform data management more broadly, offering practical answers to complex aggregation challenges while preserving data integrity.