Imagine you are assigned to extract sales insights from your data. Along with troves of corporate financials and market trend data, you are given access to hours of audio and video recordings of actual sales representatives speaking with customers. How do you process this in Spark?
Or consider another scenario: you work for a marketplace, and your job is to build a consumer-facing product catalog. You have a database with hundreds of thousands of SKUs, stock levels, and item descriptions, plus millions of product photo URLs provided by vendors. Some URLs are correct, others are broken, and there is no quality control for the photos whatsoever. Where do you start?
Normally, the go-to solution would be the “modern data stack”: some combination of data storage, data transformation, and data ingestion tools, topped off with business intelligence.
Unfortunately, none of these tools is sufficient for the tasks outlined above.
To understand why, let’s revisit enterprise data processing as it stands today, whose defining characteristic is that it is cloud-based. This was a major upgrade over the local processing of the pre-cloud era: we now enjoy virtually unlimited object storage, elastically scalable compute, and a mind-boggling selection of modular data mart components. As it stands, the main features of the modern data stack include:
- Cloud-based SaaS with sharing, versioning, and data governance
- Scalable cloud warehouses
- Primarily SQL-based data transformations
- ML and AI insights through integration with ML platforms
At first glance, this looks sufficient. However, the “modern” data stack is permanently split into two towers along the AI/ML demarcation line.
For example, on the warehouse side the language is SQL, but the lingua franca of ML is Python – good luck finding engineers who excel at both. Likewise, data warehouses run on CPU clusters, while ML models need GPUs. Finally, there is the issue of handling unstructured data like conference call videos and product photos: cramming these multimodal objects into database tables is already questionable, and tossing them over the fence to ML models is even more cumbersome.
This architecture still looks OK when the lion’s share of processing remains in SQL and AI plays a supporting role. When these roles become equal or shift toward AI, maintaining two separate data processing towers becomes untenable.
The inefficiencies of current practices outlined above are the pull behind the paradigm shift; the push comes from the rapid evolution of foundation models, which now can:
- Reduce time to value in analytics (no more custom ML models from scratch)
- Eliminate many big data requirements, since the data appetite is already amortized in the pretraining of foundation models
- Require less domain expertise (fine-tuning is much simpler than architecting)
- Reduce overall ML/AI costs (API calls are cheap when factoring in savings from not owning the training and inference infrastructure)
As we transition from wrangling structured data in SQL to crunching unstructured data with deep learning models, parts of the “modern” stack will inevitably become obsolete.
Let me explain.
Sure, vectorized operations over structured data are not going away. But the key shift is that unstructured data objects should be accessible right where they live – in cloud storage. There is no need to extract audio from mp3 files, store it in binary columns, and replicate it in every table iteration. Cloud storage is a perfectly good abstraction for unstructured data.
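To make this concrete, here is a minimal sketch of transcribing call recordings straight out of object storage. It assumes the open-source `fsspec`/`s3fs` and `openai-whisper` packages and a hypothetical `sales-calls` bucket – no binary columns, no copies into a warehouse:

```python
# A sketch, not a prescription: transcribe audio in place from cloud storage.
# Assumes `pip install fsspec s3fs openai-whisper` and configured AWS credentials.
import tempfile

import fsspec
import whisper

model = whisper.load_model("base")
fs = fsspec.filesystem("s3")

for path in fs.glob("sales-calls/2024/*.mp3"):  # hypothetical bucket layout
    # Stream the object to a local temp file only for the duration of inference.
    with tempfile.NamedTemporaryFile(suffix=".mp3") as tmp:
        fs.get(path, tmp.name)
        result = model.transcribe(tmp.name)
        print(path, result["text"][:80])
```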
There is also little reason to cling to SQL, given that it was born to query relational tables and has nothing to do with ML. If data crunching shifts toward running large foundation models via APIs and local AI helpers on GPUs, SQL drops in importance. The entire data warehouse can work perfectly well under the direction of Python – or any other data language of the future – and manage compute resources automatically.
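As an illustration, a Python-native engine like Polars can already run warehouse-style aggregations directly over Parquet files in object storage. The bucket and column names below are made up, and cloud credentials are assumed to be configured:

```python
# A hedged sketch: warehouse-style analytics with no SQL in sight.
# Bucket, file layout, and column names are illustrative assumptions.
import polars as pl

revenue = (
    pl.scan_parquet("s3://finance/sales/*.parquet")  # lazy scan, nothing copied
    .group_by("region")
    .agg(pl.col("amount").sum().alias("revenue"))
    .collect()
)
print(revenue)
```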
Finally, the entire concept of big data may become stale.
How?
Large foundation models already carry rich representations of the world, and fine-tuning them to specific needs takes far less data than previously imagined. Think of it this way: a factory worker does not need to watch millions of demonstrations to pick up a new operation. In the same way, a sufficiently intelligent foundation model can easily be specialized to many tasks in a top-down fashion. This runs squarely against the ML paradigm of the past, where every task required a separate model and a unique architecture, and the whole model cascade could only be trained given access to millions of data points.
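Often you do not even need fine-tuning: a handful of labeled examples in a prompt can stand in for a bespoke classifier trained on millions of rows. The sketch below uses the OpenAI chat API; the model name and examples are illustrative, and any chat-style API would do:

```python
# Top-down specialization in miniature: five examples instead of a training set.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT = [
    {"role": "system", "content": "Classify each support ticket as billing, shipping, or other."},
    {"role": "user", "content": "I was charged twice for order #1132."},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "My package has been stuck in transit for a week."},
    {"role": "assistant", "content": "shipping"},
]

def classify(ticket: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any capable chat model works
        messages=FEW_SHOT + [{"role": "user", "content": ticket}],
    )
    return resp.choices[0].message.content.strip()

print(classify("Where is my refund?"))
```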
Enter the burgeoning post-modern stack.
Circling back to the scenarios that opened this article, the prescribed workflow becomes clear.
First, you should keep the multimodal data where it belongs – in cloud storage. The discriminative and generative AI models you are likely to employ should be able to read unstructured data directly from storage, along with any metadata you may have.
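For the product catalog scenario, that might look like validating vendor photo URLs in place, without ever ingesting the images into a warehouse. A sketch, with illustrative names:

```python
# Validate a vendor photo where it lives; only the verdict enters the dataset.
import io

import requests
from PIL import Image  # pip install pillow

def check_photo(url: str) -> dict:
    """Fetch a product photo URL and record whether it is a usable image."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        img = Image.open(io.BytesIO(resp.content))
        return {"url": url, "ok": True, "width": img.width, "height": img.height}
    except Exception as exc:
        return {"url": url, "ok": False, "error": str(exc)}
```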
Second, the results of applying the AI models at every step should be organized into persistent, versioned, and traceable datasets. This requirement is pivotal for iterating on your data transformations.
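Even without a dedicated tool, the idea can be sketched as immutable, timestamped snapshots – one per pipeline run – so any iteration can be reproduced or diffed later. The paths and schema here are assumptions:

```python
# A minimal take on versioned datasets: every run lands in its own snapshot.
from datetime import datetime, timezone

import polars as pl

def save_snapshot(df: pl.DataFrame, name: str, root: str = "s3://datasets") -> str:
    """Write an immutable, timestamped version of a dataset and return its path."""
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = f"{root}/{name}/v={version}/data.parquet"
    df.write_parquet(path)  # assumes cloud credentials; a local root works too
    return path  # record this alongside the code revision for traceability
```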
Finally, the AI models themselves should be trivial to call, whether locally or via inference APIs, and should accept and return objects you can store in those datasets. Wrangling data with AI models should not involve switching from SQL to Python and back, making unnecessary data copies, or allocating static compute clusters.
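Put together, the three requirements collapse into a short Python loop. This sketch reuses the hypothetical `check_photo` and `save_snapshot` helpers from above:

```python
# End to end: metadata from storage -> model per object -> versioned results.
import polars as pl

urls = pl.read_csv("s3://catalog/photo_urls.csv")["url"]  # illustrative path
results = pl.DataFrame([check_photo(u) for u in urls])
print(save_snapshot(results, "photo-quality"))
```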
Can you do this today in your current stack? If not, maybe it is time to rethink it.