In my last blog post, I introduced the data mesh concept and explored the link between data democratization and data mesh. Since then I’ve had lots of interesting conversations on the topic with colleagues and customers. In particular, I interviewed one customer who has been on a very interesting data mesh journey. Unfortunately, NDAs prevent me from disclosing exactly which company I’m referring to, but the insights were so fantastic I didn’t want them to go to waste.
Below is the conversation I had with our customer about their organization’s recent data mesh journey.
Mathias: Can you kick things off by telling us about the business challenges you were facing that led you towards data mesh?
Customer: We have some great predefined analytics use cases – for things like dynamic pricing and demand forecasting – but our data is often only available or visible through the lens of these use cases. If our data scientists want to explore the data, create new algorithms, or answer one-time questions, it's a struggle. They need more flexible, freeform access to the data.
It’s similar for non-human downstream systems (e.g., a planning system that uses batch data to perform assortment planning for our stores). Here we’ve built complex point-to-point interfaces, but these are hard to manage, add operational complexity, and burden the data warehouse.
We wanted to make our data more consumable and independent of the original use case. Ultimately, we want to move towards more of a data marketplace where you go to find the data, subscribe to it, and then interface the data into your system.
We also want to shift from a centralized to a more federated data ownership model. We see this as a way to empower the domains to own and create their own data products and make them directly consumable as close as possible to the source.
Right now we have a central data engineering team that curates all the data on a central data warehouse or data lake that’s consumable for our use cases. The downside of this is, of course, scalability. Wherever there is a central team, sooner or later you get bottlenecks. You also lose domain know-how. A central team never has the same rich domain knowledge as the teams that are producing the data and running the business processes. We want to put the data into the hands of the people who have skin in the game.
With our business model dramatically accelerating in e-commerce/direct-to-consumer, our data models have become increasingly specialized. Having all the know-how in a central engineering team just doesn’t fly anymore in a digital world. We believe the data mesh has the answers.
Mathias: So, can you describe your data mesh journey so far? Where are you on it? And what does the future look like?
Customer: It started about a year ago when we read the famous article from Zhamak Dehghani. It chimed with all the challenges I was just talking about.
Initially we spent 2020 framing and shaping the data mesh journey. Up until then we didn't talk or think about data as a product – it was just fuel that drives a report or algorithm. We asked ourselves: What does a data product mean? What does DATSIS [discoverable, addressable, trustworthy, self-describing, inter-operable, and secure] mean for these data products? How do we translate these DATSIS principles into practice? Take discoverability: how do we wire data products to our enterprise data catalog or our Data Quality frameworks?
We defined the high-level objectives – four in total. One of these objectives is to be able to produce a data product within one day. As I mentioned above, we want to decentralize the ownership and creation of data products so we're not dependent on a central team anymore. However, in doing so, we cannot assume that all the domains have highly specialized data engineers. We need to make it as simple as possible for those decentralized teams to create a trustworthy data product. From the engineering perspective, we need to automate as much of the process as possible – creating tables, data pipelines, and code containers, setting up a CI/CD pipeline, and so on. All of that needs to be doable in one day and hidden behind some magic buttons for the user.
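To make that "magic button" idea concrete, here's a minimal sketch of what such scaffolding could look like. The ProductSpec class, file names, and layout are purely illustrative assumptions, not the customer's actual framework:

```python
# Hypothetical sketch of a one-button data product scaffold.
# ProductSpec and the generated file layout are illustrative assumptions.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ProductSpec:
    name: str   # e.g. "store_sales"
    owner: str  # owning business domain

def scaffold_data_product(spec: ProductSpec, root: Path) -> Path:
    """Create the skeleton a domain team would otherwise assemble by hand."""
    product_dir = root / spec.name
    product_dir.mkdir(parents=True, exist_ok=True)
    # Metadata descriptor: the single source that later automation reads.
    (product_dir / "product.yaml").write_text(
        f"name: {spec.name}\nowner: {spec.owner}\n"
    )
    # CI stub so every product is versioned and deployable from day one.
    (product_dir / "ci.yaml").write_text(
        "steps:\n  - validate_metadata\n  - deploy_pipeline\n"
    )
    return product_dir

print(scaffold_data_product(ProductSpec("store_sales", "retail-domain"), Path("./products")))
```

The point of hiding this behind one call is exactly what the customer describes: a domain team without specialized data engineers gets versioned, deployable artifacts without touching the underlying plumbing.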
Another objective – more about data product consumption – is for it to take five minutes from initially discovering a data product to running the first meaningful query against it in a data lab. This will involve proper cataloging of all the data and then directly wiring the data catalog to a data lab to create a library of predefined queries and results.
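To illustrate that five-minute path, here's a toy sketch of going from a catalog entry straight to a first query. The catalog structure and its fields are assumptions, and sqlite3 stands in for the JDBC/ODBC connection a real data lab would use:

```python
# Hypothetical five-minute path from catalog entry to first meaningful query.
import sqlite3  # stand-in for a real JDBC/ODBC data lab connection

catalog = {
    "store_sales": {
        "connection": ":memory:",  # placeholder; a real entry would hold a DSN
        "starter_query": "SELECT store_id, SUM(revenue) FROM store_sales GROUP BY store_id;",
    }
}

entry = catalog["store_sales"]
conn = sqlite3.connect(entry["connection"])
# Seed a tiny table so the predefined starter query returns something in this sketch.
conn.execute("CREATE TABLE store_sales (store_id INT, revenue REAL)")
conn.executemany("INSERT INTO store_sales VALUES (?, ?)", [(1, 9.5), (1, 3.0), (2, 7.25)])
for row in conn.execute(entry["starter_query"]):
    print(row)
```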
In 2021 we went live with our data mesh approach.
So far, we have successfully created a data pipeline framework in AWS that supports our objective to create new data products in one day. We have automated the infrastructure, the compute and storage for a data product, and the creation of the data product itself – DDL statements driven by metadata, data pipeline creation driven by metadata, scheduling driven by metadata, and so on. We have already created a lot of engineering abstractions that make it simpler to create a data product in the AWS stack.
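As a rough illustration of "driven by metadata," the sketch below derives both a table's DDL and a pipeline schedule from a single metadata record. The field names and the job format are assumptions – in the customer's real framework this metadata would feed their AWS stack rather than simple print statements:

```python
# Toy example: one metadata record drives both DDL and scheduling.
# Field names are assumptions, not the customer's schema.
product_meta = {
    "name": "store_sales",
    "columns": {"store_id": "INT", "sale_date": "DATE", "revenue": "DECIMAL(12,2)"},
    "schedule": "0 3 * * *",  # cron: daily at 03:00
}

def ddl_from_metadata(meta: dict) -> str:
    """Render a CREATE TABLE statement from the column metadata."""
    cols = ",\n  ".join(f"{name} {sqltype}" for name, sqltype in meta["columns"].items())
    return f"CREATE TABLE {meta['name']} (\n  {cols}\n);"

def schedule_from_metadata(meta: dict) -> dict:
    """Return a declarative job definition a scheduler could consume."""
    return {"job": f"load_{meta['name']}", "cron": meta["schedule"]}

print(ddl_from_metadata(product_meta))
print(schedule_from_metadata(product_meta))
```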
We have also created an automated DATSIS scorer that allows us to describe our data products through metadata and build a confidence/maturity rating. We use YAML files to describe each data product on an abstract basis. The file describes who the owner is and links you to the business documentation in the data catalog. In the future it will also describe the link to the Data Quality Framework – that is, which kinds of Data Quality checks are implemented against the data – and the available interface types (e.g., JDBC/ODBC, streaming, file access, etc.). The resulting DATSIS score tells you how mature your data product is and to what extent the DATSIS criteria are met, which should give you a confidence level.
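Here's a hedged sketch of what such a YAML descriptor and a simple scorer might look like. The descriptor fields, the catalog URL, and the scoring rule are assumptions for illustration, not the customer's actual schema:

```python
# Sketch of a DATSIS-style scorer over a YAML product descriptor.
# Fields and weighting are illustrative assumptions.
import yaml  # PyYAML

descriptor = yaml.safe_load("""
name: store_sales
owner: retail-domain
documentation: https://catalog.example.com/store_sales   # hypothetical catalog link
quality_checks: [not_null_store_id, revenue_non_negative]
interfaces: [jdbc, files]
""")

# One point per DATSIS-relevant field that is present and non-empty.
criteria = ["owner", "documentation", "quality_checks", "interfaces"]
score = sum(bool(descriptor.get(c)) for c in criteria) / len(criteria)
print(f"DATSIS maturity for {descriptor['name']}: {score:.0%}")
```

A descriptor missing, say, its documentation link would score lower, which is the maturity signal the customer describes.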
Over the next few years we’ll be turning all relevant data assets into products. The DATSIS scorer will be important in telling us how far along the journey each data asset is.
Mathias: Are there any particular learnings or insights you’ve got about embarking on a data mesh project that might be useful for somebody else in your shoes?
Customer: From a technical perspective my advice would be to start with metadata in mind, because whatever isn't captured in metadata you can't automate later on. You need a good idea of how you describe a data product, what kind of metadata is required, and how you maintain and manage that metadata. Starting with a good metadata model in mind will drive the creation of your system artifacts. This is critical because you cannot assume that you have enough data engineering experts located across all business domains. The key is to have self-service tools in place that generate physical artifacts in your systems going forward. This is only possible through metadata.
From a business perspective, we have embarked on a journey to become product-led. The goal is to bring IT solutions and the business domains closer together to ensure we build products the consumer wants and that create value. In this context, we also look at data as a product, which traditionally was not the case. Here we anticipate that the change management involved in transitioning to a federated ownership model will be particularly difficult. It requires a shift in mindset on data accountability. I don't have a silver bullet for this, unfortunately, but it's something to be very mindful of. Having a central engineering team is convenient. But as I mentioned earlier, you get bottlenecks. When you shift to a federated model, people have to develop the mentality that accountability starts with me, as I'm producing the data. My accountability should not stop at the boundary of a system; I should also take ownership of the value my data produces in the wider ecosystem.
In our case, our business’ challenges are amplified by the fact we were not born in the digital age. We have a long heritage and legacy systems. Although we have dramatically accelerated our digital transformation in recent years, we are not a company like Amazon that was born out of data and had a clear data-driven mindset across all business domains from the very beginning.