By Dipti Borkar.
If Amazon can deliver an order in 1 hour, why does it take days or weeks for data scientists to access their datasets? Embrace chaos, embrace data silos, orchestrate, and deliver.
As I head to AWS Summit today, I’ve been thinking about the beginnings—how it began for Amazon, and the magic that made it happen.
Turns out, there are quite a few similarities between how Amazon fulfills its product orders and how a data engineer makes working sets of data available to data scientists and analysts. To begin with, both involve a warehouse. Before I talk more about the data world, let’s talk a little more about the basics—how Amazon, the largest retailer in the world, fulfills the millions of orders submitted every day.
I was amazed to find out that since the early days, Amazon has embraced chaos in the way it organizes its warehouses. You would think that, given how efficient Amazon is, a traditionally organized warehouse would be a big part of what makes things happen. Oddly enough, what makes Amazon’s warehouses work is that they don’t organize inventory—it’s completely random. Items are distributed and stored everywhere: toothpaste sits next to a bottle of wine, which is nestled next to a blender. As of 2019, Amazon sells and delivers more than a billion products via Amazon Prime, with millions of orders delivered daily. So while items are everywhere, the most important aspect is that Amazon tracks and knows exactly where every item is located.
A picker (historically a person, but increasingly a robot) puts together an order (a few at a time of the millions of orders that come in) made up of many different items using a map of where the items can be found in the warehouse (in an optimized way, of course). The order eventually gets delivered to you, on time (most of the time).
Now while this worked for the 2-day shipping requirements of Amazon Prime, Amazon Prime Now (1-hour delivery) forced Amazon to take things to the next level. Prime Now covers roughly the 1 million most popular, most important products of the more than one billion Amazon sells. Here, the key is that the fulfillment centers are physically closer to the buyer, in their locality, and stock only those million popular items. A picker again puts together orders, each containing many items, that get delivered to you in less than an hour.
You can actually apply a lot of these concepts to the world of data or Big Data as we have been calling it.
You’d think that delivery of requests (orders) for the required dataset (an order with lots of items in it) by a data analyst/data scientist (a Prime subscriber) would be so much easier in the virtual world. It definitely would be faster than the 2 days that Prime shipping takes. You would think that even a 1-hour (Prime Now) turnaround would be easy; in fact, you would think it would be delivered on demand as needed.
But that’s not the case. Data is increasingly everywhere, spread out across many different storage systems, data centers, regions, and clouds. Instead of embracing chaos, technologies have been trying harder and harder to organize it all and shove it into a single data lake. In addition, the “picker” who fulfills a data analyst’s request to assemble a dataset from many different sources is a data engineer without the tools to automate that assembly. Instead, data engineers end up copying data around again and again. Delivery of datasets can take several days and sometimes even weeks.
Delivery of datasets on demand is really hard to achieve unless you optimize the orchestration of the end-to-end process, much like Amazon does.
To begin, embrace chaos by embracing data silos. It’s actually ok that data is all over the place; what matters is that you know where it is (like in the Amazon warehouse) and that you can orchestrate its movement, as needed, to fulfill requests (like the picker in the warehouse). In addition, the most important and most popular data needs to be as close as possible to the consumers (that is, close to compute) in order to achieve data locality (much like the Prime Now fulfillment centers in key cities).
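The two ideas above—track where every dataset lives across silos, and keep the popular working sets cached near compute—can be sketched in a few lines. This is a minimal, hypothetical illustration; all names (`DataCatalog`, `LocalCache`, the URIs) are assumptions for the sketch, not any real orchestration product's API.

```python
# Hypothetical sketch of data orchestration in miniature:
# a catalog that knows where every dataset lives (the warehouse map),
# and a local cache that keeps hot datasets close to compute
# (the Prime Now fulfillment center).

from collections import Counter


class DataCatalog:
    """Maps dataset names to their silo locations across clouds/regions."""

    def __init__(self):
        self.locations = {}           # dataset name -> remote silo URI
        self.access_counts = Counter()  # popularity, for deciding what to cache

    def register(self, dataset, silo_uri):
        self.locations[dataset] = silo_uri

    def locate(self, dataset):
        self.access_counts[dataset] += 1
        return self.locations[dataset]


class LocalCache:
    """Holds popular working sets near compute for data locality."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}               # dataset name -> cached bytes (stub)

    def fetch(self, dataset, catalog):
        if dataset in self.store:            # cache hit: served locally, fast
            return self.store[dataset], "local"
        uri = catalog.locate(dataset)        # cache miss: orchestrate a copy
        data = f"bytes-from:{uri}"           # stand-in for a real transfer
        if len(self.store) < self.capacity:  # keep it close for next time
            self.store[dataset] = data
        return data, "remote"


# The data stays in its silos; only the map and the hot copies move.
catalog = DataCatalog()
catalog.register("clickstream", "s3://some-bucket/clickstream")
catalog.register("orders", "hdfs://dc1/orders")

cache = LocalCache(capacity=1)
_, first = cache.fetch("clickstream", catalog)   # first read goes remote
_, second = cache.fetch("clickstream", catalog)  # repeat read is local
print(first, second)  # → remote local
```

The point of the sketch is the division of labor: the silos stay as they are, the catalog makes them findable, and the cache earns the locality that Prime Now gets from neighborhood fulfillment centers.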
This is what data orchestration is. It is a new approach to building a modern, disaggregated analytics stack. By orchestrating active working sets for computational frameworks, on-demand delivery of data becomes possible. And just as the retail industry turned upside down as soon as people realized they could order something and get it tomorrow, I believe the data industry will do the same, now that data orchestration can fulfill data requests on demand.