Each semester my Harvard University Extension class is about half business and half computer science students. As you might expect, the business students struggle with learning how to code. Meanwhile, the computer science students feel hindered by business case studies and by learning to work in a cross-functional group.
As the industry matures, data scientists are expected to generate business value, not merely to be the most technically savvy people in an organization. Today’s focus is “data to value,” rather than optimizing to the sixth decimal point as in a Kaggle competition. Yet, semester after semester, I still see computer science students who believe that data, coding, and modeling alone deliver business value.
It’s not their fault. Academia teaches with curated data sets, while the media extols the benefits of having a data scientist. But there is a big gap between building models and productionizing them, let alone delivering value. If you are a data scientist, ask yourself: How many of your models are in production? Are they fault-tolerant? Are lifecycle-management concerns, such as data drift, accounted for in production? Or are you endlessly building complex models that demonstrate aptitude with no regard for implementation?
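To make the drift question concrete, here is a minimal sketch of one common check, the Population Stability Index (PSI), comparing a feature’s distribution at training time against what the model sees in production. The function, the sample data, and the ~0.2 threshold are illustrative assumptions, not a universal standard.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a production sample."""
    # Fixed-width bins over the baseline (training) range
    edges = np.linspace(np.min(expected), np.max(expected), bins + 1)
    # Clip production values into that range so every value lands in a bin
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions so empty bins don't blow up the log term
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Illustrative data: the production distribution has shifted since training
rng = np.random.default_rng(7)
train_scores = rng.normal(50, 10, 5_000)   # feature values at training time
prod_scores = rng.normal(55, 12, 5_000)    # same feature, last week in production
psi = population_stability_index(train_scores, prod_scores)
print(f"PSI = {psi:.3f}")  # rule of thumb: > 0.2 usually warrants investigation
```

A scheduled check like this, wired to an alert, is a small but real step from “model built” toward “model managed.”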
If you just realized you are on the wrong side of value creation, here is a best practice I share with my students that you may find helpful.
Data “Gemba Walks”
When I worked at Amazon, the value of lean manufacturing’s “Gemba walk” was paramount. Gemba means “actual place” in Japanese; for a Gemba walk, you are asked to go to the actual place where the process occurs. Managers were expected to observe items being picked, packed, and shipped, then work on efficiencies in that process. A Gemba walk forces managers to have a depth of knowledge, not just a cursory understanding of a process.
The same holds true for the data scientist. Don’t just sit at your desk, run `SELECT *` against a database, and assume the data is appropriate for modeling. The human element plays a big part in the data you are seeing.
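At minimum, a few profiling calls on a freshly pulled table will surface problems that a raw `SELECT *` hides. This is a minimal sketch on a tiny fabricated stand-in; the column names and values are hypothetical.

```python
import pandas as pd

# Fabricated stand-in for a freshly pulled table; real pulls are far larger
calls = pd.DataFrame({
    "call_id": [101, 102, 103, 104],
    "disposition_code": ["Disconnected Call", "Disconnected Call", None, "Missing Item"],
    "handle_seconds": [12, 8, 95, 40],
})

print(calls.shape)                         # how much data did the pull actually return?
print(calls.dtypes)                        # are codes strings, numeric IDs, misparsed dates?
print(calls.isna().mean())                 # missingness per column
print(calls["disposition_code"].nunique()) # cardinality: do we see all the expected codes?
```

Profiling tells you what the data looks like; only observing the process tells you why it looks that way.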
For example, in the food delivery service industry, customer service agents apply a “disposition code” after hanging up from a complaint call. One call could be “Missing Item” while another could be “Order Tracking.” There are around 24 disposition codes to choose from. It makes sense to smooth the process by having a robust list of codes available, perhaps initially ordered alphabetically. A data scientist could simply pull the data and train a model on the most frequent customer complaint, thereby helping the food delivery operation. Except it’s not so simple! It turns out the first code, “Disconnected Call,” was the most common. Call center agents are timed to the second, so nearly all of them selected the first item on the list to save time.
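A quick frequency table would have flagged this pattern before any modeling began. The sketch below fabricates a small sample to mirror the skew described above; the counts and the 50% threshold are illustrative, not from a real call center.

```python
import pandas as pd

# Fabricated sample mirroring the pattern described above
codes = (["Disconnected Call"] * 800 + ["Refund Request"] * 100
         + ["Missing Item"] * 60 + ["Order Tracking"] * 40)
calls = pd.DataFrame({"disposition_code": codes})

shares = calls["disposition_code"].value_counts(normalize=True)
print(shares)

# Red flag: the alphabetically first option dominates beyond any plausible base rate
top_code = shares.index[0]
if shares.iloc[0] > 0.5 and top_code == min(shares.index):
    print(f"'{top_code}' is {shares.iloc[0]:.0%} of calls and sits first in the list — "
          "check for position bias in how agents pick codes.")
```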
A frequency check can flag that something is off, but it can’t explain why. If a data scientist never sat next to an agent to observe how the data is collected, they would never uncover the root cause. A simple “Gemba walk” to the operation itself reveals that the data is faulty and that any model trained on it would be worthless. From there, they could look for proxy fields that may be more accurate, or work with business partners to improve quality controls.
There are many more “best practices,” and even formal training, to help you. I encourage data scientists to recognize that they provide the most value when working collaboratively with their non-technical stakeholders. Otherwise, data science will be a cost center, not a value generator. Learn to prioritize “data to value,” not “data to a model.”