Click to learn more about author Edward Wong.
Data scientists and analysts spend most of their day interacting with data problems – preparing data for analysis, writing and testing algorithms, etc. However, every so often you will have to present and justify your work to management and answer their questions. After all, they’re likely paying you a hefty salary.
Unfortunately, most decision-makers don’t come from a data science background, and their expectations and vision may differ from yours. Management tends to want to see the “latest and greatest” advancements in machine learning (ML) implemented in your products, regardless of their applicability to the desired business outcome. As a data scientist or analyst, it’s your responsibility to set expectations with management as to what’s possible and the best way to achieve results. Doing so is just as much a part of your job as researching the latest ML libraries.
Issues with ML Solutions
While it may seem strange to hear this from a data scientist, non-machine learning tools are frequently better options for achieving business outcomes than their ML counterparts. This is for a number of reasons. First, there’s the issue of maintenance. Just like your car or your house, algorithmic systems cannot simply be set up and forgotten about. They require regular maintenance to function at peak performance – algorithms must be synchronized, ever-changing machine learning libraries must be accounted for, etc. You need to help management understand that these systems are difficult for lay engineering staff to maintain due to their technical complexity. And given the dearth of data science talent, this places a hard limit on the ability of everyone except the Facebooks and Microsofts of the world to properly leverage ML systems.
Another issue is infrastructure. Unless your task is a small one or is meant to be executed only once, a significant expenditure on supporting infrastructure will be a necessity. Without this, it will be difficult to make an ML system ready for use in production. Before implementing such a system, management needs to be aware of the upfront costs and the ensuing technical debt, which the system’s benefits may not outweigh.
At an even more basic level, many businesses simply lack the data necessary to train and feed the kinds of algorithms management wants to see. It’s to this issue that we now turn.
The Importance of Data Hygiene
Business leaders often forget that machine learning algorithms are not a panacea that can be thrust into a given use case and expected to magically deliver value on their own. Algorithms rely on large, accurate datasets to train and generate predictions. Data science is just the end result of a long process of data collection, cleansing, and tagging that requires significant investment. That’s why it’s important to have a robust Data Governance strategy in place at your business. Unfortunately, management often overlooks this. Having failed to make the necessary investments in Data Governance, they nonetheless expect their data scientists to “figure it out.”
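To make the cost of skipped data cleansing concrete for management, it can help to show a quick hygiene summary of the dataset you have been handed. The following is a minimal sketch, not a full Data Governance tool: the function name `hygiene_report`, the field names, and the sample records are all hypothetical, and it only counts missing values and exact duplicate rows.

```python
from collections import Counter


def hygiene_report(rows, required_fields):
    """Summarize missing values and exact duplicate rows in a list of dicts."""
    missing = Counter()
    for row in rows:
        for field in required_fields:
            value = row.get(field)
            # Treat None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[field] += 1

    # Two rows count as duplicates when every required field matches exactly.
    seen = Counter(tuple(row.get(f) for f in required_fields) for row in rows)
    duplicates = sum(count - 1 for count in seen.values() if count > 1)

    return {
        "total_rows": len(rows),
        "missing_by_field": dict(missing),
        "duplicate_rows": duplicates,
    }


# Hypothetical sample records, for illustration only.
records = [
    {"customer_id": 1, "region": "EU", "spend": 120.0},
    {"customer_id": 2, "region": "", "spend": 80.0},
    {"customer_id": 2, "region": "", "spend": 80.0},
    {"customer_id": 3, "region": "US", "spend": None},
]

report = hygiene_report(records, ["customer_id", "region", "spend"])
print(report)
```

A summary like this gives leadership a number to react to, such as how many rows are unusable, rather than an abstract request for more investment in data quality.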
Even where management has made the necessary investments in Data Governance and you have access to a large, healthy internal dataset, there are certain functions you will still have difficulty performing. These most prominently include anything that requires you to leverage customer data. The frequency of widespread breaches and scandals involving the misuse of data, along with the accompanying rise in government regulation, has made it more difficult than ever to leverage customer data within businesses’ ML systems. Data scientists need to make it clear to management that there are good reasons not to have access to customer data, but that limited data necessitates limited results.
Focusing on the Minimum Viable Product (MVP)
A long-standing idea within the “Lean Startup” methodology, the concept of “minimum viable product” (MVP) refers to the notion of creating the simplest possible version of a new product that will still solve for its intended use case. The idea is to get your product in the hands of customers as quickly as possible so that you can observe how they use it and learn from their feedback. You obviously don’t want to spend a ton of resources developing a bunch of features that customers either don’t want or have difficulty using as intended. The MVP concept offers the same value in an ML scenario for similar reasons.
As previously discussed, data scientists are a rare and expensive commodity and ML algorithms require constant tweaking. By following an agile methodology and shipping smaller, more frequent releases, you give yourself the opportunity to test your hypotheses in the field and adjust as needed to achieve the desired business outcomes. Of all the concepts here, this is likely to be the easiest for leadership to accept. After all, management may not understand data science, but they likely have a keen appreciation for dollars and cents.
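One way to apply the MVP idea to an ML project is to ship a simple non-ML rule first and adopt a model only if it clearly beats that rule on held-out data. The sketch below assumes a hypothetical churn use case; the labels, the 30-day inactivity rule, the hard-coded model predictions, and the `min_gain` threshold are all illustrative assumptions rather than recommended values.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def choose_approach(baseline_acc, model_acc, min_gain=0.05):
    """Prefer the simple baseline unless the model beats it by at least min_gain."""
    return "model" if model_acc - baseline_acc >= min_gain else "baseline"


# Hypothetical hold-out data: 1 = customer churned, 0 = customer stayed.
labels = [1, 0, 0, 1, 0, 1, 0, 0]
days_inactive = [40, 5, 12, 33, 2, 10, 8, 31]

# Baseline: a one-line business rule, no ML involved.
baseline_preds = [1 if d > 30 else 0 for d in days_inactive]

# Stand-in for the output of a trained model (hard-coded for illustration).
model_preds = [1, 0, 0, 1, 0, 1, 0, 1]

baseline_acc = accuracy(baseline_preds, labels)
model_acc = accuracy(model_preds, labels)
decision = choose_approach(baseline_acc, model_acc)

print(f"baseline={baseline_acc:.3f} model={model_acc:.3f} -> ship the {decision}")
```

Framing the decision as a required gain over the baseline turns the conversation with management into one about costs and measurable benefit, which is usually easier than debating algorithms.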
The Engineering Hand-Off
We previously noted the difficulty inherent in engineering staff maintaining advanced ML models. But unless you’re an all-in-one developer and data scientist, the majority of infrastructure implementation will be left to engineering and a hand-off will need to occur. To ensure the process is as smooth as possible, it’s critical to have managerial support for two things: creating documentation and setting up regular cross-team check-ins. Make sure to provide extensive documentation of everything you’ve built in the form of a Jupyter Notebook or some other system so that following your steps is as straightforward as possible. Remember, when in doubt, it’s always best to document. Cross-team meetings will give you the opportunity to walk your colleagues in engineering through the documentation and assess their level of knowledge.
Getting the necessary time to do this work may prove difficult, as management can be reluctant to devote technical talent to anything other than development. The onus is on you as the data scientist/analyst to make the case that an early investment in documentation and cross-team planning will result in a tenfold return in the form of time saved later.
Concluding Thoughts
While popular imagination may picture a data scientist/analyst as someone who sits behind a monitor all day writing algorithms, in practice your role is very much that of a communicator. It will be up to you to convince management of the investments they need to make to ensure a successful ML program and the consequences if they do not.
Is this something you’ve experienced in your role? Please let me know in the comments below!