Sometimes, we just need to start. We need to ignore our internal excuses and just do something. But that can be tough when the business questions we are trying to answer are nebulous and ill-defined.
Regardless of our personalities, inertia often confronts us as data scientists. When faced with a big, hairy problem at the outset, we can feel (justifiably) paralyzed when we don’t immediately see the path to an answer – or even the path that leads to the side road that leads to the path to an answer. Our entire arsenal of potential modeling methodologies and statistical techniques starts swirling around in our heads, taking turns knocking each other out of our consideration set. And the data… oh, the data. What kind of maddening issues are we going to find in this particular data set that will make us want to tear our hair out? How long will it take us to prep the data so we can even begin to analyze anything? And the client wants this done yesterday! It can be so hard to just…start.
This is where agile thinking is your friend. Rather than try to plan out the entire path in arduous detail, you need to ask yourself, “What can I get done in the next week to make some progress towards my eventual goal?” It’s a famous maxim in the agile community that you never know less about a project than you do at the start of the project. This is one of the most important points to remember in project planning, and one we too often forget to acknowledge. So, what can you do to jumpstart that learning process?
Often it begins with something as simple yet achievable as plotting your data, summarizing it to understand the shape of its distribution and its five-number summary,* and identifying its flaws. Remember, your client probably doesn’t know the data set inside and out either, nor how to tactically leverage it to solve the problem posed. The client hired you, after all, to help figure it out. Any new knowledge about the data, however small it seems, is a meaningful step forward.
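To make that concrete, here’s a minimal sketch of that first-pass look in Python, assuming your data lives in a CSV file (the file name here is a placeholder) and that you have pandas and matplotlib installed:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the client's data set (the file name is hypothetical).
df = pd.read_csv("client_data.csv")

# Plot each numeric column to see the shape of its distribution.
df.hist(figsize=(10, 6))
plt.tight_layout()
plt.show()

# describe() reports the five-number summary (min, 25%, 50%, 75%, max)
# for every numeric column, plus the count, mean, and standard deviation.
print(df.describe())

# Start cataloging flaws: missing values per column, duplicate rows.
print(df.isna().sum())
print(f"Duplicate rows: {df.duplicated().sum()}")
```

Even this handful of lines usually surfaces something worth knowing – a skewed distribution, a column full of nulls – that shapes what you do next.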
And you’ve learned something. Perhaps this opens your eyes to what the next two weeks will look like. For instance, “By next week, we can accomplish X, and then the week after, we can accomplish Y.” So next week, you set a goal to accomplish X. Maybe you get all the way there, maybe you don’t. But you’ve learned something along the way. And that newfound knowledge might lead you to muse, “Well, I had planned to accomplish Y next week, but what I’ve now learned suggests that it might be better to accomplish Z and then W instead.” So, go do that.
This rapid, continuous, iterative approach of making demonstrable progress toward the end solution – even if that solution isn’t obvious from the get-go – is one of the core tenets of agile data science. These iterations are often called “sprints,” a moniker borrowed from track and field that I’ve always appreciated because, within the narrow context of each sprint, there is a finish line! An iterative approach means you and your team can enjoy that feeling of crossing a finish line multiple times throughout your project, relishing these small successes over and over again. It also gives you a concrete milestone at the end of each sprint, assuring your client that you’ve made encouraging progress toward the end goal.
So yeah, just start. Do something. Like, just one thing. Every single sprint, every single time. Then, use that new knowledge to figure out the best thing to do next. Rinse and repeat, and eventually you’ll be done, and you can bask in the satisfaction of a job well done and a happy client. Do it. It works.
*For the normal, non-nerdy folks, a “five-number summary” describes the general shape of a dataset using its minimum, 25th percentile, median, 75th percentile, and maximum values. It helps you understand the “spread” of the data – whether the values are packed closely together around a certain number, or whether there is a wide range of values in the dataset.
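If you’d like to see that computed, here’s a tiny sketch using NumPy (the sample values are made up purely for illustration):

```python
import numpy as np

# Made-up sample data, purely for illustration.
values = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21])

# The five-number summary: min, 25th, 50th (median), and 75th percentiles, max.
q = np.percentile(values, [0, 25, 50, 75, 100])
print(dict(zip(["min", "Q1", "median", "Q3", "max"], q)))
```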