In 2012, former Chief Data Scientist of the United States DJ Patil and Babson College Professor Thomas H. Davenport famously declared that the job of Data Scientist would be the “Sexiest Job of the 21st Century.” While that may be true, most Data Scientists spend the biggest chunk of their day on their least favorite part of the job: data cleansing and preparation.
That’s bad for both the business and the Data Science professional. Qualified Data Scientists are rare and expensive; businesses waste resources and Data Scientists become frustrated when they’re forced to spend excessive time on what’s sometimes referred to as “Janitor Work.” As a result, a whole ecosystem of offerings has emerged or shifted focus in an attempt to automate this work, from proprietary offerings like Trifacta to open source alternatives like Talend. While these are important, no software can take the place of a mature Data Profiling program.
At a high level, “Data Profiling” refers to the process of collecting summaries and statistics of data from a particular source – think of it as a kind of data “audit.” While the reasons for pursuing Data Profiling are varied, it’s typically done as part of an overarching Data Governance Strategy in order to uncover errors and inconsistencies within datasets. The ultimate goal is to discover the ways in which data is being entered or processed incorrectly so that those errors can be rectified.
You’ve probably heard the expression “garbage in, garbage out” – that’s as true in Data Science as it is anywhere. It doesn’t matter how powerful your models are if you’re training them on bad data. If you’re trying to run pattern recognition with the assumption that the field for US state is written with a two-letter abbreviation, a user who has filled out the field as “Alaska” will throw off the entire process. Although you can always manually comb through every value in a dataset to ensure its integrity, that’s not a scalable solution in the long term. Instead, it’s important to leverage Data Profiling as a means of discovery.
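To make that concrete, here’s a minimal sketch of such a check in pandas; the DataFrame and its state column are hypothetical stand-ins for your real data:

```python
import pandas as pd

# Hypothetical prospect data; in practice this would come from your database or CRM.
df = pd.DataFrame({"state": ["CA", "NY", "Alaska", "TX", None]})

# Flag values that don't match the expected two-letter, upper-case pattern.
# na=False treats missing values as non-matching, so they surface as well.
bad_states = df[~df["state"].str.match(r"^[A-Z]{2}$", na=False)]
print(bad_states)
```

Running a check like this before training a model surfaces the “Alaska” problem immediately, rather than letting it silently skew your results.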
There are two schools of thought regarding how to get started: manual and automated profiling. Manual profiling is how most people get started. It’s essentially a “sniff test” that requires some degree of knowledge of what your data “should” look like and whether your profile roughly maps back to expectations. For example, when looking at the distribution of values for numeric data types, what are the min/max/mean/median values? Which values are most common? Does it make sense for 90% of prospects to be working in organizations with 5,000 or more employees?
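As a sketch of what that sniff test might look like in pandas (the employees column is a hypothetical stand-in for your prospect data):

```python
import pandas as pd

df = pd.DataFrame({
    "employees": [12, 5400, 7800, 25, 6100, 9300, 5600, 48],
})

# Summary statistics for a numeric column: count, mean, min/max, quartiles.
print(df["employees"].describe())

# Most common values -- useful for spotting suspicious spikes.
print(df["employees"].value_counts().head())

# Share of prospects at organizations with 5,000 or more employees.
print((df["employees"] >= 5000).mean())
```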
Automated profiling is where some of the tools I previously mentioned come into play. This kind of profiling is typically done through software acquired from an external vendor. The idea is to proactively identify anomalies that may escape the notice of even an experienced Data Scientist.
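For illustration only, an open source library such as ydata-profiling can generate this kind of automated report in a few lines; the file name and column contents here are assumptions:

```python
# pip install ydata-profiling
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("prospects.csv")  # hypothetical source file

# Generates distributions, correlations, missing-value maps, and
# flags potential issues such as high cardinality or constant columns.
profile = ProfileReport(df, title="Prospect Data Profile")
profile.to_file("prospect_profile.html")
```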
Regardless of how you get started, the process of Data Profiling will allow you to construct a dashboard that provides a clear picture of your data’s health and share it with other stakeholders. For example, at the most basic level, the number of NULL values as a percentage of the total number of values is often a good proxy for the overall health of a dataset. From there you can examine other potential issues such as validity. In our previous example, a value of “Los Angeles” in the state field is obviously not valid and needs to be rectified.
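A minimal sketch of those two health checks in pandas, again with hypothetical file and column names (and a deliberately abbreviated list of valid states):

```python
import pandas as pd

df = pd.read_csv("prospects.csv")  # hypothetical source file

# Percentage of NULL values per column -- a quick proxy for dataset health.
null_pct = df.isnull().mean().mul(100).round(1).sort_values(ascending=False)
print(null_pct)

# Validity check: state values that aren't known two-letter codes.
VALID_STATES = {"AK", "AL", "CA", "NY", "TX"}  # abbreviated for illustration
invalid = df.loc[~df["state"].isin(VALID_STATES), "state"].value_counts()
print(invalid)  # a value like "Los Angeles" would show up here
```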
Techniques such as cross-field and business rule validation will allow you to easily identify these kinds of errors. The former works by checking values between fields: if the country field is listed as “United States,” then the state field obviously can’t be “Ontario.” The latter works by verifying that syncs between systems – such as between your CRM and marketing automation solution – are occurring on a 1-to-1 basis.
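Both techniques can be sketched in a few lines of pandas and plain Python; the fields, IDs, and record sets below are hypothetical:

```python
import pandas as pd

US_STATES = {"AK", "AL", "CA", "NY", "TX"}  # abbreviated for illustration

df = pd.DataFrame({
    "country": ["United States", "United States", "Canada"],
    "state":   ["CA", "Ontario", "ON"],
})

# Cross-field validation: a US country with a non-US state is inconsistent.
mismatched = df[(df["country"] == "United States") & ~df["state"].isin(US_STATES)]
print(mismatched)

# Business rule validation: records synced between CRM and marketing
# automation should match 1-to-1, so the key sets should be identical.
crm_ids = {101, 102, 103}        # hypothetical CRM record IDs
marketing_ids = {101, 103, 104}  # hypothetical marketing automation IDs
print("Missing from marketing:", crm_ids - marketing_ids)
print("Missing from CRM:", marketing_ids - crm_ids)
```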
As a practitioner for many years, I can tell you that the solutions to problems of Data Quality often aren’t rocket science. The culprit is typically user error during data entry, which results in unclean data at the database level. In our state example, this may be as simple as your sales team not properly filling out the state field in Salesforce when they meet a new prospect. The solution may be as simple as changing the rules in Salesforce so that the “state” field must be entered as a two-letter value.
However, regardless of their source, you’ll likely need buy-in from stakeholders in business and IT to resolve these issues. Data Profiling will prove to be your best friend by providing other stakeholders with tangible evidence of ongoing issues and their potential ramifications for both business and IT operations. You can review critical issues as a team and determine what should be done to reduce errors in the future. By ultimately empowering you to solve the problem of unclean data at the source, Data Profiling will save you a great deal of time and heartache in the long run.