As the type and volume of data we deal with increase, so does the challenge of managing and cleaning it. Unstructured data, data from multiple sources, and data whose values continually change all create “messiness” – datasets riddled with noise, inaccuracies, and duplications.
This is a big problem for organizations that rely on data to make business decisions or maintain their day-to-day operations. Messy data can lead to faulty analysis that results in costly errors, customer attrition, poor decision-making, reduced productivity, or incorrect assumptions about products, services, customers, or the market. These issues are bad enough when you have humans working on small datasets, and they only compound when you have machine learning or AI models crunching huge amounts of data and drawing trends accordingly. Clean data, on the other hand, dramatically reduces these issues, paving the way for better decision-making and more profitable outcomes.
Here’s how to clean your messy data to make it a valuable part of your day-to-day operations.
1. Standardize input formats
If you’re dealing with unstructured – i.e., text-based – data or data from multiple sources, chances are you’ll find plenty of inconsistencies across your datasets. To minimize complexity, you’ll first want to standardize all the data you have sitting in Excel, JSON, PDF files, websites, or databases. This involves checking and fixing spelling and formatting errors, applying naming conventions, and confirming date and number formats. For example, is 08/07/12 in MM/DD/YY, DD/MM/YY, or YY/MM/DD format? And unless you want to risk repeating the metric/imperial confusion error that led to the loss of NASA’s $125 million Mars Climate Orbiter, you’ll also want to confirm that your units are consistent across your data.
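For example, if you’re working in Python with pandas, a minimal sketch of this kind of normalization might look like the following. The column names, date format, and unit conversion here are illustrative assumptions, not a prescription for any particular dataset.

```python
import pandas as pd

# Hypothetical raw extract: dates arrive as text in an ambiguous MM/DD/YY style,
# and weights arrive in a mix of pounds and kilograms.
df = pd.DataFrame({
    "signup_date": ["08/07/12", "12/31/11", "01/15/13"],
    "weight": [150.0, 68.0, 72.5],
    "weight_unit": ["lb", "kg", "kg"],
})

# Parse dates with an explicit, documented format so 08/07/12 is never
# silently read as the 8th of July (or of December).
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%m/%d/%y")

# Convert every weight to a single unit so downstream code never has to ask.
LB_TO_KG = 0.45359237
is_lb = df["weight_unit"].eq("lb")
df.loc[is_lb, "weight"] *= LB_TO_KG
df["weight_unit"] = "kg"
```

Whatever conventions you settle on, write them down; the point is that anyone picking up the dataset later should not have to guess.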
2. Bring data into one location
The point of standardizing your data is to make it coherent so that it can easily be consolidated into a single structure. Consolidating your data lets you figure out how each data point relates to the others. For example, it can help you see which ticket in a customer management system goes with which row in a financial spreadsheet, or how to group the different channels through which a customer might communicate with you. As a result, you’ll end up with a single, central pool of data that lends itself to easy management and analysis.
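As a rough illustration, assuming both sources already share a standardized key such as a customer email, the join itself can be as simple as a pandas merge. The tables and column names below are made up.

```python
import pandas as pd

# Hypothetical extracts from two systems that describe the same customers.
tickets = pd.DataFrame({
    "customer_email": ["a@example.com", "b@example.com"],
    "ticket_id": [101, 102],
    "issue": ["billing", "login"],
})
invoices = pd.DataFrame({
    "customer_email": ["a@example.com", "c@example.com"],
    "invoice_total": [250.0, 99.0],
})

# Join on the shared key so each support ticket lines up with the financial
# record for the same customer; tickets without an invoice keep NaN totals.
combined = tickets.merge(invoices, on="customer_email", how="left")
```

The hard part is rarely the merge itself; it’s agreeing on the key and making sure it really means the same thing in every source.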
3. Deal with duplicates
If you’ve consolidated multiple data sources, it’s likely that you’ll encounter duplicates. Data scraped from the web is also likely to contain plenty of near duplicates thanks in part to the prevalence of article “spinning.” The Associated Press and Reuters, for example, put out articles or news releases that are widely republished in slightly varying forms across a number of different publications. Several versions of the same article can easily end up in your machine learning model’s training data and evaluation data, giving you an inflated accuracy number and a false sense of how well your system works. When cleaning your data you’ll want to account for both exact and near duplicates, as both can affect the validity of your data.
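One way to handle both cases, sketched here in Python, is to drop exact duplicates first and then screen what remains with a text-similarity measure. The threshold and the toy articles are arbitrary, and large corpora usually call for something like shingling or MinHash rather than pairwise comparison.

```python
import pandas as pd
from difflib import SequenceMatcher

docs = pd.DataFrame({"text": [
    "Acme Corp announces record Q3 earnings.",
    "Acme Corp announces record Q3 earnings.",         # exact duplicate
    "Acme Corp. announced record Q3 earnings today.",  # near duplicate ("spun")
    "Local bakery wins regional pastry award.",
]})

# Exact duplicates: drop rows whose text matches another verbatim.
docs = docs.drop_duplicates(subset="text").reset_index(drop=True)

# Near duplicates: keep a document only if it isn't too similar to one
# we've already kept, using a simple character-level similarity ratio.
def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

kept_rows = []
for i, text in enumerate(docs["text"]):
    if not any(is_near_duplicate(text, docs.loc[j, "text"]) for j in kept_rows):
        kept_rows.append(i)
deduped = docs.loc[kept_rows]
```

However you deduplicate, make sure you do it before splitting data into training and evaluation sets, not after.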
4. Work around missing data
Once you’ve standardized and consolidated your data, it’s time to screen for and decide how you’ll handle missing entries. It’s easy to become myopically focused on what’s missing and the specific task you’re trying to accomplish right now. But a good dataset can be used for a variety of tasks, and missing data might become available later. The best approach is twofold: be as complete and accurate in your data storage as possible, and wait to fill in any missing values only when it’s time to actually perform your analysis.
When you get to that point, there are a few ways these values can be filled. Ideally you want to track down the real values, but if you can’t, think carefully about what’s missing, why, and what those values are likely to be. While methods such as exclusion and imputation are popular ways of handling missing values, they work on the assumption that data is missing at random. But this isn’t always the case. Say you’re working with revenue numbers, and you have those for publicly traded companies but not privately held ones. Imputing the figures for the latter based on the former will give you numbers that are orders of magnitude larger than the real values.
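As a sketch of the difference this makes, here is a hypothetical pandas example where the missing revenue figures all belong to private companies. The column names and benchmark value are invented purely for illustration.

```python
import pandas as pd

# Hypothetical revenue data: only the private companies are missing values,
# so the data is not missing at random.
df = pd.DataFrame({
    "company": ["MegaCorp", "BigCo", "TinyShop", "LocalLLC"],
    "is_public": [True, True, False, False],
    "revenue_musd": [12000.0, 8500.0, None, None],
})

# Naive imputation: fill with the overall median, which is dominated by the
# public companies and wildly overstates the private ones.
naive = df["revenue_musd"].fillna(df["revenue_musd"].median())

# Better: fill within the group that shares the cause of the missingness,
# using domain knowledge or an external benchmark (a made-up placeholder here).
PRIVATE_BENCHMARK_MUSD = 15.0
private = ~df["is_public"]
df.loc[private, "revenue_musd"] = df.loc[private, "revenue_musd"].fillna(PRIVATE_BENCHMARK_MUSD)
```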
Since filling in missing values involves making assumptions about the underlying relationships in the data, it’s worth collaborating with individuals who work with this data every day. They are likely to know about the hidden causes that lead to missing data, might know about other sources of the information, and if you convince them of the value of your effort, might be more diligent entering data in the future.
5. Pinpoint erroneous data
Now you’ll want to cross-check your data within and across your datasets. While your fields may be standardized and complete, they may be inconsistent or contradictory. There are a few approaches you can take to identify and correct erroneous data. Triangulation, which involves drawing on multiple data sources to cross-check information, is one. Complementarity, where qualitative and quantitative approaches are used together, is another. Data visualization can also help you identify contradictory data, as well as outliers.
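Even simple internal cross-checks and outlier screens catch a surprising amount. Here is a minimal Python sketch using invented order data and the common 1.5×IQR rule of thumb:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "unit_price": [20.0, 22.0, 19.5, 210.0],  # 210.0 looks suspicious
    "quantity": [3, 1, 2, 2],
    "line_total": [60.0, 22.0, 38.0, 420.0],  # 38.0 contradicts price * quantity
})

# Cross-check fields that should agree with each other within the same dataset.
mismatch = (orders["unit_price"] * orders["quantity"] - orders["line_total"]).abs() > 0.01
inconsistent_rows = orders[mismatch]

# Flag outliers with the interquartile-range rule; they aren't necessarily
# wrong, but they deserve a second look.
q1, q3 = orders["unit_price"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_rows = orders[(orders["unit_price"] < q1 - 1.5 * iqr) |
                      (orders["unit_price"] > q3 + 1.5 * iqr)]
```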
6. Look for poorly sampled data
Not all data is created equal. Budget constraints, poor survey designs, and small sample sizes can result in less-than-ideal, oversampled, or undersampled data collection, introducing bias into your results. You can address imbalanced datasets through strategic under- or oversampling of specific categories to mitigate the bias in your existing data. But while you can work around bad data, be aware that you can’t really fix it. What’s important is knowing where the issues lie. For example, if a sample size is small, view any conclusions with suspicion. If a survey question appears biased, take that into account when interpreting results. Additionally, be aware of bounds: if you’re trying to understand how your product will be perceived in Mexico but are only processing English responses, you won’t be getting the full picture, much like how social science experiments run on undergraduate students can fail to be representative of the larger population.
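If the imbalance shows up in labeled categories, a rough rebalancing sketch in pandas might look like the following. The labels and counts are invented, and resampling only rebalances the training signal; it can’t create information that was never collected.

```python
import pandas as pd

# Hypothetical labeled dataset where one class is heavily underrepresented.
df = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "label": ["positive"] * 95 + ["negative"] * 5,
})

# Oversample every class (with replacement) up to the size of the largest one.
target = df["label"].value_counts().max()
balanced = pd.concat(
    [group.sample(n=target, replace=True, random_state=0)
     for _, group in df.groupby("label")]
).reset_index(drop=True)
```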
7. Keep an eye on changing data
Your dataset may be sensitive to the passage of time. Values that were true or accurate at the time of collection, such as job roles or ages, or even the baselines or benchmarks you’re using, may no longer be applicable. Changing relationships and environments are another factor. For example, your data might tell you that emphasizing your product’s low price is a winning sales strategy, but this may become less true as time passes. For this reason you should be cautious when drawing conclusions from old data. Finally, data formats may also change over time: survey questions change, and databases are refactored or stop being maintained. Have contingency plans in place to deal with changing data or systems, but also be mindful of the assumptions you’re making about your data and whether they still apply.
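One lightweight contingency, sketched below, is a schema check that warns when incoming data has drifted from what your pipeline was built for. The expected columns and types here are hypothetical.

```python
import pandas as pd

# Hypothetical contract describing what this pipeline currently expects.
EXPECTED_COLUMNS = {
    "customer_id": "int64",
    "signup_date": "datetime64[ns]",
    "plan": "object",
}

def check_schema(df: pd.DataFrame) -> list[str]:
    """Return human-readable warnings when incoming data drifts from the contract."""
    problems = []
    for col, dtype in EXPECTED_COLUMNS.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    for col in df.columns:
        if col not in EXPECTED_COLUMNS:
            problems.append(f"unexpected new column: {col}")
    return problems
```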
Clean data will cost you, but messy data will cost you more
Data is messy and always changing, and staying on top of it takes a lot of work. But organizations that do can make smarter decisions than the competition. Additionally, since gathering, cleaning, and analyzing data isn’t easy, building flexible data pipeline systems can be a surprisingly durable competitive advantage.
While messy data can lead to inaccuracies or poor strategic decision-making, clean data can be invaluable in helping you improve every area of your business: whether that’s tailoring ad campaigns, deciding on new products or services, or getting ahead of a social media controversy. Data cleaning may be a challenging and time-consuming task, but it’s much less expensive to your business than working with messy or outdated data.