Data cleansing is an important step in preparing data for analysis. It is the process of bringing data up to quality criteria such as validity, uniformity, accuracy, consistency, and completeness. Data cleansing removes unwanted, duplicate, and incorrect data from datasets, helping the analyst develop accurate insights.
Organizations today are sitting on piles of data. Most of them use advanced data tools to collect a variety of data in large volumes. Raw data often contains inaccuracies which, if not removed, can produce false outcomes. Data cleansing, also called data scrubbing, is a fundamental step in data analysis for reaching the right context and conclusions.
Set Up Quality Criteria
The first step in data cleaning is to establish a set of quality criteria. The criteria must be well-defined to filter out “dirty data” from the datasets. Here are a few criteria that you can define.
Data Constraints: There are various types of data constraints to ensure that the data collected is of optimum quality, such as:
- Data-type constraints: values in a field must be of a specific type, such as Boolean, numeric, date, and so on.
- Range constraints: numerical values must fall within defined minimum and/or maximum permissible values.
- Mandatory constraints: designated columns cannot be left empty.
- Unique constraints: a field, or combination of fields, must be unique across the dataset.
- Membership constraints: values in a column must come from a fixed set of discrete values or codes.
- Foreign-key constraints: values in a column must belong to a set of permissible values defined in another table’s column.
- Regular expression patterns: text fields must match a permissible pattern, such as that of a phone number, station code, pincode, and so on.
- Cross-field validation: conditions spanning multiple fields must hold; for example, in an organizational database, an employee’s date of resignation cannot be earlier than their date of joining.
Setting up data constraints helps maintain the quality of data at the very beginning of the data collection process. The sketch below shows how a few of these checks might look in practice.
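A minimal sketch of such checks, assuming a pandas workflow; the dataset, the column names (patient_id, age, blood_group), and the permissible values are all hypothetical:

```python
import pandas as pd

# Hypothetical dataset; columns and values are for illustration only.
df = pd.DataFrame({
    "patient_id": [101, 102, 102, 104],
    "age": [34, -2, 58, 130],
    "blood_group": ["A+", "B+", "XX", "O-"],
})

# Data-type constraint: age must be numeric.
assert pd.api.types.is_numeric_dtype(df["age"])

# Range constraint: age must fall within permissible bounds.
range_violations = df[~df["age"].between(0, 120)]

# Unique constraint: patient_id must not repeat.
duplicate_ids = df[df["patient_id"].duplicated(keep=False)]

# Membership constraint: blood_group must come from a fixed set of codes.
valid_groups = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}
membership_violations = df[~df["blood_group"].isin(valid_groups)]

print(range_violations)       # the rows with ages -2 and 130
print(duplicate_ids)          # both rows sharing patient_id 102
print(membership_violations)  # the row with blood_group "XX"
```

Flagging violations at collection time is cheaper than repairing them downstream, which is why these checks belong at the start of the pipeline.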
Ensure Accuracy
Although data constraints ensure correct values to a degree, accuracy remains a challenge because some fields are subjective. For example, if a form asks for a street address, a pattern constraint can verify that the pincode and state are well-formed, but the street address itself may still be untrue.
Patterns can also be evaded: a phone number may match the expected pattern yet not be real or operational.
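To illustrate, consider a simple regular-expression check in Python; the 10-digit pattern is an assumption for this example, not a universal phone format. Any string that fits the pattern passes, whether or not the number actually exists:

```python
import re

# Hypothetical format: a bare 10-digit phone number. Adjust for your locale.
PHONE_PATTERN = re.compile(r"^\d{10}$")

print(bool(PHONE_PATTERN.match("9876543210")))  # True: well-formed, but possibly fake
print(bool(PHONE_PATTERN.match("98765-4321")))  # False: rejected by the pattern
```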
Remove Duplicate Data
Duplication in datasets often occurs when you combine multiple datasets to generate insight. Duplicate data may also result from incomplete data or a problem in the data structures. For example, in a hospital dataset, if the same patient visits for different ailments, you would not want to count them as two different patients from the same location with the same phone number. Such duplication of data can create a false narrative or significantly skew the outcome.
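A sketch of de-duplication with pandas, following the hospital example; the identifying columns (name, phone) are an assumption about the schema:

```python
import pandas as pd

# Hypothetical visit records: the same patient appears twice.
visits = pd.DataFrame({
    "name": ["Asha Rao", "Asha Rao", "Vikram Shah"],
    "phone": ["9876543210", "9876543210", "9123456780"],
    "ailment": ["fever", "sprain", "migraine"],
})

# Count distinct patients by identity columns, not by raw row count.
patients = visits.drop_duplicates(subset=["name", "phone"])
print(len(visits), "visits from", len(patients), "patients")  # 3 visits from 2 patients
```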
Accept Complete Data Only
While most software tools do not accept incomplete data, collecting it through spreadsheets or paper-based forms can leave gaps. For example, in a maintenance inspection form, the type of problem or the solution provided may not be fully described, or certain fields may be left empty because the information was unknown or unavailable.
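One way to flag incomplete records, again with pandas; the form fields (inspector, problem_type, solution) are hypothetical:

```python
import pandas as pd

# Hypothetical maintenance inspection records with gaps.
forms = pd.DataFrame({
    "inspector": ["R. Gupta", "S. Iyer", None],
    "problem_type": ["leak", None, "corrosion"],
    "solution": ["sealed joint", "replaced valve", None],
})

# Count missing values per field, then keep only fully filled records.
print(forms.isna().sum())
complete = forms.dropna()
print(len(complete), "of", len(forms), "records are complete")  # 1 of 3
```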
Maintain Consistency and Uniformity
Data is inconsistent when two values in different fields contradict each other. This happens when data from multiple departments or time frames is combined. For example, in old data, an employee’s designation may differ from what it is now. If such records are not properly reconciled, these issues can distort the outcome of the analysis.
Uniformity of data refers to the unit of measure. For example, you may measure distance in miles or kilometers, or record currency in dollars, pounds, or euros. Both consistency and uniformity of data are essential to achieving accurate results.
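A sketch of normalizing mixed units before analysis; the column names are hypothetical, and the conversion factor assumes 1 mile = 1.60934 kilometers:

```python
import pandas as pd

# Hypothetical trip records captured in mixed units.
trips = pd.DataFrame({
    "distance": [12.0, 7.5, 20.0],
    "unit": ["mi", "km", "mi"],
})

# Convert everything to kilometers so aggregates compare like with like.
MILES_TO_KM = 1.60934
trips["distance_km"] = trips.apply(
    lambda row: row["distance"] * MILES_TO_KM if row["unit"] == "mi" else row["distance"],
    axis=1,
)
print(trips["distance_km"].sum())  # total distance in a single unit
```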
4-Step Data Cleansing Process
Data cleansing is a four-step process, which builds on all of the above criteria to maintain data integrity. Once the data is collected based on those criteria, here is what can be done to improve accuracy (a sketch of the full loop follows the list).
- Inspection: Perform inspection to detect inaccuracies in the data across the quality criteria.
- Clean: Remove the anomalies from the datasets.
- Verify: Cross-validate the cleaning process to ensure every inaccuracy and inconsistency has been removed.
- Report: Create a report about the changes made and the current quality of the data recorded in the system.
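Tied together, the four steps might look like the following sketch; the checks inside inspect and clean are placeholders for whatever rules fit your data:

```python
import pandas as pd

def inspect(df: pd.DataFrame) -> dict:
    """Inspection: detect inaccuracies across the quality criteria."""
    return {
        "missing": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean: remove the anomalies found during inspection."""
    return df.drop_duplicates().dropna()

def verify_and_report(before: dict, df: pd.DataFrame) -> None:
    """Verify and report: cross-validate the cleaning and record what changed."""
    after = inspect(df)
    assert after["missing"] == 0 and after["duplicates"] == 0
    print(f"before: {before}, after: {after}, rows remaining: {len(df)}")

raw = pd.DataFrame({"id": [1, 1, 2, 3], "value": [10, 10, None, 30]})
issues = inspect(raw)
cleaned = clean(raw)
verify_and_report(issues, cleaned)
```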
Data cleansing must be a continuous process: situations change over time, and that affects the quality of the data recorded. Old, outdated data may not prove helpful. For instance, if you do not update your field sales CRM data from time to time, it might affect your selling potential in the long run.
Periodic cleaning of data helps you generate accurate insights. Ultimately, the accuracy of those insights will shape your business decisions and drive your strategies.