Data is often called the raw material of the information age, and it does share characteristics with the resources that power other industries. For example, imagine trying to make a car out of unrefined iron ore. A lot of processing happens between the mine and the factory. Data is no different. In its “raw” form, data may be difficult or impossible to use until it has been refined, whether by converting it to a readable file format or cleaning it to remove errors and corruption. Data preparation is the process of transforming data from its unusable raw form into a valuable asset.
What Is Data Preparation?
Data preparation removes the errors, duplications, and missing elements of raw data to make it available for processing and analysis by information systems. Before raw data can be processed and analyzed, it has to be cleaned, formatted, standardized, and organized. These operations represent the fundamentals of data preparation.
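To make these fundamentals concrete, here is a minimal sketch using the Python pandas library; the file name and column names are hypothetical:

```python
import pandas as pd

# Hypothetical raw customer data; columns are illustrative.
df = pd.read_csv("customers_raw.csv")

# Remove exact duplicate records.
df = df.drop_duplicates()

# Standardize formatting: trim whitespace and normalize case.
df["email"] = df["email"].str.strip().str.lower()

# Handle missing elements: fill a numeric gap, drop rows
# missing a required key.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["customer_id"])
```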
Organizations collect raw data from many different sources, including the internet, public and commercial datasets, consumer surveys and interviews, and data archives. Data sourcing is the process of collecting raw data from machines through sensors, from humans through direct and indirect interactions, and from business systems, researchers, and third parties, including data brokers.
The goal of data sourcing is to target the best data available, verify its quality before collection, and document the collection process.
- The data being collected is checked for errors, and its accuracy, reliability, consistency, and completeness are confirmed.
- Sourcing verifies that the data is fit for its intended purpose.
- The data is also tested for compliance with privacy regulations and security requirements.
Preparing data for use in machine learning (ML) systems requires transforming it by applying data normalization and encoding to ensure its compatibility with ML algorithms. To make processing as efficient as possible, the data’s complexity is reduced using dimensionality reduction and other techniques so that only the information the ML model needs is preserved.
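As a minimal sketch of these transformations, assuming a small tabular dataset with illustrative column names, the following example encodes a categorical feature, normalizes all features, and applies PCA for dimensionality reduction:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical training data; columns are illustrative.
df = pd.DataFrame({
    "income": [40_000, 55_000, 75_000, 62_000],
    "visits": [3, 8, 5, 12],
    "plan":   ["basic", "pro", "basic", "pro"],
})

# Encoding: convert the categorical column to numeric indicators.
encoded = pd.get_dummies(df, columns=["plan"])

# Normalization: put all features on a comparable scale.
scaled = StandardScaler().fit_transform(encoded)

# Dimensionality reduction: keep only the components the model needs.
reduced = PCA(n_components=2).fit_transform(scaled)
print(reduced.shape)  # (4, 2)
```

In practice, the number of components to retain is a tuning decision based on how much of the data’s variance the model actually needs.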
Benefits of Data Preparation
Data preparation is intended to improve the quality of the information that ML and other information systems use as the foundation of their analyses and predictions. Higher-quality data leads to greater accuracy in the analyses the systems generate in support of business decision-makers. This is the textbook explanation of the link between data preparation and business outcomes, but in practice, the connection is less linear.
Market research firm Gartner estimates that poor data quality costs companies an average of $12.9 million each year, in part by increasing the complexity of information systems and making decision support operations less effective. However, when data preparation is done right, organizations benefit in ways beyond processing efficiency and enhanced decisions:
- Data consistency promotes collaboration within and between teams by giving all participants access to the same information at the same time. This establishes a single source of truth in the company, keeping every team aligned on the same facts and working toward the same goals.
- Customers benefit by interacting with company representatives who have a complete and up-to-date record of their profiles and transaction histories. Employees can resolve customer issues quickly and accurately, making them more efficient and their clients happier.
- Data preparation helps organizations eliminate silos that lock out some data users. Fast access to a central store of data by all business apps improves the quality of analyses and the effectiveness of the decisions that are made based on the analyses.
- Properly prepared data maximizes the return companies realize from their investment in AI. ML algorithms require a steady diet of high-quality and relevant datasets for training and problem-solving.
Careful data preparation adds value to the data itself, as well as to the information systems that rely on the data. It goes beyond checking for accuracy and relevance and removing errors and extraneous elements. The data-prep stage gives organizations the opportunity to supplement the information by adding geolocation, sentiment analysis, topic modeling, and other aspects.
Data Preparation: Step by Step
Building an effective data preparation pipeline begins long before any data has been collected. As with most projects, the preparation starts at the end: identifying the organization’s goals and objectives, and determining the data and tools required to achieve those goals.
These are the steps involved in planning and implementing a data preparation strategy:
- Objectives and requirements: Start by laying out the purpose and scope of the data preparation project, including the roles and responsibilities of its users, what they expect it to accomplish, and the data sources, formats, and types that will serve as inputs. Also determine the requirements for data accuracy, completeness, timeliness, and relevance, as well as the ethical and regulatory standards the data must adhere to.
- Data collection: Tap the files, databases, websites, and other resources that contain the raw data required to achieve the project’s goals. Confirm the reliability and trustworthiness of the sources prior to collection, and then apply web scrapers, APIs, and other tools to access them (a minimal API collection sketch appears after this list). The more varied the resources contributing to the collection, the more comprehensive and accurate the resulting data store will be.
- Data integration: Data from multiple sources is converted into common formats and combined to enable a single, comprehensive view of inputs and outputs. Standard formats include CSV, JSON, and XML. Cloud storage and data warehouses serve as centralized repositories that provide safe, simple access while supporting consistency and governance.
- Data profiling: Each dataset is analyzed to identify its structure, content, quality, and characteristics. The analysis confirms that each column contains a consistent, expected data type, verifies uniformity, and highlights anomalies in the data, such as null values and errors. The profile incorporates metadata, definitions, descriptions, and sources, as well as data frequencies, ranges, and distributions.
- Data exploration: This step uncovers the patterns, trends, and other characteristics in the data to provide a clear picture of its quality and suitability for specific analysis tasks. Descriptive statistics reveal aspects such as mean, median, mode, and standard deviation, while histograms, box plots, scatterplots, and other visualizations show data distributions, patterns, and relationships (profiling and exploration are sketched together after this list).
- Data transformation: Data formats, structures, and values are reconciled to eliminate incompatibilities between the source and the target system or application. Techniques used to make the data accessible and usable include normalization, aggregation, and filtering (see the transformation and enrichment sketch after this list).
- Data enrichment: In this step, the data is refined and enhanced by combining it with related information gathered from other sources and segmenting it into entity groups or attributes, such as demographic or location data. Missing values can be estimated from other data, such as deriving “age” from a person’s date of birth. Unstructured text is assigned categories, and context can be added using geocoding, entity recognition, and other techniques.
- Data validation: The accuracy, completeness, and consistency of the data are confirmed by checking it against predetermined criteria and rules based on the requirements of your systems and apps. Validation confirms data types, ranges, and distributions, and it identifies missing values and other potential gaps (a minimal rule-based validation sketch appears after this list).
- Data sharing and documentation: Maintaining the data and confirming that it complies with applicable regulations requires documenting its definitions, descriptions, sources, formats, and types. Metadata standards for this purpose include Dublin Core, Schema.org, and JSON-LD (a metadata documentation sketch appears after this list).
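As referenced in the data collection step, here is a minimal sketch of pulling records through an API with Python’s requests library. The endpoint URL, parameters, and response shape are all hypothetical:

```python
import requests

# Hypothetical endpoint; substitute a verified, trusted source.
url = "https://api.example.com/v1/records"
response = requests.get(url, params={"since": "2024-01-01"}, timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors

records = response.json()  # assumes the API returns a JSON array
print(f"Collected {len(records)} raw records")
```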
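Profiling and exploration can both start from a few standard pandas operations. A minimal sketch, assuming a hypothetical CSV file:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name

# Profiling: structure, types, and anomalies such as null values.
print(df.dtypes)          # confirm each column holds the expected type
print(df.isnull().sum())  # count missing values per column
print(df.nunique())       # spot unexpected cardinality

# Exploration: descriptive statistics and distributions.
print(df.describe())      # mean, std, quartiles for numeric columns
df.hist(figsize=(8, 6))   # histograms of each numeric column
plt.show()
```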
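The transformation and enrichment steps often amount to a handful of reshaping operations. A sketch with illustrative column names:

```python
import pandas as pd

# Hypothetical order data.
orders = pd.DataFrame({
    "customer": ["a", "a", "b", "c"],
    "amount":   [20.0, 35.0, 15.0, 80.0],
    "dob":      pd.to_datetime(["1990-05-01", "1990-05-01",
                                "1985-11-12", "2001-02-03"]),
})

# Filtering: drop records outside the analysis scope.
orders = orders[orders["amount"] > 0]

# Normalization: rescale amounts to the 0-1 range.
amt = orders["amount"]
orders["amount_norm"] = (amt - amt.min()) / (amt.max() - amt.min())

# Aggregation: summarize to one row per customer.
per_customer = orders.groupby("customer")["amount"].sum()

# Enrichment: derive a missing attribute ("age") from date of
# birth (approximate, using 365-day years).
orders["age"] = (pd.Timestamp.today() - orders["dob"]).dt.days // 365
```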
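Validation rules can be expressed directly in code. A minimal sketch; the column names and thresholds are placeholders for whatever criteria your systems require:

```python
import pandas as pd

df = pd.read_csv("prepared.csv")  # hypothetical prepared dataset

errors = []
if df["customer_id"].isnull().any():
    errors.append("missing customer_id values")
if not df["age"].between(0, 120).all():
    errors.append("age outside the allowed 0-120 range")
if df["email"].duplicated().any():
    errors.append("duplicate email addresses")

if errors:
    raise ValueError("validation failed: " + "; ".join(errors))
print("All validation rules passed")
```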
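Finally, documentation can be captured as machine-readable metadata. A minimal sketch that writes a JSON-LD record using Dublin Core terms; every value is a placeholder:

```python
import json

# Minimal dataset documentation using Dublin Core terms in JSON-LD.
metadata = {
    "@context": {"dc": "http://purl.org/dc/terms/"},
    "dc:title": "Customer transactions, cleaned",
    "dc:description": "Deduplicated and validated transaction records",
    "dc:source": "internal CRM export",  # placeholder provenance
    "dc:format": "text/csv",
    "dc:created": "2024-06-01",
}

with open("dataset_metadata.jsonld", "w") as f:
    json.dump(metadata, f, indent=2)
```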
Challenges of Data Preparation for Machine Learning and AI
Three misconceptions about preparing data for ML and AI applications cause projects to go off the rails:
- More equals better. In fact, less is decidedly more when selecting the datasets that will power ML systems, so long as they’re the right datasets. Too much data leads to inefficiencies, wasted resources, and noise that degrades the model’s performance, accuracy, and reliability.
- Do it once. Data preparation is never a one-time task, because new, more relevant data is always being generated. Also, as models learn, their needs change, so your data-preparation priorities and sources will need to be updated.
- Manual is better. The pace of modern business dictates that any process that can be automated reliably should be automated. Human-powered data preparation is time-consuming and likely to introduce errors that faster automated tools avoid.
Many of the factors that hinder data preparation efforts relate to characteristics of the data itself, such as inconsistent data formats, biased data (skewed to favor a specific population or location, for example), insufficient data labeling, and outdated or irrelevant data.
Appropriate data preparation is the key to the successful development and implementation of AI systems in large part because AI amplifies existing data quality problems. For example, it may cause an ML-based application to generate analyses that appear valid but don’t accurately represent the real-world situation they attempt to model. The fundamentals of data preparation form the foundation of the AI applications that hold so much promise for individuals and businesses alike.