According to Gartner, 85% of Data Science projects fail (and are predicted to do so through 2022). I suspect the failure rates are even higher, as more and more organizations today are trying to utilize the power of data to improve their services or create new revenue streams. Not having the “right” data continues to prevent businesses from making the best choices. But live production data is also a massive liability, as it requires regulatory governance. Hence, many organizations are now turning towards using synthetic data – aka fake data – to train their machine learning models.
Synthetic data solves many problems: It doesn’t require compliance with data regulations, can be used freely in test environments, and is readily available. However, relying on poorly created synthetic data carries a risk that the model will fail the minute it is productionized.
Let’s explore this in detail.
Is Poor Data Quality Causing a Competitive Disadvantage?
Organizations with good core data are winning at the analytics game. It is evident that upfront investment in improving and maintaining good-quality data pays dividends in the future.
It has been estimated that data scientists spend almost half of their time not solving business problems but cleansing and loading data. Simple arithmetic tells us we either need twice the talent or will solve only half of the allocated business problems.
Over and above these inefficiencies, poor-quality data is also responsible for significant revenue leakage, a lack of trust across the enterprise, delayed go-to-market strategies, and weaker data-driven decision-making, all of which erode trust with customers and regulators. It is clear, then, that poor data quality is causing a competitive disadvantage.
How to Restrict Liability of Real Data by Using Synthetic Data
As mentioned earlier, live production data is a huge liability. Organizations need to exercise data minimization in their analytics and Data Science initiatives. This is not just to keep the regulators happy but is also in line with the ethical practice of “doing right by the customer.”
Machine learning models require a large amount of usable data to train effectively, and that data often needs to be enriched to ensure all bases are covered. For example, if the available data covers scenario A well but scenario B, which is also possible, is under-represented, the dataset can be complemented with additional synthetic data for scenario B, as the sketch below illustrates.
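As a rough illustration of that kind of enrichment, the sketch below bootstraps extra rows for an under-represented scenario and perturbs numeric columns with a little noise so the new rows are not exact copies of real records. The function, the column names, and the `is_synthetic` flag are hypothetical; production-grade augmentation would normally rely on a proper generative model rather than resampling.

```python
import numpy as np
import pandas as pd


def augment_scenario(df, scenario_col, scenario_value, target_rows,
                     noise_scale=0.05, seed=42):
    """Bootstrap extra rows for an under-represented scenario.

    New rows are resampled (with replacement) from the existing rows for that
    scenario, with small Gaussian noise added to numeric columns so they are
    not exact copies of real records.
    """
    rng = np.random.default_rng(seed)
    subset = df[df[scenario_col] == scenario_value]
    n_needed = target_rows - len(subset)
    if n_needed <= 0:
        return df.assign(is_synthetic=False)  # already enough real rows

    synthetic = subset.sample(n=n_needed, replace=True, random_state=seed).copy()
    for col in synthetic.select_dtypes(include="number").columns:
        std = subset[col].std(ddof=0)  # population std; 0.0 for a single row
        synthetic[col] = synthetic[col] + rng.normal(0.0, noise_scale * std,
                                                     size=len(synthetic))

    # Flag the origin of each row so downstream checks can tell them apart.
    return pd.concat(
        [df.assign(is_synthetic=False), synthetic.assign(is_synthetic=True)],
        ignore_index=True,
    )


# Hypothetical usage: "scenario" and "B" are illustrative, not real column values.
# augmented = augment_scenario(transactions, "scenario", "B", target_rows=10_000)
```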
If data is synthetic, it means:
- It does not need to be compliant with GDPR and other regulations
- It can be made in abundance for a variety of conditions and drivers
- Data can be created for unencountered conditions
- Data can be well-cataloged
- Data creation is highly cost-effective
Why Remediating Data Quality Is the Right Answer
Now that we understand that poor-quality data is causing a competitive disadvantage and synthetic data is solving many problems, let’s marry the two.
How do you create synthetic data?
A simplistic solution is to analyze the production data and replicate its statistical properties; a more realistic approach is to train a machine learning model that learns the real data’s properties, parameters, and constraints. This is a more complex approach, and there are many open-source ways of doing it.
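As a toy example of the “replicate the statistical properties” approach, the sketch below fits column means, covariances, and category frequencies to a production extract and samples new rows from them. Open-source projects such as SDV take the more sophisticated, model-based route; the function and column names here are purely illustrative assumptions.

```python
import numpy as np
import pandas as pd


def generate_synthetic(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Generate synthetic rows that mimic the statistical properties of df.

    Numeric columns are drawn from a multivariate normal fitted to the
    observed means and covariances (preserving correlations between them);
    categorical columns are sampled independently from their observed
    frequencies. Real generators model the joint distribution more faithfully.
    """
    rng = np.random.default_rng(seed)
    out = pd.DataFrame(index=range(n_rows))

    numeric = df.select_dtypes(include="number")
    if not numeric.empty:
        mean = numeric.mean().to_numpy()
        cov = numeric.cov().to_numpy()
        samples = rng.multivariate_normal(mean, cov, size=n_rows)
        for i, col in enumerate(numeric.columns):
            out[col] = samples[:, i]

    for col in df.select_dtypes(exclude="number").columns:
        freqs = df[col].value_counts(normalize=True)
        out[col] = rng.choice(freqs.index.to_numpy(), size=n_rows, p=freqs.to_numpy())

    return out


# Hypothetical usage with an illustrative production extract:
# synthetic_customers = generate_synthetic(customers_df, n_rows=50_000)
```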
If the synthetic data does not replicate the poor data quality of the real-life data, then there is a high likelihood that this machine learning model will fail upon productionization. The only way to resolve this is to ensure robust data quality checks on the real-life data.
Completeness, accuracy, and uniqueness checks will resolve many data quality issues. Reconciling data as it moves through its pipelines will resolve even more.
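As a minimal sketch of what such checks can look like in practice, the snippet below computes per-column completeness, counts duplicate business keys, and reconciles row counts and a control total between two pipeline hops. Accuracy checks usually need a trusted reference source, so they are omitted here; all function and column names are assumptions for illustration.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, key_cols: list[str]) -> dict:
    """Basic completeness and uniqueness checks on a production extract."""
    return {
        # Completeness: share of non-null values per column.
        "completeness": (1 - df.isna().mean()).round(4).to_dict(),
        # Uniqueness: rows that share a business key with an earlier row.
        "duplicate_key_rows": int(df.duplicated(subset=key_cols).sum()),
    }


def reconcile(source: pd.DataFrame, target: pd.DataFrame, amount_col: str) -> dict:
    """Reconcile a pipeline hop: row counts and a control total should match."""
    return {
        "row_count_diff": len(source) - len(target),
        "control_total_diff": float(source[amount_col].sum() - target[amount_col].sum()),
    }


# Hypothetical usage: the column names below are illustrative only.
# print(quality_report(orders_df, key_cols=["customer_id", "order_id"]))
# print(reconcile(staging_df, warehouse_df, amount_col="amount"))
```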
Finding data quality issues and remediating them is essential before relying on synthetic data to solve business problems.
Conclusion
Synthetic data simulation is an excellent concept; however, it should not be mistaken for the resolution of all data issues we face daily in Data Science.
Covering up the problem by creating new data will not make the original issues disappear. Investment in data quality pays dividends and is well worth making.