When we talk to prospective customers, their first questions are usually around the fundamentals of data quality, including what it is, how we measure it, what happens when it goes south, and how data quality issues can be prevented.
Our answers always come back to the core belief that drives our mission: Data is the lifeblood of the modern enterprise, and having the confidence to make business decisions based on data is critical. Data quality, then, is the quality of that lifeblood. For it to power your business engine correctly, it’s important to continuously perform accurate and comprehensive data quality checks. It’s also important to focus these checks on the data with the highest potential impact on business decision-making – especially when you’re trying to monitor data quality at scale.
What’s Hard About Monitoring Data Quality at Scale?
Any mention of “data quality” tends to paint a picture of a domain expert tediously inspecting and interpreting records by hand, one at a time. It feels like an analysis exercise steeped in manual judgment, drawing on troves of context and tribal knowledge accumulated over the years. And it feels like a process that is impossible to scale.
The skepticism around building out data quality checks at scale is fair. Traditionally, data quality issues have sat close to the line of business. They tend to be nuanced errors: incorrect operating hours displayed online for a brick-and-mortar store location, which could hurt customer engagement with a marketing campaign; a newly designed piece of clothing being tagged incorrectly; or the wrong size being entered at the point of sale, feeding improper data into inventory reporting or sales forecasts for a given market.
Data quality today spans a much broader spectrum in the scaled-out modern data stack. In particular, issues born out of data operations at scale are distinct from the subjective line-of-business data quality issues. So, while subjective checks are hard to scale, maybe the problem that really needs solving at scale isn’t that subjective after all.
Not All Data Quality Issues Are Subjective
In the modern data stack, data quality issues range from semantic and subjective – which are hard to define – to operational and objective, which are easy to define. For instance, objective, easier-to-define issues include data showing up with empty fields, duplicate transactions being recorded, or missing transactions. Other concrete, operational issues include data uploads not happening on time for critical reporting, or a data schema change that drops an important field.
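To make the objective end of that spectrum concrete, here is a minimal, hand-rolled sketch of how a few such checks could be expressed. It assumes the data lands in a pandas DataFrame and uses hypothetical file and column names; it’s an illustration, not a prescribed implementation.

```python
import pandas as pd

# Hypothetical batch of transactions loaded from a landing table.
transactions = pd.read_parquet("transactions.parquet")
required_columns = ["transaction_id", "amount", "created_at"]

# Schema change: an important field should still be present after an upstream change.
missing_columns = set(required_columns) - set(transactions.columns)

# Empty fields: required columns that are present should not contain nulls.
present = [c for c in required_columns if c in transactions.columns]
null_counts = transactions[present].isnull().sum()

# Duplicate transactions: the primary key should be unique.
duplicate_count = transactions.duplicated(subset=["transaction_id"]).sum()

print(f"missing columns: {missing_columns}")
print(f"nulls per column:\n{null_counts}")
print(f"duplicate transaction_ids: {duplicate_count}")
```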
Whether a data quality issue is highly subjective or unambiguously objective depends on the layer of the data stack it originates from. A modern data stack and the teams supporting it are commonly structured into two broad layers: 1) the data platform or infrastructure layer; and 2) the analytical and reporting layer. The platform team, made up of data engineers, maintains the data infrastructure and acts as the producer of data. This team serves the consumers at the analytical layer: analytics engineers, data analysts, and business stakeholders.
At the highest layers of the stack, issues tend to be domain-specific, subjective, and hard to detect automatically. At the platform layer, data quality issues stem from failures in data operations. And while those platform-layer issues are too often assumed to be subjective and complex to monitor, they are usually cut-and-dried.
Unlike subjective issues, which require manual judgment from a business stakeholder such as an analyst, operational data quality issues can be tied to objective criteria specified as service level indicators, objectives, and agreements (SLIs/SLOs/SLAs). These issues also tend to cluster into a small set of common categories across a wide variety of businesses and data stacks. That makes it possible for the right tool to provide out-of-the-box primitives to detect them and to support a workflow that scales easily across the enterprise.
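One way to picture this is as a small, explicit specification attached to each check. The structure below is purely illustrative – the field names and thresholds are assumptions, not any product’s schema – but it shows how an SLI, an SLO, and an SLA can turn an operational issue into an objective, machine-checkable criterion.

```python
from dataclasses import dataclass

@dataclass
class DataQualitySpec:
    """Illustrative spec attaching objective criteria to an operational check."""
    sli: str            # what is measured, e.g. minutes since the newest row landed
    slo_minutes: float  # internal target the data team aims for
    sla_minutes: float  # external commitment made to data consumers

# Example: the 'orders' table should refresh hourly; consumers are promised 4 hours.
orders_freshness_spec = DataQualitySpec(
    sli="minutes since newest row in the orders table",
    slo_minutes=60,
    sla_minutes=240,
)

def is_anomalous(observed_minutes: float, spec: DataQualitySpec) -> bool:
    # A breach of the SLO flags the check as anomalous before the SLA is at risk.
    return observed_minutes > spec.slo_minutes
```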
So, what exactly is the set of common operational data quality issues that can be easily avoided?
Common Operational Data Quality Pitfalls
Operational data quality issues can have a huge impact on a business, and they typically fall into one of four buckets (a rough code sketch of such checks follows the list).
1. Data availability issues: Data shows up too late, in the future, or not at all; data drops in volume; data shows up in duplicates.
2. Data conformity issues: Data shows up with the wrong schema or wrong data types; data doesn’t match the expected regular expression (e.g., an incorrect number of credit card digits); alphanumeric strings show up in place of numerals.
3. Data validity issues: Data shows up with unexpected values, even though it’s available with the right volume at the right time and in the right format. If you’re looking at financial data, for example, it could show up in cents instead of dollars, which means it’s off by a factor of 100 relative to what it normally looks like.
4. Data reconciliation issues: Data is inconsistent at two different points in the data pipeline. This might look like the number of sales transactions ingested into a landing table not matching the processed table that feeds the BI dashboard, or the sum of payment transactions for a merchant not matching the amounts disbursed by the bank.
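As a rough illustration of what checks in the conformity, validity, and reconciliation buckets can look like in code (assuming pandas DataFrames and hypothetical table and column names; availability is covered by the freshness example in the next section):

```python
import pandas as pd

landing = pd.read_parquet("payments_landing.parquet")      # raw payments as ingested
processed = pd.read_parquet("payments_processed.parquet")  # table feeding the BI dashboard

# Conformity: card numbers should match the expected pattern (16 digits assumed here).
bad_card_numbers = ~landing["card_number"].astype(str).str.fullmatch(r"\d{16}")

# Validity: amounts recorded in cents instead of dollars show up roughly 100x larger
# than usual; flag rows far outside the historical range.
median_amount = processed["amount"].median()
suspect_amounts = processed["amount"] > 50 * median_amount

# Reconciliation: row counts and payment totals should agree across the pipeline.
count_mismatch = len(landing) != len(processed)
total_mismatch = abs(landing["amount"].sum() - processed["amount"].sum()) > 0.01

print(bad_card_numbers.sum(), suspect_amounts.sum(), count_mismatch, total_mismatch)
```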
While these operational data quality issues are quite common, they can be easily avoided with automated, proactive monitoring.
How to Avoid Operational Data Quality Issues
Each data issue can be assigned time series metrics called Data Quality Indicators (DQIs) that can be continuously computed and proactively monitored. DQIs are effectively the SLIs attached to the data layer. Criteria for a DQI to be considered anomalous derive from service level objectives (SLOs) and service level agreements (SLAs) established by the business and data owners.
Operational data quality issues, and the DQIs associated with them, relate directly to the operation of the data pipeline. For example, a DQI could be the data freshness of a table (the age of its newest row). The expectation for this DQI is simply the cadence at which the pipeline should run and refresh the table: hourly, daily, or every minute. Unlike KPIs, which measure the health of the business and often tend to be subjective, DQIs measure the health of data operations and are evaluated unambiguously against the specification of the data pipeline. Moreover, the set of DQIs needed to track the operational data quality issues mentioned earlier is universal – meaning it applies to a data pipeline regardless of the vertical or the specifics of the business.
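As a minimal sketch of what the freshness example could look like in practice – assuming a pandas table with a timezone-aware ingestion timestamp, and with names and thresholds that are illustrative rather than any vendor’s API:

```python
from datetime import timedelta
import pandas as pd

# SLO: this table is expected to refresh at least hourly (assumed cadence).
FRESHNESS_SLO = timedelta(hours=1)

def freshness_dqi(table: pd.DataFrame, timestamp_column: str) -> pd.Timedelta:
    """DQI: age of the newest row, measured against the current time (UTC)."""
    newest_row_time = table[timestamp_column].max()
    return pd.Timestamp.now(tz="UTC") - newest_row_time

# Hypothetical events table with a timezone-aware 'ingested_at' column.
events = pd.read_parquet("events.parquet")
age = freshness_dqi(events, "ingested_at")

if age > FRESHNESS_SLO:
    # In a real deployment this would raise an alert or open an incident
    # rather than just printing.
    print(f"Freshness SLO violated: newest row is {age} old (SLO: {FRESHNESS_SLO})")
```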
Because DQIs are universal, platforms can provide built-in DQIs that can be lit up on a data pipeline across all data assets (tables, views, and columns) with little or no configuration. Such platforms can rapidly deploy quality checks across your entire data landscape, providing instant visibility into data anomalies and the data quality intelligence needed to keep data health high. This approach has allowed data teams to hit their data quality coverage goals 10 times faster than with legacy data quality solutions. Custom configuration of DQIs also makes it easy to fine-tune indicators so that, as your data scales, non-compliant data and anomalies are detected by AI for immediate analysis, clearing the way for the well-informed decision-making that propels the business forward.
Originally published on the Lightup blog.