
Mind the Gap: Data Quality Is Not “Fit for Purpose”

By Mark Cooper

Welcome to the latest edition of Mind the Gap, a monthly column exploring practical approaches for improving data understanding and data utilization (and whatever else seems interesting enough to share). Last month, we explored the rise of the data product. This month, we’ll look at data quality vs. data fitness.

Everybody likes a pithy definition. Marketers describe them as “sticky,” or easy to remember. Of course, that doesn’t always mean they’re useful or completely accurate. Information management has a couple. Metadata is almost universally described as “data about data,” but I’d be willing to bet that you rolled your eyes just now. How many times have we seen metadata introduced in that way, with the presenter or author immediately apologizing and then moving on to a more useful description?

Similarly, the data quality bumper sticker reads “fit for purpose.” You can probably already guess that I’m not a fan. Let’s pull out our DMBoK and see what it says:

The term data quality refers both to the characteristics associated with high quality data and the processes used to measure or improve the quality of data. [DMBoK-2, 644]

Characteristics and processes. Sounds good so far. Continuing:

Data is of high quality to the degree that it meets the expectations and needs of data consumers. That is, if the data is fit for the purposes to which they want to apply it. It is of low quality if it is not fit for those purposes. Data quality is thus dependent on context and on the needs of the data consumer. [DMBoK-2, 644; emphasis added]

This definition has deep roots in the field of quality management, incorporating concepts articulated by some of its giants, including Joseph Juran (“fitness for use”), Philip Crosby (“conformance to requirements”), and W. Edwards Deming (“meeting or exceeding the customer’s expectation”). Their common thread is the focus on consumption, driven by precise and complete specifications. Quality improves as requirements and processes improve. This disciplined approach is common with engineered and manufactured items like missiles, automobiles, and consumer goods.

It is less common with data.

Far be it from me to challenge the accumulated knowledge of our field, but I very strongly disagree with defining data quality as “fit for purpose.”

Imagine you’re looking to purchase a used car to get you back and forth to work. You don’t have a lot of money, but you have to drive only a couple of miles each way. You find a car that’s extremely inexpensive, but the engine overheats after running for about a half hour. You buy it despite the engine problem because it satisfies your requirements: really low price and a short commute. It is fit for your purpose, and therefore, by the DMBoK definition, it is “high quality.”

One day, you want to visit family a couple hundred miles away. You set out in your “high-quality” car and haven’t even completed 10% of the trip when you have to stop and let the engine cool. At this rate, the journey will take days. You curse this piece of junk. The car is now “low quality” because it does not satisfy the new purpose to which you wanted to apply it.

The car was evaluated as both high quality and low quality, even though nothing about the car changed.

It was your perception of the car’s quality relative to a new purpose that changed. 

When talking about data quality, we must therefore be clear about whose purpose, what requirements, established when, and by whom.

Within the context of the DMBoK definition, the answer is that every consumer evaluates the quality of a data set independently. Data is considered to be of high quality when it is fit for my purpose, satisfies my requirements, established by me when I need the data.

Data quality, defined in this way, is truly in the eye of the beholder.

Furthermore, data quality analyses cannot be leveraged by new consumers. For decades, we in decision support have been selling the benefits of leveraging data across applications and analyses. It has been the fundamental justification for data warehouses, data lakes, data lakehouses, etc. But misalignment between the purpose for which data was created and the purpose for which it is being used may not be immediately apparent, especially when the data is not well understood. The consequences are faulty models and erroneous analyses. We reflexively blame the quality of the data, but that’s not where the problem lies.

This is not data quality. 

It is data fitness.

The DMBoK doesn’t recognize data fitness as a specific knowledge area but mentions it as part of data profiling:

Assessing the fitness of the data for a particular use requires documenting business rules and measuring how well the data meets those business rules. [DMBoK-2, 418; emphasis added]

But this sounds an awful lot like “data is of high quality to the degree that it meets the expectations and needs of data consumers.” It seems like quality and fitness are being conflated. 

And confused. 

I’m confused.

As a friend recently commented, “We need quality for the definition of quality.”

Let’s go back to the data headwaters: the customer for whom the data was created in the first place. The needs and utilization context for that customer were:

  • Expressed in their requirements, epics, features, and/or user stories
  • Captured in the data definitions, expected content, and other quality dimensions
  • Implemented in the application

The needs of additional downstream consumers known a priori may also have been considered, but most of these uses and users emerge after the application is deployed. 

This original set of requirements is the only standard against which data quality should be measured. This allows us to definitively answer the questions of whose purpose, what requirements, established when, and by whom.

Data quality is the degree to which data conforms to the requirements for which it was created (definition, expected content, etc.). 

We know how to do that. The DMBoK lists several data quality dimensions, each with objective measures. The standard is now clear.
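To make this concrete, here is a minimal sketch in Python of quality measured as conformance to the producer’s original requirements. All names here (the fields, the rules, the scoring function) are hypothetical illustrations, not anything from the DMBoK; the point is only that the standard is fixed at creation time and the measures are objective.

```python
import re

# Hypothetical original requirements, captured when the data was created:
# - customer_id must be present (a completeness dimension)
# - email must match a simple pattern (a validity dimension)
ORIGINAL_REQUIREMENTS = {
    "customer_id": lambda v: v is not None,
    "email": lambda v: v is not None
        and re.fullmatch(r"[^@\s]+@[^@\s]+", v) is not None,
}

def quality_score(rows):
    """Fraction of field values conforming to the original requirements."""
    checks = [rule(row.get(field))
              for row in rows
              for field, rule in ORIGINAL_REQUIREMENTS.items()]
    return sum(checks) / len(checks)

rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},  # fails the validity rule
]
print(quality_score(rows))  # 3 of 4 checks pass -> 0.75
```

Every consumer who measures against this same standard gets the same answer, which is exactly what the “whose purpose, established when, by whom” questions demand.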

The definition of data fitness also becomes clear. 

Data fitness is the degree to which data conforms to the requirements for which it is being considered for use.

Data fitness, not data quality, is evaluated by each new potential consumer. The question being asked is, “Does this data satisfy my needs?” not, “Is this data of high quality?” 

We know how to do that too.
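The same mechanics cover fitness; only the standard changes. A sketch, again with hypothetical field names and rules: the data is unchanged, so its quality against the original specification is constant, while its fitness varies with each new consumer’s requirements — the used-car scenario in code.

```python
def conformance(rows, requirements):
    """Fraction of field values satisfying a set of per-field rules."""
    checks = [rule(row.get(field))
              for row in rows
              for field, rule in requirements.items()]
    return sum(checks) / len(checks)

rows = [{"temp_f": 72}, {"temp_f": None}]

# Producer's original requirement: the reading is optional.
original = {"temp_f": lambda v: True}

# A new consumer needs every reading present and within a plausible range.
new_consumer = {"temp_f": lambda v: v is not None and -40 <= v <= 140}

print(conformance(rows, original))      # quality vs. original spec: 1.0
print(conformance(rows, new_consumer))  # fitness for the new use: 0.5
```

Nothing about the rows changed between the two calls; only the standard being applied did.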

Finally, consumers can request upstream application changes to accommodate their specific requirements. Don’t frame these requests as quality improvements, though. This might at least partially explain why development teams are less than excited to hear from us when we approach them with “data quality” issues related to our expectations, not their requirements.

I hate to introduce (or reintroduce) vocabulary into a field that drops new terms like a hay baler, but I believe that it is worthwhile to more clearly differentiate between data fitness and data quality. Each has a different meaning and a different purpose. Each is a separate knowledge area.

Something to consider for DMBoK-3.