This is the first in a two-part series exploring Data Quality and the ISO 25000 standard.

In the 1964 dark comedy Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb, General Jack D. Ripper orders a nuclear strike on the USSR. Despite efforts to recall the bombers, one plane successfully drops a bomb on a Soviet target, triggering the Doomsday Device. Once activated, it cannot be disarmed, and the result is global annihilation. The Doomsday Device was built as a deterrent, but it was activated before it was announced. Nobody knew about it. “The Premier was going to announce it on Monday. As you know, the Premier loves surprises.”
What if there was an international standard for data quality, but nobody knew about it?
There is: ISO 25012.
OK, it’s not even remotely the same thing. I get it. But when I first heard about the ISO 25012 standard, I was reminded of that scene.
The International Organization for Standardization (ISO) creates and publishes standards for quality, safety, and efficiency. I suspect that many of you are familiar with the ISO 9001 standard for quality management. Maybe fewer are familiar with the ISO standards for screw threads (ISO 261-263) that enable bolts made by one manufacturer to fit into nuts made by another.
Same folks.
In 2023, ISO published the 42001 standard that “specifies requirements for establishing, implementing, maintaining, and continually improving an Artificial Intelligence Management System (AIMS) within organizations. It is designed for entities providing or utilizing AI-based products or services, ensuring responsible development and use of AI systems.” You heard about that one? I hadn’t either until I went poking around the ISO website.
Individual countries then adapt the ISO standards to their specific needs. In the United States, that’s the responsibility of the American National Standards Institute (ANSI). In India, it’s the Bureau of Indian Standards (BIS). And so forth. Sometimes the ISO standard is adopted in its entirety. Sometimes not. A familiar example is signage in public spaces, like exit signs. The ISO standard uses pictograms that can be understood regardless of language. The ANSI standard requires the English word “EXIT.”
The full title for the ISO 25000 series is System and Software Engineering — Systems and Software Quality Requirements and Evaluation, abbreviated SQuaRE. Portions, some under different standard numbers and titles, have been under development since the 1980s and were consolidated into SQuaRE in 2005. Most have been updated since then. For the record, in the United States, ANSI has fully adopted the ISO 25000 standard.
The standard consists of five quality-focused divisions – requirements, model, management, measurement, and evaluation – as well as an extension division that addresses specific application domains for a total of 20 different standards.
One challenge for practitioners is that it costs about $3,000 to download all of the documents in the standard group.
This barrier to even seeing much of the standard also reminded me of the movie. If the objective is to get companies to conform to the standard, that doesn’t seem like a very good way to go about it. Fortunately, some of the details have been published publicly in conference presentations and white papers, and by country-specific standards organizations.
The Guide to SQuaRE provides an overview and roadmap for the standard and can be downloaded from ISO for free. Well, more accurately, it can be purchased for zero Swiss francs. I won’t reproduce all the details here, but it is worthwhile to read.
From the Introduction:
The general goal of creating the SQuaRE set of International Standards was to move to a logically organized, enriched and unified series covering two main processes: software quality requirements specification and systems and software quality evaluation, supported by a systems and software quality measurement process. The purpose of the SQuaRE set of International Standards is to assist those developing and acquiring systems and software products with the specification and evaluation of quality requirements. It establishes criteria for the specification of systems and software product quality requirements, their measurement, and evaluation.
The focus of the standard (as its name suggests) is software quality, but highlighted as an innovation is the introduction of a data quality model. Finally, we’ve found our way to data, and to a pair of standards in particular:
25012: Data Quality Model
25024: Measurement of Data Quality
The Data Quality Model is made up of fifteen data quality characteristics, categorized as inherent, system dependent, or a combination of the two.
Inherent characteristics apply to the data itself: domain values, business rules, relationships, and metadata. They describe the potential of the data to “satisfy stated and implied needs when the data is used under specified conditions.” I’m going to defer the question of whose specified conditions. It’s an important question, but not one that’s critical to address right now.
The inherent data quality characteristics are:
Accuracy: The degree to which data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use, both syntactically and semantically. Also included is data model accuracy.
Completeness: The degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use. In other words, it asks whether all of the attributes are populated and all of the records are present. Also included is conceptual data model completeness. From an analytics perspective, completeness needs to be considered not just for a single data file, system, or table, but across each domain. For example, a company might have four different customer systems, each collecting a portion of the information about a customer. A data feed might be complete with respect to a particular source system, but we need to know how much of the entire domain is covered, not just by that system, but by the set of customer systems as well, as the sketch below illustrates.
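To make that domain-level view concrete, here is a minimal Python sketch that compares attribute-level completeness for a single source system against the merged view across several customer systems. The system names, attribute list, and sample records are all made up for the example; they are not part of the standard.

```python
# Minimal sketch: attribute-level completeness for one source vs. the whole customer domain.
# The system names, attributes, and records below are hypothetical.

EXPECTED_ATTRIBUTES = ["customer_id", "name", "email", "phone", "mailing_address"]

def completeness(records, expected=EXPECTED_ATTRIBUTES):
    """Fraction of expected attribute values that are actually populated."""
    if not records:
        return 0.0
    populated = sum(
        1 for rec in records for attr in expected if rec.get(attr) not in (None, "")
    )
    return populated / (len(records) * len(expected))

# One source system only captures a slice of the customer domain...
crm_records = [{"customer_id": "C1", "name": "Ada", "email": "ada@example.com"}]

# ...while the domain view merges what every customer system knows.
billing_records = [{"customer_id": "C1", "phone": "555-0100", "mailing_address": "1 Main St"}]

def merge_by_key(*sources, key="customer_id"):
    """Naive merge of per-system records into one domain-level record per customer."""
    merged = {}
    for source in sources:
        for rec in source:
            merged.setdefault(rec[key], {}).update({k: v for k, v in rec.items() if v})
    return list(merged.values())

print(f"CRM only:    {completeness(crm_records):.0%}")  # 60% for this sample
print(f"Domain-wide: {completeness(merge_by_key(crm_records, billing_records)):.0%}")  # 100%
```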
Consistency: The degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use. Consistency can be evaluated among the data describing a single entity, across similar data for comparable entities, or both. This includes semantic consistency, referential integrity, data value consistency, and data format and database type consistency.
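Referential integrity is the easiest of these to check mechanically. The sketch below, with hypothetical tables and column names, simply measures the share of child rows whose foreign key resolves to a parent row; a similar pattern works for value and format consistency.

```python
# Minimal sketch: referential integrity as a consistency measure.
# Table contents and column names are hypothetical.

customers = [{"customer_id": "C1"}, {"customer_id": "C2"}]
orders = [
    {"order_id": "O1", "customer_id": "C1"},
    {"order_id": "O2", "customer_id": "C9"},  # orphan: no such customer
]

def referential_integrity(child_rows, parent_rows, fk, pk):
    """Fraction of child rows whose foreign key resolves to a parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    if not child_rows:
        return 1.0
    matched = sum(1 for row in child_rows if row[fk] in parent_keys)
    return matched / len(child_rows)

print(f"{referential_integrity(orders, customers, 'customer_id', 'customer_id'):.0%}")  # 50%
```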
Credibility: The degree to which data has attributes that are regarded as true and believable by users in a specific context of use. Credibility includes the concept of authenticity (the truthfulness of origins, attributions, commitments). This is referred to as validity in other data quality models. This characteristic also considers the credibility of the data dictionary, data model, and data sources (i.e., authoritative sources).
Currentness: The degree to which data has attributes that are of the right age in a specific context of use. This is referred to as timeliness in other data quality models. Although a characteristic of the data itself, it is strongly influenced by the source system implementation. For instance, a car passing through a toll reader generates a toll event instantaneously, but the system that processes those events might accumulate them throughout the day before processing them. Although potentially available immediately, from the perspective of downstream systems, “new” data can be up to twenty-four hours old. That is not a function of the data itself, but of the systems that process it.
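Currentness can be quantified as the share of records that are young enough for the use at hand. Here is a minimal sketch along those lines, assuming an event timestamp on each record and a business-defined freshness threshold of 24 hours; both assumptions are mine, not the standard’s.

```python
# Minimal sketch: currentness as the share of records newer than a freshness threshold.
# The 24-hour threshold and the sample timestamps are assumptions for illustration.
from datetime import datetime, timedelta, timezone

FRESHNESS_THRESHOLD = timedelta(hours=24)

def currentness(event_times, now=None, threshold=FRESHNESS_THRESHOLD):
    """Fraction of events whose age is within the required threshold."""
    now = now or datetime.now(timezone.utc)
    if not event_times:
        return 1.0
    fresh = sum(1 for t in event_times if now - t <= threshold)
    return fresh / len(event_times)

now = datetime.now(timezone.utc)
toll_events = [now - timedelta(hours=2), now - timedelta(hours=30)]  # one event from a stale batch
print(f"{currentness(toll_events, now=now):.0%}")  # 50%
```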
The Inherent characteristics largely align with the DAMA-DMBoK dimensions of data quality: accuracy, completeness, consistency, timeliness, uniqueness, and validity. The only one that’s missing from the ISO standard is uniqueness, which is obliquely addressed by the Efficiency characteristic discussed later.
System-dependent data quality characteristics, as the name suggests, describe the impact of computer systems components on the data. Quality is a function of the application implementation and the operational environment. For that reason, I consider these to be more Application Quality characteristics than data quality characteristics.
The system-dependent data quality characteristics are:
Availability: The degree to which data has attributes that enable it to be retrieved by authorized users and/or applications in a specific context of use. From the systems perspective, this is usually referred to as uptime.
Portability: The degree to which data has attributes that enable it to be installed, replaced or moved from one system to another preserving the existing quality in a specific context of use. This is a big one when looking ahead to the possibility of changing your repository, development framework, or cloud provider. How many database-specific, application development platform-specific, or cloud provider-specific capabilities are you using?
Recoverability: The degree to which data has attributes that enable it to maintain and preserve a specified level of operations and quality, even in the event of failure, in a specific context of use. This measures the success of backup and recovery processes. Today, data volumes make offline backups impractical and companies are increasingly choosing to deploy multiple, geographically dispersed instances of the data instead.
The remaining data quality characteristics share features that are both inherent and system dependent. Most are of the form “The data supports X” and “The technology supports X.” More or less. The overlapping data quality characteristics having this pattern are:
Accessibility: The degree to which data can be accessed in a specific context of use, particularly by people who need supporting technology or special configuration because of some disability. I understand the reasons for wanting to measure this, but it is not a function of the quality of the data. For example, a screen reader cannot (yet) interpret an image. That doesn’t mean, though, that the image data is low quality. The limitation is inherent to the problem itself. If textual data isn’t accessible to the visually impaired because the application doesn’t support assistive technology like a screen reader, that is an application design or requirements issue, not a data quality issue.
Compliance: The degree to which data has attributes that adhere to standards, conventions or regulations in force and similar rules relating to data quality in a specific context of use. Regulations may require that data have certain values or formats, perhaps for information interchange between entities. Business rules may also need to be incorporated into the applications that enforce compliance.
Confidentiality: The degree to which data has attributes that ensure that it is only accessible and interpretable by authorized users in a specific context of use. At this point, encrypting personal and confidential information and protecting computer systems from unauthorized access should be standard operating procedure.
Precision: The degree to which data has attributes that are exact or that provide discrimination in a specific context of use. Think significant figures from high school chemistry. Does the data have the right decimal format, and does the application use it? I believe, though, that this should be extended to incorporate ontological granularity, which is just a fancy way of saying that needing to know the location of a specific rooftop is different from needing to know the location of a country. Rooftop is greater precision than country.
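One way to check the decimal side of precision is to compare the digits actually carried by each stored value against the precision the use case requires. A minimal sketch follows, where the two-decimal-place requirement and the sample values are assumptions for illustration.

```python
# Minimal sketch: do stored values carry the decimal precision the use case requires?
# The two-decimal-place requirement and the sample values are assumptions for illustration.
from decimal import Decimal

REQUIRED_DECIMAL_PLACES = 2

def has_required_precision(value, places=REQUIRED_DECIMAL_PLACES):
    """True if the value (stored as text) carries at least `places` decimal digits."""
    exponent = Decimal(value).as_tuple().exponent
    return -exponent >= places

amounts = ["19.99", "7.5", "100"]
ratio = sum(has_required_precision(a) for a in amounts) / len(amounts)
print(f"{ratio:.0%}")  # 33% of the sample values meet the precision requirement
```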
Traceability: The degree to which data has attributes that provide an audit trail of access to the data and of any changes made to the data in a specific context of use. This has two parts. The first is logging user access. The second is logging data item value changes. With the latter, it would be easy to make the leap directly to lineage, but this actually refers to keeping a history of data element values. It’s like what we do for slowly changing dimensions. I suppose if the data changes as it moves between systems, then some knowledge of lineage is implied. But lineage as it is most commonly understood in data governance is not included in the standard.
I would recommend that capturing data lineage be added as a system-dependent characteristic of traceability.
Call it Tra-D-2 for those of you following along with the standard at home. Data movement traceability could be defined as “the possibility to trace the history of the movement of a data item between applications and systems using system capabilities.”
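To make the value-change half of traceability concrete, here is a minimal sketch of an audit trail that records who changed which data element, when, and from what value to what value. The field names and in-memory structure are illustrative, in the spirit of a slowly changing dimension, not anything prescribed by the standard.

```python
# Minimal sketch: a value-change audit trail for a data element.
# Field names and the in-memory "log" are illustrative, not prescribed by the standard.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ChangeRecord:
    entity_id: str
    attribute: str
    old_value: object
    new_value: object
    changed_by: str
    changed_at: datetime

@dataclass
class AuditedRecord:
    entity_id: str
    values: dict
    history: list = field(default_factory=list)

    def set(self, attribute, new_value, changed_by):
        """Apply a change and append it to the audit trail."""
        old_value = self.values.get(attribute)
        self.history.append(ChangeRecord(
            self.entity_id, attribute, old_value, new_value,
            changed_by, datetime.now(timezone.utc),
        ))
        self.values[attribute] = new_value

customer = AuditedRecord("C1", {"email": "ada@example.com"})
customer.set("email", "ada@newdomain.com", changed_by="crm_sync_job")
print(len(customer.history))  # 1 change recorded, with before/after values and actor
```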
The last two quality characteristics, efficiency and understandability, warrant a little more discussion.
Efficiency: The degree to which data has attributes that can be processed and provide the expected levels of performance by using the appropriate amounts and types of resources in a specific context of use.
This one is particularly interesting (to me at least) in that it seeks to measure the impact of design decisions on storage, processing, and user experience.
The inherent perspective focuses on syntactic and semantic ease of use. Numeric values stored as strings are harder to use than numeric values stored as integers or floating-point decimals. Distances stored in miles are harder to use in a country that uses the metric system than distances stored in kilometers.
Usability can be evaluated by comparing the time it takes experienced and novice users to complete the same task, but for the most part the definitions of “efficient” and “efficiency” are subjective. The potential for defining a set of objective guidelines is mentioned but not fleshed out. Someone has to decide whether the space or processing or time consumed was efficient or not.
The system-dependent perspective focuses on space, processing, and time. Text stored in fixed-length strings usually requires more space than text stored in variable-length strings. Numeric values stored as strings must be converted before mathematical operations can be applied.
Guidelines that inform most of these design decisions have been known to DBAs for decades, but they are not explicitly described in the standard.
The latency in data movement between systems is also considered part of efficiency. Unlike the currentness characteristic that measures latency relative to a business requirement, here latency is simply measured as a proxy that aggregates efficiency across all system components: data architecture, network, application, repository, etc.
If you’ve been keeping track of your traditional data quality dimensions, you may have noticed that one is still missing: uniqueness. We find it here. Quantifying the instances of duplicate records is an efficiency quality measure.
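Quantifying duplicates is straightforward once you have decided what counts as the same record. A minimal sketch, using an assumed matching key of name plus email:

```python
# Minimal sketch: duplicate-record ratio as an efficiency (uniqueness) measure.
# The choice of (name, email) as the matching key is an assumption for illustration.
from collections import Counter

records = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com"},   # duplicate
    {"name": "Grace Hopper", "email": "grace@example.com"},
]

def duplicate_ratio(rows, key_fields=("name", "email")):
    """Fraction of rows that are surplus copies of another row."""
    if not rows:
        return 0.0
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    surplus = sum(count - 1 for count in counts.values())
    return surplus / len(rows)

print(f"{duplicate_ratio(records):.0%}")  # 33% of the rows are duplicates
```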
When considering efficiency, it is critical to recognize that independently optimizing individual measures can reduce the efficiency of the system as a whole.
Minimizing the space consumed may increase the required processing and the overall complexity. The metrics all look good, but the system performs poorly and is hard to use.
These trade-offs are similar to those considered in data modeling. A pedantic data modeler may require strict third normal form in the logical database model. In practice, though, implementing that logical schema physically may require more repository space, processing, and network consumption. Plus, as an added bonus, you end up with increased complexity and difficulty for data consumers.
Finally, I would add one more perspective to consider: the use of summarization, aggregation, pre-calculation, and denormalization, especially for commonly used metrics. The space required may increase, but that would be offset by reduced runtime resource consumption and greater ease of use.
Efficiency optimization is both science and art.
That brings us to the final quality characteristic:
Understandability: The degree to which data has attributes that enable it to be read and interpreted by users, and are expressed in appropriate languages, symbols and units in a specific context of use. Some information about data understandability is provided by metadata.
If you’ve read any of my blog articles, or really pretty much anything I’ve written or presented about data and analytics, you know that I believe that understandability is fundamental.
The standard first addresses the understandability of the symbols, character set, and alphabet used to represent data. This is important for multi-language support.
Next comes the understandability of data elements. Ensuring that metadata is defined for all data elements is a measure within the completeness characteristic. Interestingly, the understandability of that metadata is not measured for all data elements, but only for master data, which is defined as “data held by an organization that describes the entities that are both independent and fundamental for an enterprise that it needs to reference in order to perform its transaction.”
This raises the question: Which data elements do we not need to understand?
If the data is being stored, then it is being consumed. And if the data is being consumed, then the consumer needs to understand what it means. (And if the data is not being consumed, then it does not need to be stored.)
What is measured across the board, though, is semantic understandability: the ratio of the number of data values defined in the data dictionary using a common vocabulary (read: business glossary) to the total number of data values defined in the data dictionary. This essentially evaluates the completeness of the business glossary and how well it is being used.
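That measure reduces to a simple ratio over the data dictionary. A minimal sketch, with made-up dictionary entries and a flag indicating whether each definition uses business glossary terms:

```python
# Minimal sketch: semantic understandability as glossary coverage of the data dictionary.
# The dictionary entries and the 'uses_glossary_terms' flag are assumptions for illustration.

data_dictionary = [
    {"element": "cust_id",  "uses_glossary_terms": True},
    {"element": "ord_amt",  "uses_glossary_terms": True},
    {"element": "src_cd_x", "uses_glossary_terms": False},  # defined, but not in business terms
]

def semantic_understandability(entries):
    """Ratio of entries defined using the common vocabulary to all defined entries."""
    if not entries:
        return 0.0
    return sum(e["uses_glossary_terms"] for e in entries) / len(entries)

print(f"{semantic_understandability(data_dictionary):.0%}")  # 67%
```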
Finally, the standard includes several subjective measures of the understandability of data values, data models, and data representations. All are quantified through interviews and questionnaires, or by “counting the number of users’ complaints.”
At its core, understandability is a measure of metadata quality.
That covers all of the ISO 25012 data quality characteristics and the ISO 25024 quality measures associated with them.
Unfortunately, many organizations are not prepared to implement these measures.
Why not? I’ll discuss that in next month’s Mind the Gap.