Data Integrity: What It Is and Why It Matters

The term “garbage in, garbage out,” or GIGO, dates back to the earliest days of commercial computing in the mid-20th century. Yet the concept was present more than 100 years earlier at the very dawn of computing. When Charles Babbage first described his difference engine, a member of Parliament asked him whether the machine could generate the “right answers” if the “wrong figures” were put in. Babbage later wrote that he was unable to “apprehend the kind of confusion of ideas that could provoke such a question.”

Today, entering “wrong figures” and expecting “right answers” can be disastrous for a company’s decision-making. The potential value of an organization’s data increases along with its volume, but with one huge caveat: the data’s full value can only be realized if it’s shown to be accurate, complete, timely, relevant, and analyzable. Data whose integrity hasn’t been confirmed risks polluting a business’s data-driven processes and compromising its operations.

What Is Data Integrity? 

Data integrity is both an attribute of data and a process. As an attribute, it describes data that is accurate, complete, consistent, and valid; as a process, it comprises the checks that confirm those qualities. Data integrity processes are designed to ensure that the information at the heart of an organization’s decision-making leads to reliable predictions, assessments, and actions.

The integrity of data has both physical and logical aspects:

Physical integrity involves safeguarding the data against damage or corruption caused by power outages, hardware failures, natural disasters, and other external events. The goal of physical integrity is to confirm that the data remains accessible and that it wasn’t altered during transmission, storage, or retrieval. Ensuring physical integrity relies on redundancy, disaster recovery, and fault tolerance; a brief checksum sketch after the list below shows one common way to detect such alteration.

  • Redundancy duplicates the data or other system components so an up-to-date backup copy is available in case of loss or damage.
  • Disaster recovery restores access to data that has been corrupted or lost due to an unexpected outage, storage device failure, or negligence on the part of data managers or users. Recovery typically relies on an off-site backup of the data.
  • Fault tolerance allows a data system to continue operating when a component fails. The goal is to maintain normal operation until the failure can be corrected, while reducing the likelihood of a crash by building redundancy into the system for its most critical functions.
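
As one illustration of checking physical integrity, a stored or backed-up copy can be compared against the original by computing cryptographic checksums of each; a mismatch signals that the data was altered somewhere along the way. Below is a minimal sketch using Python’s built-in hashlib module; the file paths are hypothetical.

    import hashlib

    def sha256_of(path: str) -> str:
        """Compute the SHA-256 digest of a file, reading it in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Hypothetical paths: record a checksum when the backup is made, then compare
    # it later to confirm the data was not altered in storage or in transit.
    original = sha256_of("orders_2024.csv")
    backup = sha256_of("/mnt/backup/orders_2024.csv")

    if original != backup:
        print("Integrity check failed: the backup no longer matches the original.")
    else:
        print("Checksums match: no corruption detected.")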

Logical integrity confirms that the data remains correct and consistent as it is used for various purposes within the database environment. It also protects the data from unauthorized changes and human error. It applies rules, constraints, and validation checks to prevent inconsistencies and preserve the data’s reliability. The four aspects of logical integrity are entity, referential, domain, and user-defined integrity; a short schema sketch after the list below shows how each can be expressed as a declarative constraint.

  • Entity integrity confirms that people, places, or things are accurately represented by their associated database elements. For example, the entity “orders” is a table whose rows represent individual orders. The requirement that each row be identified by a unique, non-null primary key is the “entity integrity constraint.” It prevents the same record from appearing multiple times and ensures that no primary key value is null.
  • Referential integrity governs the relationships of elements within and between tables as the data is transformed or queried. The goal is to maintain consistency when tables share data. For example, the “orders” table includes a customer ID column that references the primary key of the “customers” table; in “orders,” that column is a foreign key, and referential integrity requires every foreign key value to match an existing primary key in the referenced table.
  • Domain integrity applies to the data items in a table’s columns, each of which has a defined set of valid values, such as a five- or nine-digit number for a “ZIP code” column. Domain integrity is enforced by restricting the values that can be assigned to an instance of that column (its “attribute”), whether by checking the data type or another characteristic, such as a date or character-string format.
  • User-defined integrity refers to custom business rules that fall outside of entity, referential, and domain integrity. This allows organizations to define the constraints that will apply to the way data is used for particular purposes. An example is requiring that a “customer name” field have both a first and last name.
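
To make these four aspects concrete, the sketch below declares them as constraints on two hypothetical tables, “customers” and “orders,” using Python’s built-in sqlite3 module. The table and column names are assumptions for illustration, not a prescribed schema.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

    conn.execute("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,               -- entity integrity: unique, non-null key
            customer_name TEXT NOT NULL
                CHECK (customer_name LIKE '% %'),          -- user-defined rule: first and last name
            zip_code TEXT
                CHECK (length(zip_code) IN (5, 9))         -- domain integrity: five- or nine-digit ZIP
        )
    """)

    conn.execute("""
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,                  -- entity integrity
            customer_id INTEGER NOT NULL
                REFERENCES customers (customer_id),        -- referential integrity: foreign key
            quantity INTEGER CHECK (quantity > 0)          -- domain integrity: valid range
        )
    """)

    conn.execute("INSERT INTO customers VALUES (1, 'Ada Lovelace', '10001')")
    conn.execute("INSERT INTO orders VALUES (100, 1, 3)")        # satisfies every constraint

    try:
        conn.execute("INSERT INTO orders VALUES (101, 999, 3)")  # no such customer
    except sqlite3.IntegrityError as err:
        print("Rejected:", err)                                  # referential integrity violation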

How Data Integrity Differs from Data Quality

While data integrity focuses on the overall reliability of data in an organization, Data Quality considers both the integrity of the data and how reliable and applicable it is for its intended use. Preserving the integrity of data emphasizes keeping it intact, fully functional, and free of corruption for as long as it is needed. This is done primarily by managing how the data is entered, transmitted, and stored.

By contrast, Data Quality builds on methods for confirming the integrity of the data and also considers the data’s uniqueness, timeliness, accuracy, and consistency. Data is considered “high quality” when it ranks high in all these areas based on the assessment of data analysts. High-quality data is considered trustworthy and reliable for its intended applications based on the organization’s data validation rules.

The benefits of data integrity and Data Quality are distinct, despite some overlap. Data integrity allows a business to recover quickly and completely in the event of a system failure, prevent unauthorized access to or modification of the data, and support the company’s compliance efforts. By confirming the quality of their data, businesses improve the efficiency of their data operations, increase the value of their data, and enhance collaboration and decision-making. Data Quality efforts also help companies reduce their costs, enhance employee productivity, and establish closer relationships with their customers.

Data Integrity Best Practices

Implementing a data integrity strategy begins by identifying the sources of potential data corruption in your organization. These include human error, system malfunctions, unauthorized access, failure to validate and test, and lack of Governance. A data integrity plan operates at both the database level and business level.

Database integrity checks include referential integrity, unique constraints, data types, ranges, nullability, and check constraints; a brief application-level sketch of several of these checks follows the list below.

  • Referential integrity confirms the consistency and accuracy of interactions between the database’s tables, which is especially important for relational database tables relying on foreign keys.
  • Unique constraint determines whether the values in a column or set of columns are unique across all table rows. This prevents duplication of values.
  • Data type checks ensure a column’s data matches its type, such as an integer column containing only numbers.
  • Range checks confirm that column values fall within acceptable bounds, such as an age column accepting only values from 0 to 120.
  • Nullability checks make sure that columns defined as mandatory contain no null values, which can cause data corruption.
  • Check constraint prevents data from being added to the database until it has met all required criteria for ranges, formats, and conditions.
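
Most of these checks belong in the database schema itself, but they can also be applied in application code before data is loaded. The following is a minimal sketch of data type, range, nullability, and uniqueness checks over hypothetical order records; the field names and rules are assumptions for illustration.

    def validate_orders(rows):
        """Return a list of integrity problems found in the given rows."""
        errors = []
        seen_order_ids = set()
        for i, row in enumerate(rows):
            order_id = row.get("order_id")
            age = row.get("customer_age")

            if order_id is None:                          # nullability check
                errors.append(f"row {i}: order_id is required")
            elif order_id in seen_order_ids:              # unique constraint check
                errors.append(f"row {i}: duplicate order_id {order_id}")
            else:
                seen_order_ids.add(order_id)

            if not isinstance(age, int):                  # data type check
                errors.append(f"row {i}: customer_age must be an integer")
            elif not 0 <= age <= 120:                     # range check
                errors.append(f"row {i}: customer_age {age} is out of range")
        return errors

    rows = [
        {"order_id": 1, "customer_age": 34},
        {"order_id": 1, "customer_age": 200},        # duplicate key and out-of-range age
        {"order_id": None, "customer_age": "n/a"},   # missing key and wrong type
    ]
    print(validate_orders(rows))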

Business-level integrity checks include access controls, data backups, data validation, audits, employee training, and encryption and other security measures.

  • Access control trends include faster biometric systems, controls adapted to remote workforces, combined digital and physical access controls, mobile-based controls, and AI-assisted access management.
  • Data backups are getting a boost in the form of immutable backups that can’t be altered or deleted once they’ve been created. 
  • Data validation confirms data types, formats, the presence of data, data consistency, ranges and constraints, uniqueness, the syntax required by each field, and the existence of necessary files. Structured data validation combines several such checks to validate internal data processes.
  • Audits identify the scope and relevance of the data to be checked, as well as the risks to the data’s integrity. They address the needs of all stakeholders and include a review of the database architecture.
  • Employee training for data integrity covers correct data entry, validation of data from outside sources, removal of duplicate data, regular data backups, use of access controls, establishing audit trails, and collaborating with others within and outside the organization.
  • Encryption best practices include using strong encryption algorithms such as the Advanced Encryption Standard (AES), rotating encryption keys regularly, applying multi-factor authentication, and ensuring that data is encrypted both at rest (in storage) and in transit (during transmission); a brief sketch follows this list.
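
As a small illustration of encryption at rest, the sketch below uses the third-party cryptography package (not part of the standard library), whose Fernet recipe applies AES internally. Key generation, storage, and rotation are deliberately out of scope here, and the record contents are hypothetical.

    from cryptography.fernet import Fernet  # assumes: pip install cryptography

    key = Fernet.generate_key()        # in practice, keep this in a key management system
    cipher = Fernet(key)

    record = b"customer_id=1,name=Ada Lovelace,zip=10001"
    encrypted = cipher.encrypt(record)      # what actually gets written to disk
    decrypted = cipher.decrypt(encrypted)   # requires the key; tampering raises InvalidToken

    assert decrypted == record
    print(encrypted[:32], b"...")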

Data Integrity Challenges

Ensuring the integrity of the data in your organization requires attention to the human and machine aspects of Data Management. Data integrity can be hindered by a lack of integration among data sources and systems, which prevents companies from having a single, unified view of their data assets. It is also affected by reliance on manual data entry and collection, which introduces typos, omissions, and other errors.

A less obvious risk to the integrity of data is a failure to maintain audit trails that allow managers to track the history of the data from collection to disposal. Another threat to data integrity is reliance on legacy systems that may use incompatible file formats or lack critical functions. Such systems are the source of data silos and redundancies.

A lack of data integrity can lead to poor decision-making, reduced productivity, and increased risks, including data breaches, compliance violations, and dissatisfied customers. It can also cause legal liabilities that can be expensive and time-consuming to address.

One technique for minimizing these and other data integrity risks is to use redundant data storage that duplicates data operations in two separate locations and compares them to reveal inconsistencies. This reduces the likelihood of inconsistent data going unnoticed. Another is the application of versioning and timestamps as part of the auditing process to facilitate tracking changes and reverting to known-good data in the event of a failure.
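
A minimal sketch of the versioning-and-timestamp idea follows: each change is appended to a history rather than overwritten, so an earlier, known-good version can be recovered after a bad update. The record structure is hypothetical.

    from datetime import datetime, timezone

    history = []   # append-only list of versions for a single record

    def save_version(record: dict) -> None:
        """Record a new version along with the time it was saved."""
        history.append({"saved_at": datetime.now(timezone.utc), "data": dict(record)})

    def revert_to(version_index: int) -> dict:
        """Return a copy of an earlier, known-good version."""
        return dict(history[version_index]["data"])

    save_version({"customer_id": 1, "zip_code": "10001"})   # known-good version
    save_version({"customer_id": 1, "zip_code": "ABCDE"})   # corrupted update slips in

    print(revert_to(0))   # {'customer_id': 1, 'zip_code': '10001'}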

Knowing you can trust the accuracy, completeness, and consistency of the data your business relies on gives managers greater confidence in their analyses and resulting decisions. By integrating data integrity checks into your everyday business processes, you’re able to better maximize the revenue-generating potential of your organization’s data.