To say data has “integrity” means it can be trusted, relied upon, and put to use; the term also conveys a sense of unity and completeness. The greatest challenges to ensuring that data has integrity are therefore any characteristics or events that detract from the data’s usefulness, trustworthiness, and reliability, or that disconnect data elements intended to be cohesive or render the data incomplete.
A great number of factors can cause data to lose value. Common data integrity threats include errors in data entry, duplicated data, lack of timely updates, and physical or logical corruption.
Healthy data systems apply multiple data integrity checks to identify and correct errors, omissions, and ambiguities in data and to confirm that all of your organization’s data integrity requirements are met.
What Is Data Integrity?
Data integrity indicates that the data is accurate, complete, consistent, and valid. The more important data becomes to organizations of all types and sizes, the more dire the consequences of basing decisions on data whose integrity is questionable.
Data integrity is a vital component of data quality, which is a more expansive concept that considers how fit the data is for specific use cases. Data quality confirms the timeliness and uniqueness of the data as well as its accuracy, completeness, and validity.
The two broad categories of data integrity are physical and logical:
- Physical integrity relates to errors resulting from damage to the hardware the data is stored on. The errors can be caused by equipment failures, loss of power, and other external events.
- Logical integrity encompasses internal data processes within information systems, such as storage, retrieval, and processing. Relational databases check for referential integrity, which refers to the consistency of relationships between tables.
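As a minimal illustration of referential integrity, the sketch below uses Python’s built-in sqlite3 module; the customers and orders tables are hypothetical. The foreign-key constraint prevents an order from referencing a customer that does not exist, which is exactly the kind of consistency check a relational database enforces.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    )
""")

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO orders (id, customer_id, total) VALUES (10, 1, 250.0)")  # valid

try:
    # Violates referential integrity: customer 99 does not exist.
    conn.execute("INSERT INTO orders (id, customer_id, total) VALUES (11, 99, 99.0)")
except sqlite3.IntegrityError as exc:
    print(f"Rejected by the database: {exc}")
```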
Data integrity is related to but separate from data security, which focuses on preventing harm to data due to unauthorized access, theft, or damage. Data security applies safeguards to protect private data and covers data backup and recovery as well as data encryption, access controls, and intrusion detection systems.
Fixes for the Most Common Threats to Data Integrity
Software failures. The CrowdStrike outage that struck on July 19, 2024, is an extreme example of the damage that can result from a failed software update, but any application crash can wreak havoc on the integrity of a business’s data. The U.S. Federal Trade Commission points out that software crashes provide companies with an opportunity to enhance the resilience of their data systems by bolstering their APIs, which are key to preventing small, isolated software glitches from spreading to other systems within and outside the organization.
- Approaches to minimizing the occurrence and impact of software bugs include adopting development and maintenance processes with built-in bug detection. Common sources of failures include implementation bugs, which live in the code itself (a flaw in the app’s login page, for example), and specification bugs, which are introduced during the requirements and design phase of development and conflict with the app’s underlying security processes. An absent specification, such as a failure to specify that HTTPS must be used, can also be the source of a software failure.
- One technique that helps reduce bugs in software is test-driven development (TDD), in which tests are written before the code they exercise (see the sketch after this list). This allows each feature of the program to be tested as development proceeds and reduces the chances of a key component being untested or undertested prior to release.
- Similarly, continuous integration/continuous testing (CI/CT) automatically runs a specified suite of test cases against every code change before it is merged into the central repository.
- Lastly, behavior-driven development (BDD) relies on a domain-specific language (DSL) to enhance communication during software development by allowing tests to be written in plain English or another natural language rather than in code syntax.
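To make the test-first workflow concrete, here is a minimal sketch using Python’s built-in unittest module; the parse_amount function and its expected behavior are hypothetical. In TDD the tests are written first and fail, then just enough implementation is written to make them pass.

```python
import unittest

# Step 1: write the tests first. Running them before parse_amount is implemented
# fails, which is the expected starting point in test-driven development.
class TestParseAmount(unittest.TestCase):
    def test_parses_currency_string(self):
        self.assertEqual(parse_amount("$1,250.50"), 1250.50)

    def test_rejects_garbage(self):
        with self.assertRaises(ValueError):
            parse_amount("not a number")

# Step 2: write just enough implementation to make the tests pass.
def parse_amount(text: str) -> float:
    cleaned = text.strip().lstrip("$").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        raise ValueError(f"cannot parse amount: {text!r}")

if __name__ == "__main__":
    unittest.main()
```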
Network disruptions. An ISP outage or the failure of a router or other network device can corrupt an organization’s data by interrupting pending transactions, losing any unsaved application data, and making its systems more vulnerable to a cyberattack. Reducing the likelihood of a network outage and its potential impact begins with use of proactive network monitoring tools that measure the performance of individual applications and the network infrastructure as well as the operation of the network itself. The four most common types of network monitoring tools are performance monitors, availability monitors, traffic and bandwidth modelers, and security monitors:
- Network performance monitors collect data from such sources as the Simple Network Management Protocol (SNMP), network flow data, and packet data to provide an overall snapshot of the network’s current operational status.
- Availability monitors are also called uptime monitoring tools and feature real-time alerts of site or network outages, as well as automated downtime diagnostics. The products can track a network service’s uptime guarantees, confirm the validity of SSL certificates, and integrate with incident management systems.
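A bare-bones version of the checks these availability monitors automate might look like the following sketch, which uses only Python’s standard library; the host name and the alerting behavior are placeholders. It confirms that a TLS connection can be opened and reports how long the site’s SSL certificate remains valid.

```python
import socket
import ssl
from datetime import datetime, timezone

HOST = "example.com"  # placeholder host
PORT = 443
TIMEOUT = 5  # seconds

def check_availability(host: str, port: int) -> None:
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except OSError as exc:
        # A real monitor would raise an alert or open an incident here.
        print(f"ALERT: {host}:{port} unreachable ({exc})")
        return

    # Convert the certificate's 'notAfter' field into a timezone-aware datetime.
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    days_left = (expires - datetime.now(timezone.utc)).days
    print(f"{host} is up; certificate expires in {days_left} days")
    if days_left < 14:
        print("ALERT: certificate is close to expiry")

check_availability(HOST, PORT)
```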
Manual data entry and other human errors. To err may be human, but it’s also potentially disastrous for data integrity. Among the primary causes of data damage by people are accidental file deletions, social engineering schemes of malware purveyors, and data migrations that overwrite current data with its out-of-date predecessors. The mistakes made by IT staff can cause more damage than those of individual users, especially those related to misconfigured security settings and backup errors.
- Tips for preventing human errors that can lead to loss of data integrity include using root cause analysis to identify the processes that generate the most user mistakes and automating manual processes whenever possible. Errors can also be reduced through continuous employee training on data practices, stronger employee oversight and accountability, built-in process checks, and least-privilege access controls on the most sensitive and error-prone data operations.
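One of those built-in process checks can be a validation step that rejects obviously bad manual entries before they reach the database. The sketch below is illustrative only; the field names and rules are hypothetical.

```python
import re
from datetime import date

def validate_entry(record: dict) -> list[str]:
    """Return a list of problems found in a manually entered record."""
    problems = []

    if not record.get("customer_name", "").strip():
        problems.append("customer_name is required")

    email = record.get("email", "")
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        problems.append(f"email looks malformed: {email!r}")

    try:
        amount = float(record.get("amount", ""))
        if amount <= 0:
            problems.append("amount must be positive")
    except ValueError:
        problems.append(f"amount is not a number: {record.get('amount')!r}")

    entered = record.get("entry_date")
    if entered and entered > date.today().isoformat():
        problems.append("entry_date is in the future")

    return problems

# Reject the record (and tell the user why) instead of storing bad data.
errors = validate_entry({"customer_name": "", "email": "nobody", "amount": "-5"})
print(errors)
```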
Malware attacks. Most malware purveyors seek monetary gain by stealing data and either holding it for ransom, selling it to other criminals, or making it public, but many also attempt to damage or alter a business’s data. This includes changing the destination and amounts of a financial institution’s payments, applying microcharges to accounts that can go unnoticed, and inserting links to malware on public websites.
- Malware prevention combines antivirus and anti-malware software with regular software updates, scanning of email attachments and downloads, and use of network firewalls. Other steps are requiring strong passwords, enabling two-factor authentication (2FA), and performing regular data backups. Perhaps the most important method of keeping malware at bay is user education about how social engineering operates and how to spot and respond to possible attacks.
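Complementing those controls, downloaded files can be verified against a vendor-published checksum before they are opened or installed. A short sketch using Python’s hashlib follows; the file path and expected digest are placeholders.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder values: the real digest comes from the vendor's site over HTTPS.
download = Path("installer.exe")
expected = "0123456789abcdef" * 4

actual = sha256_of(download)
if actual != expected:
    raise SystemExit(f"Checksum mismatch for {download}: refusing to run it")
print(f"{download} matches the published checksum")
```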
Server crashes. Server maintenance is complicated by the 24/7 operation of most cloud and network server hardware. Among the most common causes of server failures are overheating, hardware defects, poor housekeeping in the server room, and power outages.
- Techniques for avoiding data damage due to a server crash include ensuring the scalability and performance of your relational DBMS, investing in redundant hardware such as RAID systems, sticking to a regular data backup schedule, and keeping your data recovery plan up to date.
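Sticking to a backup schedule only helps if the backups themselves are intact, so it is worth checking them automatically. The following sketch flags missing, empty, or stale backup files; the backup directory, file pattern, and retention window are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

BACKUP_DIR = Path("/var/backups/db")   # placeholder location
MAX_AGE = timedelta(days=1)            # expect at least one backup per day

def check_backups(backup_dir: Path) -> None:
    dumps = sorted(backup_dir.glob("*.dump"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        print(f"ALERT: no backups found in {backup_dir}")
        return

    latest = dumps[-1]
    age = datetime.now(timezone.utc) - datetime.fromtimestamp(
        latest.stat().st_mtime, tz=timezone.utc
    )
    if age > MAX_AGE:
        print(f"ALERT: newest backup {latest.name} is {age} old")
    if latest.stat().st_size == 0:
        print(f"ALERT: newest backup {latest.name} is empty")
    else:
        print(f"OK: {latest.name}, {latest.stat().st_size} bytes, {age} old")

check_backups(BACKUP_DIR)
```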
Lack of data integration. Few modern information systems operate in isolation, but as integration becomes a component of all data processing, incompatibilities, duplications, and other problems can corrupt individual data sets and hinder the operations that rely on them. These are among the most common integration problems:
- Failure to clearly define integration requirements. The requirements must consider the target system’s processes, as well as the expected data volume and transaction speed. Other considerations are end-to-end security, error and exception handling, and data contention (record-locking and change rollbacks).
- Poor integration design. The integration plan may use the wrong methodology or fail to identify potential technical limits. APIs must accommodate all integration use cases, and the integrated data has to scale as user and system requirements change.
- Insufficient integration infrastructure. Confirm that the middleware that allows data to run on different platforms is available and supports all required features. Both source and target systems have to support the protocols and techniques used to prevent unnecessary processing and resource consumption.
- Incompatibilities in the data itself. Before the integration, data sets must be checked to confirm they are complete and accurate to avoid introducing errors into the target system. The data has to match the defined input parameters, volumes, and update frequencies of each system, and must also consider underlying referential integrity needs.
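One practical way to catch such incompatibilities is to check each incoming record against the target system’s expected schema before loading it. The sketch below is a simplified illustration; the field names and types are hypothetical.

```python
# Expected schema of the target system (hypothetical field names and types).
EXPECTED_FIELDS = {
    "order_id": int,
    "customer_id": int,
    "order_date": str,   # ISO 8601 expected, e.g. "2024-07-19"
    "total": float,
}

def check_record(record: dict) -> list[str]:
    """Return integration problems found in one incoming record."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field} should be {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    unexpected = set(record) - set(EXPECTED_FIELDS)
    if unexpected:
        problems.append(f"unexpected fields: {sorted(unexpected)}")
    return problems

print(check_record({"order_id": "A17", "customer_id": 3, "total": 19.99}))
```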
Use of multiple analytics tools. Most companies use many different data analytics tools, but the products don’t always work well together. Common errors related to use of data analytics include incompatibilities between legacy and newer systems, data sources using different formats, and inconsistent content, such as downloads and installs being placed in the same category.
- Once data has been collected and aggregated, the first step in preparing it for analysis is cleaning and preprocessing to identify and repair low-quality records. Mismatches should be caught at collection, before any analysis is run, to prevent them from corrupting the results. Analytics tools function best with a uniform set of data formats and structures.
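A typical cleaning pass of this kind, sketched with pandas, normalizes formats, drops duplicates, and separates categories that one tool lumped together. The column names and the sample file are hypothetical.

```python
import pandas as pd

# Hypothetical export combining data from several analytics tools.
df = pd.read_csv("events.csv")

# Normalize inconsistent formats coming from different sources.
df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")
df["event_type"] = df["event_type"].str.strip().str.lower()

# "downloads" and "installs" were lumped together by one tool; keep them distinct.
df["event_type"] = df["event_type"].replace({"download/install": "download"})

# Remove exact duplicates and rows whose dates could not be parsed.
before = len(df)
df = df.drop_duplicates().dropna(subset=["event_date"])
print(f"Removed {before - len(df)} duplicate or unparseable rows")

df.to_csv("events_clean.csv", index=False)
```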
Insufficient data auditing. To be effective, data audits must have clear objectives, identify all data sources, map data flow, perform a data quality check, and confirm that adequate security and compliance measures are in place. Failure to plan and implement complete and consistent data audits allows errors to propagate and limits the effectiveness of data analyses.
- Successful data audits begin by establishing goals, setting the audit’s scope, and collecting all necessary tools and resources. Once all data and sources have been identified, a data quality assessment is conducted and the data’s security is reviewed. The audit determines how the data is used and managed and generates a report covering compliance, quality, and suggestions for improved management.
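The quality-assessment step of such an audit can be partly automated. The sketch below profiles missing values and duplicate keys as raw material for the audit report; the data set and column names are hypothetical.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, key_column: str) -> None:
    """Print a simple data quality profile for an audit report."""
    total = len(df)
    print(f"Rows: {total}")

    # Completeness: share of missing values per column.
    for column in df.columns:
        missing = df[column].isna().sum()
        if missing:
            print(f"  {column}: {missing} missing ({missing / total:.1%})")

    # Uniqueness: duplicated primary keys are an integrity problem.
    dupes = df[key_column].duplicated().sum()
    print(f"Duplicate {key_column} values: {dupes}")

# Hypothetical data set being audited.
customers = pd.read_csv("customers.csv")
quality_report(customers, key_column="customer_id")
```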
Reliance on outdated legacy systems. Businesses may underestimate the technical debt they incur by failing to upgrade their legacy information systems. This debt is the result of the implied cost of the added work required to use and maintain outdated technologies compared to their more advanced and more efficient replacements. In addition to their age, legacy systems often lack necessary documentation and expertise, which complicates updates and integrations.
- Eliminating the technical debt resulting from legacy systems begins with an assessment that inventories and documents existing hardware and software, followed by a plan for modernizing the systems to enhance compatibility, performance, maintenance, and security. Code can be restructured and optimized without altering its capabilities, and automated checks can be added to verify data integrity whenever data is transferred or integrated.
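One such automated check compares source and target after each transfer. The sketch below uses Python’s sqlite3 module to verify that row counts and an order-independent checksum still match; the database and table names are hypothetical, and both fingerprints must be computed in the same run because Python’s hash() is randomized per process.

```python
import sqlite3

def table_fingerprint(db_path: str, table: str) -> tuple[int, int]:
    """Return (row count, order-independent checksum) for a table."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    finally:
        conn.close()
    # Summing per-row hashes makes the checksum insensitive to row order.
    checksum = sum(hash(row) for row in rows) & 0xFFFFFFFF
    return len(rows), checksum

# Hypothetical legacy and modernized databases.
source = table_fingerprint("legacy.db", "invoices")
target = table_fingerprint("modern.db", "invoices")

if source != target:
    raise SystemExit(f"Integrity check failed: source={source}, target={target}")
print("Transfer verified: row counts and checksums match")
```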
Incompatible software. Data managers know that software incompatibilities can introduce security vulnerabilities into systems and impede performance. However, many organizations don’t realize the risk of data loss and corruption posed by incompatible software. Unpatched and poorly maintained software is more prone to the crashes and failures that are a primary cause of lost data.
- The best approach for reducing risks to data integrity caused by software incompatibility is to automate updates that patch security holes as quickly as possible after they’re discovered. Organizations should avoid using end-of-life (EOL) software, which are programs that are no longer supported by their vendors. They must also confirm that the updates are being downloaded only from trusted vendor sites over secure network connections.
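As one small piece of that automation, a script can flag outdated dependencies so patches are applied promptly. The sketch below assumes a Python environment and shells out to pip’s standard outdated-package listing; in a real pipeline the output would feed an alert or ticketing system rather than a print statement.

```python
import json
import subprocess
import sys

# Ask pip which installed packages have newer releases available.
result = subprocess.run(
    [sys.executable, "-m", "pip", "list", "--outdated", "--format=json"],
    capture_output=True, text=True, check=True,
)
outdated = json.loads(result.stdout)

if not outdated:
    print("All packages are up to date")
else:
    for pkg in outdated:
        print(f"{pkg['name']}: {pkg['version']} -> {pkg['latest_version']}")
    # This is where an update ticket or alert would be raised.
```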
Data drives business decisions, so the link between data integrity and higher-quality decisions is a straight line. Managers, employees, and customers have to trust the data that underlies all modern business processes, but once that trust is lost, it’s an uphill battle to regain it. A proactive approach to data integrity helps instill confidence in the data among data consumers within and outside the organization.