Over time, different types of data integrity systems and methods for promoting data integrity have been developed. Data integrity emphasizes confirming the data remains unchanged and consistent over the data’s entire lifecycle. In essence, the data remains pure and uncorrupted. Security plays an important role in ensuring the data is not altered and maintains its integrity. Data integrity, and the security needed to protect it from being altered, has become an important part of designing and implementing data systems.
The goal of data integrity is to ensure that data remains unaltered and to prevent any malicious or unintentional changes to the data.
By providing a high level of data integrity, or using uncorrupted data, an organization can improve its decision-making process regarding policies and resource allocation. Data integrity is an important aspect or subdivision of Data Quality, and ensures the data being used is unchanged and trustworthy. To improve and maintain data integrity, procedures can be implemented that prevent the data’s corruption.
Data integrity is normally focused on data that is generated in-house, rather than data that has been collected from outside sources for research purposes. However, data integrity is also a concern for data collected from outside sources (it’s just not as easy to control).
Data integrity is particularly useful in situations requiring the original information to remain accurate and unaltered, such as a contract. Other examples of situations requiring unaltered data are financial systems (for example, bank records) and healthcare databases (such as medical records). Data integrity is considered an important feature of database management systems.
As the amount of data being stored steadily increases, the impact of data integrity also increases. Several large businesses have become reliant on the ability to trust the content of their data, and data integrity is critical in developing trust in the business’s data.
Improving data integrity enhances a business’s ability to trust the data it uses to make decisions and honor agreements.
Data Quality and data integrity can be easily confused. While the two are related concepts, they have separate focuses and approaches. Data Quality deals with ensuring the data’s overall accuracy, reliability, and relevance, while data integrity focuses on ensuring the data remains unchanged and secure. In reality, data integrity is an important subdivision of Data Quality and necessary for a business’s success. In fact, it is so important, it is often considered a separate concern.
Focusing on both data integrity and Data Quality will help maintain a reliable and trustworthy data ecosystem that supports various business functions effectively.
Data Integrity in Different Industries
Below are examples of how data integrity matters across various industries:
Education: Educational institutions need historically accurate student records. Altering grades or graduation dates results in misinformation. Data integrity is also used for other purposes, such as managing enrollment and grant distribution.
Manufacturing: Data integrity is an essential feature in modern manufacturing. Small inconsistencies in the data can produce significant problems, making data integrity (saving the original designs and measurements) extremely important. Allowing the original files to become distorted could result in a disaster (downtime, chaos, the resetting of machinery).
Healthcare: Maintaining data integrity in a patient’s records is remarkably important. These records contain information about the patient’s allergies, medical history, etc., and are used in diagnosing and treating the patient. Missing information or errors can result in a misdiagnosis, incorrect treatments, or life-threatening prescriptions.
Finance: Financial institutions require precise transaction data for their decision-making processes. Data integrity helps financial institutions validate the accuracy of their accounts, and perform risk assessment and fraud detection. By ensuring the accuracy of their customers’ information, financial organizations can maintain regulatory compliance and protect their reputations.
Issues and Risks
Data integrity can be threatened and damaged by a variety of issues. One threat can damage trust, but a combination of corruptive influences can create total chaos.
Here are a few data integrity issues and risks many organizations face:
Compromised hardware: Power outages, fire sprinklers, or a clumsy person knocking a computer to the floor are examples of situations that can cause the loss of vital data or its corruption. Backup storage systems, such as hard drives or the cloud, can be used to save data (and the data’s integrity). In a security context, compromised hardware also refers to hardware that has been hacked.
Cyber threats: Cybersecurity attacks, such as phishing and malware, present a serious threat to data integrity. Malicious software can corrupt or alter critical data within a database. Additionally, hackers gaining unauthorized access can manipulate or delete data. If changes are made as a result of unauthorized access, it may be a failure in data security. As a business expands, security becomes more important.
Human error: A significant source of data integrity problems is human error. Mistakes that are made during manual entries can produce inaccurate or inconsistent data that then gets stored in the database.
Data transfer errors: During the transfer of data, data integrity can be compromised. Transfer errors can damage data integrity, especially when moving massive amounts of data during extract, transform, and load processes, or when moving the organization’s data to a different database system.
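One common way to detect transfer errors is to compare a checksum computed before the transfer with one computed afterward. The following is a minimal sketch using SHA-256 from Python's standard library; the sample data and the function names `sha256_hex` and `verify_transfer` are hypothetical and chosen only for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_transfer(source: bytes, received: bytes) -> bool:
    """Compare the digest of what was sent with the digest of what arrived."""
    return sha256_hex(source) == sha256_hex(received)


# Hypothetical payload being moved between systems.
original = b"order_id,amount\n1001,25.00\n1002,14.50\n"

# A faithful copy, and a copy with one character changed in transit.
intact_copy = bytes(original)
damaged_copy = original.replace(b"25.00", b"25.08")

print(verify_transfer(original, intact_copy))   # True
print(verify_transfer(original, damaged_copy))  # False
```

A matching digest only shows that the received bytes equal the bytes that were sent. It says nothing about whether the source data was correct in the first place.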
Data Integrity for Databases
There are five basic methods that support data integrity within the database. These methods can be used to maintain the health and integrity of the business’s digital information. The types of data integrity are listed below:
Physical integrity: This method of dealing with data integrity focuses on protecting data from being damaged or corrupted physically by hardware failures or the surrounding environment. Redundant storage systems can be used to prevent damage to the data, or corruption of the data.
Referential integrity: Provides consistency between tables in a relational database. It does this by requiring that foreign keys match the primary keys used in the corresponding tables. If an orders table references a customer’s identification from the customer table, referential integrity enforcement will stop a new order from using an invalid customer identification.
Entity integrity: An entity can be any thing, person, or place recorded within a database. In the database, each table represents an entity, and each row of the table represents an instance of that entity. Identifying each row in a table requires a primary key, and each primary key value must be unique. This prevents duplicate records in a table.
Domain integrity: A database’s domain integrity refers to the standardized ways in which data is entered and read, and to the variations that are not allowed. For example, when a database stores monetary values, two decimal places (for cents) are acceptable, but three are not.
User-defined integrity: This method deals with the customization of business requirements and rules that are not normally covered by other forms of data integrity. User-defined integrity supports rules and constraints that have been created by the user to perform specific tasks.
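Several of the database-level methods above can be expressed as table constraints. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and the allowed status values are assumptions made for illustration, and storing money as whole cents is just one possible way to rule out a third decimal place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- entity integrity: unique row identifier
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    -- referential integrity: must match an existing customer
    customer_id  INTEGER NOT NULL REFERENCES customers(customer_id),
    -- domain integrity: whole, non-negative cents only
    amount_cents INTEGER NOT NULL
                 CHECK (typeof(amount_cents) = 'integer' AND amount_cents >= 0),
    -- user-defined integrity: a hypothetical business rule
    status       TEXT NOT NULL CHECK (status IN ('open', 'paid', 'cancelled'))
);
""")


def try_insert(sql: str, params: tuple) -> str:
    """Run one INSERT and report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


add_customer = "INSERT INTO customers (customer_id, name) VALUES (?, ?)"
add_order = ("INSERT INTO orders (order_id, customer_id, amount_cents, status) "
             "VALUES (?, ?, ?, ?)")

results = {
    "first customer": try_insert(add_customer, (1, "Ada")),
    "duplicate primary key": try_insert(add_customer, (1, "Bob")),
    "valid order": try_insert(add_order, (10, 1, 1999, "open")),
    "unknown customer": try_insert(add_order, (11, 99, 1999, "open")),
    "fractional cents": try_insert(add_order, (12, 1, 19.995, "open")),
    "unknown status": try_insert(add_order, (13, 1, 500, "lost")),
}

for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

Only the first customer and the valid order are accepted. The remaining inserts violate the entity, referential, domain, and user-defined rules, in that order. Physical integrity is not shown because it depends on hardware and storage redundancy rather than on schema definitions.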
Data Integrity in the Cloud
The integrity of data stored within a cloud system is based on the assumption that data should not be modified, lost, or distorted by unauthorized users.
Cloud storage services offer massive data storage space, which must be considered when developing plans for preserving data integrity. Two systems for maintaining data integrity have been developed: Proof of Retrievability (POR) and Provable Data Possession (PDP). Both systems are designed to accomplish the same goal: ensuring the data integrity of outsourced data stored within the cloud.
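The general idea of auditing outsourced data can be illustrated with a toy spot check. The data owner keeps a secret key, tags each block before upload, and later challenges the store for blocks to verify. This is not an implementation of POR or PDP, which are considerably more elaborate. The block size, the `cloud` dictionary standing in for a storage service, and all function names are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

BLOCK_SIZE = 16  # tiny blocks so the example stays readable


def split_blocks(data: bytes) -> list:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def tag_block(key: bytes, index: int, block: bytes) -> bytes:
    """Tag a block together with its position so blocks cannot be swapped."""
    message = index.to_bytes(8, "big") + block
    return hmac.new(key, message, hashlib.sha256).digest()


# Owner side, before upload: the key never leaves the owner.
key = secrets.token_bytes(32)
data = b"patient=17;allergy=penicillin;blood=O+;updated=2024-01-05"
blocks = split_blocks(data)
tags = [tag_block(key, i, block) for i, block in enumerate(blocks)]

# Hypothetical cloud store: holds blocks and tags, but not the key.
cloud = {"blocks": list(blocks), "tags": list(tags)}


def audit(store: dict, key: bytes, sample_size: int) -> bool:
    """Challenge randomly chosen blocks and verify each block/tag pair."""
    count = len(store["blocks"])
    chosen = secrets.SystemRandom().sample(range(count), min(sample_size, count))
    return all(
        hmac.compare_digest(
            tag_block(key, i, store["blocks"][i]), store["tags"][i]
        )
        for i in chosen
    )


# Audit every block so the outcome here is deterministic.
ok_before = audit(cloud, key, len(blocks))

# Simulate an unauthorized modification of one stored block.
cloud["blocks"][1] = b"X" * BLOCK_SIZE
ok_after = audit(cloud, key, len(blocks))

print(ok_before, ok_after)
```

Checking every block detects the change with certainty. Sampling only some blocks would reduce the work per audit but detect a single altered block only with some probability.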
Issues with Data Lakes and Warehouses
Generally speaking, data integrity deals with data that is generated in-house, and doesn’t, or shouldn’t, change.
Data stored in a data warehouse or lake is often collected from outside sources and used for purposes of research. As a consequence, the biggest concern regarding the integrity of data stored in a data lake or warehouse is the state of the data before it is collected.
When gathering data from outside sources, a process called ETL (extract, transform, and load) is normally used. During the ETL process, massive amounts of data stream into the data lake or warehouse from a variety of sources. Within this collected data, corrupted or altered copies of the original data files may be collected along with the originals. This may distort the integrity of data that has been collected for research purposes.
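One way to surface altered copies during loading is to hash the content of each incoming record and flag identifiers whose copies disagree. The sketch below assumes each record arrives with a stable identifier, which many real feeds will not provide. The batch contents, feed names, and function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def content_hash(payload: str) -> str:
    """Return a SHA-256 hex digest of the record's content."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical extracted batch: (record_id, source, payload).
batch = [
    ("doc-1", "feed-a", "temperature=21.5"),
    ("doc-1", "feed-b", "temperature=21.5"),  # exact duplicate of doc-1
    ("doc-2", "feed-a", "temperature=19.0"),
    ("doc-2", "feed-c", "temperature=91.0"),  # copy that differs from feed-a
    ("doc-3", "feed-b", "temperature=20.1"),
]


def find_conflicts(records) -> list:
    """Return record ids whose copies do not all share one content hash."""
    hashes_by_id = defaultdict(set)
    for record_id, _source, payload in records:
        hashes_by_id[record_id].add(content_hash(payload))
    return sorted(rid for rid, seen in hashes_by_id.items() if len(seen) > 1)


conflicts = find_conflicts(batch)
print(conflicts)  # ['doc-2']
```

A check like this only shows that copies disagree. Deciding which copy is the original still requires a trusted reference, such as a checksum published by the source.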