Fundamentals of Data Classification

The process of data classification can be broadly described as the organization of data into relevant categories, allowing it to be accessed and protected more efficiently. In the simplest terms, the data classification process ranks data based on its security needs and makes it easier to locate and retrieve data. Classification is especially useful to organizations storing very large amounts of data.

Data classification can be used for multiple purposes: data security initiatives, maintaining regulatory compliance, and meeting other business objectives. In some situations, data classification has become a regulatory requirement, with the data being made available to government agencies, who demand it be searchable and retrievable within designated time frames. Because data classification supports easy and efficient searches and data collection, data analysis becomes a more efficient process.

Julia Duncan, a director at the University of Toronto, explained,

“Data is all around us. Data classification helps us to understand the most appropriate ways of handling and protecting it – who can see or use it, where to store it and for how long, whether it can be shared and what protective measures are most appropriate. Whether it is for a research project, as part of data collection, or a day-to-day data use and its sharing for academic and administrative purposes, data classification is a very important step as we continue to strengthen data security.”

The data classification process also eliminates the duplication of data, which, in turn, improves the accuracy of the data (data quality and data integrity).

Data tagging, which is applied during the data classification process, is considered an essential step. Tags are used to identify the data and can communicate both its level of confidentiality or sensitivity (for security purposes) and its level of data quality. The sensitivity of data determines its security rating.

Data Tagging

Data tagging identifies data by including the tag within the metadata. A “tag” is a keyword, number, or term that is assigned to a data file. In a business, an employee ID can provide a unique way of identifying individual employees. When the employee number is entered, the search engine presents a single employee, rather than multiple employees sharing a common keyword.

Similarly, in a soccer game, a seat number can be used to communicate the assignment of a seat to a specific ticket, establishing temporary ownership. A tagging system within the metadata promotes locating and accessing a data file quickly and easily, and can eliminate any confusion about who “owns” the seat.

Data tagging uses metadata to provide a unique identification process, promoting efficiency.

Tagging data is an essential step in the data classification process. The tags are used to communicate the type of data, its level of sensitivity, and its level of data quality. Sensitivity is normally based on the importance or confidentiality of the data, and aligned with the appropriate security measures needed.
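As a rough illustration of the tagging idea described above, the following Python sketch stores tags in a metadata dictionary and looks a record up by a unique tag. The class name, the tag keys, the file paths, and the employee ID are all hypothetical and chosen only for this example; real tagging tools will have their own metadata formats.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedFile:
    """A data file plus the tags kept in its metadata (hypothetical structure)."""
    path: str
    tags: dict = field(default_factory=dict)


def tag_file(record, data_type, sensitivity, quality):
    """Attach the three tags the text describes: type, sensitivity, and quality."""
    record.tags.update({
        "data_type": data_type,
        "sensitivity": sensitivity,
        "quality": quality,
    })
    return record


def find_by_tag(records, key, value):
    """Return every record whose metadata carries the given tag value."""
    return [r for r in records if r.tags.get(key) == value]


# Example catalog; paths and tag values are made up for illustration.
catalog = [
    tag_file(TaggedFile("hr/employee_1042.json"), "employee_record", "confidential", "verified"),
    tag_file(TaggedFile("web/press_release.txt"), "article", "public", "verified"),
]

# A unique tag, such as an employee ID, narrows a search to a single record.
catalog[0].tags["employee_id"] = "1042"

matches = find_by_tag(catalog, "employee_id", "1042")
print([m.path for m in matches])
```

Searching on a shared tag such as "sensitivity" can return many files, while searching on a unique tag returns exactly one, which is the point of the employee ID example.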

Common Types of Data

Data classification can provide both an improved understanding of the organization’s data and easier access to it. This, in turn, promotes the use of data analysis and improved data security. The effective use of data classification can help an organization with massive amounts of stored data to function more efficiently.

To better understand how data classification works, it is important to understand the most common types of data, which are listed below:

Public data: Provides information that is freely available to the general public to read, research, and store. It typically requires minimal data security, because it is easily shared and carries little risk of harming individuals or the general public. Examples of public data include people’s names, news and educational articles, and some government websites.

Private data: Contains information that should not be shared with the public. Sharing this type of information – passwords, browsing/research history, credit card numbers (without PINs and expiration dates) – might present a small risk to an individual or organization, but the damage can usually be corrected quickly.

Internal data: Normally, this describes the data used specifically within an organization and relates to an organization’s internal functions. Examples of internal data include business plans, employees’ personal information, emails, and memos. Internal data is often spread out over different levels of security.

Confidential data: Only a limited number of individuals within the organization can access confidential data (sometimes referred to as “sensitive data”). Access to confidential data might require specialized passwords or retinal scans in order to view the content. Examples of confidential data are social security numbers, medical records, and credit card numbers with PINs and expiration dates.

Restricted data: This is data that, if compromised, can lead to massive legal fines or criminal charges. It typically has very strict security controls to limit access to the data, and often uses some form of data encryption. If it is accessed by people with malicious intent, an organization’s proprietary information could be copied, or made inaccessible, with demands for a ransom. Restricted data may also have the potential to put the general public’s health at risk. Examples of restricted data include intellectual property, protected health information, and some federal contracts.
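One way to make the five categories above usable in software is to treat them as ordered levels. The sketch below is only an illustration: the least-to-most-sensitive ordering follows the order in which the categories are listed here, which is an assumption, and the access and encryption rules are hypothetical policies rather than requirements drawn from any standard.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """The five categories, ordered least to most sensitive (an assumed ordering)."""
    PUBLIC = 0
    PRIVATE = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


def can_access(user_clearance, data_level):
    """Hypothetical rule: a user may read data at or below their clearance."""
    return user_clearance >= data_level


def requires_encryption(level):
    """Illustrative policy only: encrypt anything classified as restricted."""
    return level >= Sensitivity.RESTRICTED


print(can_access(Sensitivity.INTERNAL, Sensitivity.CONFIDENTIAL))
print(requires_encryption(Sensitivity.RESTRICTED))
```

Using an integer-backed enumeration lets levels be compared directly, so a policy can be written as a threshold instead of a list of special cases.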

Methods of Data Classification

The process of data classification normally includes tagging to communicate the type of data, its corresponding security level, and its data quality.

Basically, three types of data classification have been developed:

Content-based data classification: This often focuses on sensitive information – financial records, personally identifiable information – and uses software to inspect and interpret files while looking for sensitive information.

Context-based data classification: Uses software that examines contextual information about a file, such as the application that produced it, its source location, or its creator, as indirect indicators of how the file should be classified and where it should be stored.

User-based data classification: A manual process that requires the person performing the task to have an understanding of data classification. This form of data classification is significantly slower, and much more error-prone, than the content-based and context-based data classification systems, which use software.
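The difference between the two software-driven methods can be sketched in a few lines of Python. This is a simplified illustration, not how any particular product works: the regular expressions are deliberately naive, the metadata keys ("application", "source_path"), the "payroll_system" name, and the "unclassified" label are hypothetical, and the returned labels follow the categories described earlier (a card number on its own is treated as private, a social security number as confidential).

```python
import re

# Simplified, illustrative patterns; real tools use far more robust detection.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def luhn_valid(number):
    """Luhn checksum, used to discard digit runs that cannot be card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def classify_by_content(text):
    """Content-based: inspect the text itself for sensitive values."""
    if SSN_PATTERN.search(text):
        return "confidential"
    if any(luhn_valid(m.group()) for m in CARD_PATTERN.finditer(text)):
        return "private"
    return "unclassified"


def classify_by_context(metadata):
    """Context-based: look only at where the file came from, not what it contains."""
    if metadata.get("application") == "payroll_system":
        return "confidential"
    if metadata.get("source_path", "").startswith("/public/"):
        return "public"
    return "unclassified"


print(classify_by_content("SSN: 123-45-6789"))
print(classify_by_context({"application": "payroll_system"}))
```

The content-based function has to read every file, while the context-based function needs only metadata, which is one reason the two approaches are often combined.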

Datamation has published a review of classification software tools for 2024.

Compliance Standards and Data Classification

A growing number of countries, and some states in the U.S., have created regulations and compliance standards that require businesses and organizations to establish a data classification system. Requirements may vary, depending on the country, the organization, and the types of data it is using. Listed below are some examples of why compliance can be a concern.

General Data Protection Regulation (GDPR): Europe’s efforts to protect its citizens’ privacy resulted in regulations that require businesses to classify all their collected data. The GDPR is concerned with data related to race, health care, political opinions, ethnic origin, and the use of biometrics. (Businesses that are not storing massive amounts of data can use a fairly simple classification system – the goal is to provide the requested data to EU officials in a fast and efficient manner.)

Payment Card Industry Data Security Standard (PCI DSS): This standard was created by the credit card industry. Its Requirement 9.6.1 stipulates that businesses and organizations must “classify data so that sensitivity of the data can be determined.” PCI DSS is not a law, but a legal agreement.

Health Insurance Portability and Accountability Act (HIPAA): This is a U.S. federal law. It considers protected health information (PHI) to be confidential information, and requires medical facilities to protect the medical records of individuals. The HIPAA Privacy Rule restricts the use and disclosure of protected health information, and requires medical facilities and their associates to develop a data classification system.

California Consumer Privacy Act (CCPA): The CCPA states that “data classification should identify which data types are sold, shared with third parties, or used for marketing purposes. Any rights requests for specific data types should also be recorded in the data inventory as proof that you’re CCPA compliant.”

It is important for organizations to research legal concerns, or consult expert advice, when doing business over the internet.

The Challenges of Classifying Data

The data classification process is very useful in terms of security and data retrieval. However, some problems may develop. Some of the common challenges are:

False positives: This takes place when the same data appears in different contexts and different formats, and the software doesn’t recognize it as a duplicate. Classification software that does not examine the data’s context and format has a higher probability of generating false classifications. Because large amounts of data are normally used in classification projects, even an extremely small false positive rate may distort the classification process.

False negatives: These occur as a result of confusion regarding context. For example, a name would not normally be considered sensitive information. However, when it is part of a medical record, that name becomes sensitive information. Classifying data without an understanding of its context can cause data to be incorrectly classified.

The cost: The price of implementing and operating data classification tools will depend on the number of controls established and the amount of data being processed. Data classification can become quite expensive and cumbersome. Manual efforts to classify large amounts of data can be extremely expensive, with larger amounts of data costing more.
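The point that even a very small false positive rate can distort results on large data sets can be checked with simple arithmetic. The sketch below uses the conventional definition of a false positive (a non-sensitive item wrongly flagged as sensitive), and every number in it is a hypothetical chosen for illustration: ten million files, 1% of them truly sensitive, a 0.5% false positive rate, and a classifier that catches 95% of the sensitive files.

```python
def expected_false_positives(total_items, sensitive_fraction, false_positive_rate):
    """Number of non-sensitive items expected to be flagged as sensitive."""
    non_sensitive = total_items * (1 - sensitive_fraction)
    return non_sensitive * false_positive_rate


def flagged_precision(total_items, sensitive_fraction, detection_rate, false_positive_rate):
    """Share of flagged items that really are sensitive."""
    true_positives = total_items * sensitive_fraction * detection_rate
    false_positives = expected_false_positives(
        total_items, sensitive_fraction, false_positive_rate
    )
    return true_positives / (true_positives + false_positives)


# Hypothetical scenario, not measured data.
TOTAL_FILES = 10_000_000
SENSITIVE_FRACTION = 0.01
FALSE_POSITIVE_RATE = 0.005
DETECTION_RATE = 0.95

wrongly_flagged = expected_false_positives(TOTAL_FILES, SENSITIVE_FRACTION, FALSE_POSITIVE_RATE)
precision = flagged_precision(TOTAL_FILES, SENSITIVE_FRACTION, DETECTION_RATE, FALSE_POSITIVE_RATE)

print(f"Files wrongly flagged: {wrongly_flagged:,.0f}")
print(f"Share of flagged files that are truly sensitive: {precision:.1%}")
```

Under these assumed numbers, roughly 49,500 files are flagged in error, so about a third of everything the classifier marks as sensitive is not, and each of those files still costs review time or unnecessary protection.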

ChatGPT is being experimented with as a tool for classifying data, but there are concerns about the system’s lack of security.
