Modern Data Archiving: Managing Explosive Unstructured Data Growth

By Steve Leeper

As unstructured data creation rates have soared, the timeframe for active use of data has shrunk due to edge computing, IoT systems, machine-generated data, and, let’s not forget, GenAI. The active-use period today has largely been reduced to around 30 to 90 days, after which the flood of newly arriving data renders existing data less useful or even redundant.

This constant influx of incoming data forces near-relentless storage system expansion in an increasingly futile attempt to keep up. A modern data archiving strategy is therefore paramount to managing storage and hybrid-cloud systems.

Data archiving is an often misunderstood term, and that confusion affects everyone involved. It is important to be clear about exactly what archiving is and what it is not. For example, tiering and archiving are two very different things. Think of archiving as a removal company packing physical documents from file cabinets into boxes and transporting them to an offsite facility for long-term storage; once the job is done, the moving company leaves and is no longer needed. When the files need to be retrieved, anyone with permission can access them.

In contrast, tiering is akin to a specialized librarian who stays permanently in place, moving one file at a time into a unique filing system from which only the librarian knows how to retrieve each file. Tiering is essentially another name for hierarchical storage management (HSM). Tiering and HSM solutions have been tried in many forms over the years, and they inevitably tend to cause more pain than benefit.

Archiving vs. NAS Cloud Gateways

NAS cloud gateways provide a global file system and, therefore, global access via the public cloud to files traditionally stored in on-premises NAS storage systems. Because the gateway device maintains all file-related metadata in its global file system, it is sometimes claimed to serve as an archive front end.

But while the NAS gateway considers the content archived, this is not true archiving: the metadata lives inside the gateway, so any access to the data in the event of a recall must be arbitrated by the gateway (very similar to tiering).

Both tiering and gateways present companies with an important concern: What happens if the solution is decommissioned or the vendor goes out of business? How do you retrieve data when the application is no longer available?

Archiving vs. the Archive Storage Platform

The “act” of archiving data and the platform upon which the archived data will be stored are two different things. While the platform represents an important decision to be made when creating archiving policies, there is more to archiving than just the target platform.

The “act” of archiving data involves making critical decisions about what to archive and then moving it based on policies. The right data needs to be found among billions of files based on criteria such as the length of time since a file was last accessed, the length of time since it was last modified, or its ownership by a specific user ID.
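As a rough illustration only (no particular vendor’s logic, and with stand-in thresholds), those selection criteria reduce to a simple predicate over a file’s timestamps and owner:

```python
from dataclasses import dataclass
import time

SECONDS_PER_DAY = 86_400

@dataclass
class ArchiveCriteria:
    # Stand-in thresholds; real values come from corporate policy.
    max_days_since_access: int = 1095   # roughly three years
    max_days_since_modify: int = 1095
    target_uid: int | None = None       # archive everything owned by this user ID

def is_archive_candidate(atime: float, mtime: float, uid: int,
                         criteria: ArchiveCriteria) -> bool:
    """Return True if a file's stat values satisfy the archive criteria."""
    now = time.time()
    if criteria.target_uid is not None and uid == criteria.target_uid:
        return True
    stale_access = (now - atime) > criteria.max_days_since_access * SECONDS_PER_DAY
    stale_modify = (now - mtime) > criteria.max_days_since_modify * SECONDS_PER_DAY
    return stale_access and stale_modify
```

A real selection engine evaluates predicates like this across billions of stat records, but the per-file decision comes down to comparisons this simple.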

Identifying these proverbial needles in the haystack, moving them to a suitable archiving platform, and having a vendor-neutral way to retrieve the data quickly are key to a modern archiving strategy.

A Modern Archiving Strategy

When creating a modern archiving strategy, it is important to consider whether to leverage an active archive, a deep archive, or a combination of the two.

An active archive is used for data that has a modest or reasonable chance of needing recall. A deep archive is used for data that has very little chance of needing recall but must be retained for regulatory compliance or internal governance reasons. The deep archive can also become the next destination for active-archive data that has passed a threshold, defined by corporate policy, dictating its movement into the deep archive.
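One way to picture the relationship between the tiers is as an age-based classification. The sketch below is illustrative only; the 90-day and three-year cutoffs are stand-ins for whatever corporate policy actually dictates:

```python
import time

SECONDS_PER_DAY = 86_400

# Stand-in cutoffs; corporate policy sets the real values.
ACTIVE_ARCHIVE_AFTER_DAYS = 90    # past the typical 30-to-90-day active-use window
DEEP_ARCHIVE_AFTER_DAYS = 1095    # roughly three years: compliance/governance retention

def archive_tier(last_access: float) -> str:
    """Classify a file by age since last access: live, active archive, or deep archive."""
    age_days = (time.time() - last_access) / SECONDS_PER_DAY
    if age_days >= DEEP_ARCHIVE_AFTER_DAYS:
        return "deep-archive"
    if age_days >= ACTIVE_ARCHIVE_AFTER_DAYS:
        return "active-archive"
    return "live"
```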

The first step in archiving is to get insight into the profile of your files. Luckily, there is ample storage-system-assigned metadata that can indicate when content was created, when it was most recently accessed, when it was last modified, and even whether a file is owned by an active or inactive user (orphaned files).
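On a POSIX system, this metadata is what a stat call returns. The sketch below is Unix-specific (it uses pwd to resolve owners) and assumes access times are trustworthy, which they may not be on volumes mounted with noatime; note also that on Linux st_ctime records the last metadata change, not creation time:

```python
import os
import pwd  # Unix-only: resolves numeric owner IDs to account names

def profile_file(path: str) -> dict:
    """Gather the storage-assigned metadata an archiving policy typically keys on."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = None  # no matching account: a likely orphaned file
    return {
        "path": path,
        "size_bytes": st.st_size,
        "last_accessed": st.st_atime,  # may be frozen on noatime-mounted volumes
        "last_modified": st.st_mtime,
        "last_changed": st.st_ctime,   # on Linux: last metadata change, not creation
        "owner_uid": st.st_uid,
        "owner": owner,
    }
```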

Policies can be created to dictate when files that meet certain criteria get relocated to the archive platform. For example, files that have not been accessed or modified within the last three years are relocated to the archive platform.
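Applying that example policy is conceptually just a tree walk and two timestamp comparisons. This single-threaded sketch (a production scanner at billion-file scale would need parallelism and checkpointing) also tallies how much capacity the rule would free; the mount point is hypothetical:

```python
import os
import time

THREE_YEARS = 3 * 365 * 86_400  # the example policy window, in seconds

def cold_data_share(root: str) -> tuple[int, int]:
    """Return (cold_bytes, total_bytes) under root for the three-year rule."""
    cutoff = time.time() - THREE_YEARS
    cold = total = 0
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total += st.st_size
            # Neither accessed nor modified within the window: an archive candidate.
            if st.st_atime < cutoff and st.st_mtime < cutoff:
                cold += st.st_size
    return cold, total

# "/mnt/nas/projects" is a hypothetical mount point, for illustration only.
cold, total = cold_data_share("/mnt/nas/projects")
if total:
    print(f"{cold / total:.0%} of {total / 1e12:.1f} TB qualifies for archiving")
```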

Many organizations are surprised to find that over 60% of their stored data falls into this category, and with petabytes of data and billions of files in storage, that quickly adds up. Once the data to be archived has been identified, it is important to move it to the new platform in the most efficient and least disruptive manner.

This means using a solution that is fast, scalable, and, most importantly, doesn’t lock you into keeping it in place permanently for recall and migration purposes. In most circumstances, each archiving event is akin to a migration: a point-in-time activity that should be one-and-done, not an arduous ongoing trial with unacceptable future consequences. With these considerations front of mind, organizations can manage unstructured data growth with confidence.