When training AI models, the accuracy of the resulting application depends on the quality of the training data it receives. Feeding a model more data than it needs is costly, and feeding it too little results in a poor model. When using AI, you want your results quickly and at minimal cost, and the best way to get them is to supply only the data you need. Yet, given the size of unstructured data (multiple petabytes in most enterprises) and its distribution across storage silos, it’s difficult to curate and segment specific data sets.
Enter metadata, which is data about data. Metadata is created automatically by storage technologies and offers better insights on your data, such as: who owns the data, what file type it is, where it lives, who accessed it, and so on. This system-level information is extremely useful for managing data, but it lacks the additional context that users and applications often have.
Additional metadata can enrich this information, for example by tagging data by its contents (clinical images showing breast cancer versus pancreatic cancer, or images of celebrities or alumni), by sensitivity, by project, geography, or demographics (research on females in the Northeast region), or by initiative (manufacturing test data from product X in 2022). Metadata brings structure to unstructured data, which makes it far easier to find the right data for use in AI tools.
The Benefits of Machine Learning Augmentation of Metadata
Managing and enriching metadata is a time-consuming process that requires collaboration between IT and departments – data scientists and data owners – to tag data accurately. Tagging adds metadata to your file data in the form of key-value pairs, which give context to your data. For example, a single file might carry several tags: Country = US, Project ID = 123, HIPAA = TRUE. Yet manually tagging large data sets is virtually impossible. Machine learning-based automation will play a growing and important role in these efforts. Here’s how:
- Machine learning algorithms can help identify and correct errors or inconsistencies in metadata, improving its overall quality.
- Machine learning can help automatically tag and categorize data, improving its search, usability, and manageability.
- Enriched metadata delivers new possibilities for business insights from AI, such as sentiment analysis of customer service interactions or discovering new causes of a common medical condition.
- Machine learning can improve compliance, by identifying data that is not secured or stored according to regulations or analyzing data access patterns that may be in violation of corporate policies.
- Automation can deliver efficiencies and cost savings through reduced manual effort and fewer errors in managing metadata.
- Competitive advantage from better overall use of data to make more informed decisions or even to unlock new revenue streams. The lion’s share of enterprise data is not leveraged for any purpose but hidden away in storage silos and consuming expensive storage capacity. Metadata can enhance data quality and make data more discoverable for new uses.
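To make the key-value tagging and the first point above concrete, here is a minimal sketch of cleaning up inconsistent tags. It is rule-based and stands in for what a machine learning tool might do at scale. The `FileRecord` class, the `CANONICAL_COUNTRIES` table, and the file path are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical lookup of canonical country spellings (an assumption for this
# sketch; a real system would configure or learn these mappings).
CANONICAL_COUNTRIES = {"us": "US", "usa": "US", "united states": "US"}


@dataclass
class FileRecord:
    """A file plus its user-defined tags, stored as key-value pairs."""
    path: str
    tags: dict[str, str] = field(default_factory=dict)


def normalize_tags(tags: dict[str, str]) -> dict[str, str]:
    """Return a copy with trimmed keys and consistent country/boolean values."""
    cleaned = {}
    for key, value in tags.items():
        key = key.strip()
        value = value.strip()
        if key.lower() == "country":
            # Map known variants to one spelling; leave unknown values as-is.
            value = CANONICAL_COUNTRIES.get(value.lower(), value)
        elif value.lower() in ("true", "false"):
            # Store booleans in one consistent case.
            value = value.upper()
        cleaned[key] = value
    return cleaned


record = FileRecord(
    "/share/scans/a.dcm",
    {"Country ": "usa", "Project ID": "123", "HIPAA": "true"},
)
record.tags = normalize_tags(record.tags)
print(record.tags)
```

After normalization, the record carries the same three tags used in the example earlier: Country = US, Project ID = 123, HIPAA = TRUE.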
Enriching metadata is much more effective with a data management system that can provide that information no matter where the data lives. This way, you do not have to run the AI/ML algorithm again each time you need the additional context. The enriched metadata lives as long as the data lives. A storage-agnostic data management system can maintain an index of this metadata as your data moves from one storage system to another and provide a simple way to search, curate, and extract the right data based on this enhanced metadata.
Industry Examples
Name an industry and you can imagine how metadata augmentation can deliver powerful benefits. Let’s look at the auto sector. Electric and autonomous vehicles collect large quantities of sensor data, which helps the car adjust and take actions on the fly or issue alerts to the driver. This data is a gold mine for manufacturers, who can analyze it for product enhancements and customer behavior insights.
Using an unstructured data management system, a car manufacturer could create a workflow like this:
- Find crash test data related to the abrupt stopping of a specific vehicle model
- Use an AI tool to identify and tag test data with “Reason = Abrupt Stop”
- Move only the related data to a cloud service for analysis
- Delete the unrelated data or move it to another cloud service for archives
- The process could run continuously as needed
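The steps above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the `FileRecord` fields, the `classify_reason` placeholder (which inspects the file name instead of calling a real AI tool), and the file paths are all hypothetical, and the sketch returns lists of paths rather than actually moving or deleting anything.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Hypothetical index entry: system metadata plus user-defined tags."""
    path: str
    model: str
    kind: str
    tags: dict = field(default_factory=dict)


def classify_reason(record):
    """Placeholder for an AI tool; here it only inspects the file name."""
    return "Abrupt Stop" if "abrupt" in record.path.lower() else None


def run_workflow(index, model):
    """Tag matching crash test files, then split them into two sets."""
    to_analyze, to_archive = [], []
    for record in index:
        # Step 1: keep only crash test data for the chosen vehicle model.
        if record.model != model or record.kind != "crash_test":
            continue
        # Step 2: let the (stand-in) AI tool decide whether to tag the file.
        reason = classify_reason(record)
        if reason:
            record.tags["Reason"] = reason
            to_analyze.append(record.path)   # Step 3: send for analysis
        else:
            to_archive.append(record.path)   # Step 4: archive or delete
    return to_analyze, to_archive


index = [
    FileRecord("/tests/x1_abrupt_stop_001.csv", "X1", "crash_test"),
    FileRecord("/tests/x1_side_impact_002.csv", "X1", "crash_test"),
    FileRecord("/tests/y2_abrupt_stop_003.csv", "Y2", "crash_test"),
]
analyze, archive = run_workflow(index, "X1")
print(analyze, archive)
```

Because the function is idempotent over the index, it could be scheduled to run repeatedly as new test data arrives.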
Here are other examples:
- Improving customer support: Consider a technology company that uses a machine learning program to run sentiment analysis on call center recordings. The results, such as customer satisfaction scores, are recorded to each audio file with a tag. Now employees can find relevant audio recordings for training and managers can improve best practices.
- Medical imaging search: A hospital could apply machine learning to medical images like MRIs, X-rays, and CAT scans and then tag the images with diagnosis codes. Researchers can then find images by diagnosis to support their projects.
- PII detection and protection: Personal data such as HR files, patient data, and financial information could be present within a small subset of the billions of files under management at an enterprise. There’s no easy way to find and isolate it continuously. But if a machine learning program like Amazon Macie analyzes data sets for PII, and a data management system then tags the matching files as “PII” and sends them to secure, immutable storage (or deletes them, when possible), the enterprise saves considerable time and reduces the risk of a breach and fines.
- Image search: The marketing leader at a university wants to find images for different campaigns and delete images in the university’s content library that might be inappropriate. The department can use an image AI program that analyzes and tags the images with relevant identifiers so that they can be easily discovered later when needed for different projects. The new metadata tags are stored in a data management system and follow the files even if they move to new storage. The same process could apply to lab images in genomics processing.
- Surveillance/law enforcement: Unstructured data, such as body cam and dash cam video, along with social media posts and text messages, are important pieces of evidence for criminal investigations. During a case, those files are in active use but once a case has been closed, they may be hard to find later if the case reopens or if there is a need to analyze them for new purposes – such as for crime prevention, training, or for use in research projects to improve safety. AI can analyze files and tag them as needed to support those future initiatives.
- Copyright protection via metadata: A hot-button topic with generative AI is that copyrighted materials such as artwork, images, or books wind up in the training data of programs like ChatGPT. Lawsuits over this issue have been on the rise. One possible solution is to use tools like Digimarc, which allow copyright owners to apply metadata in the form of a digital watermark to their works. AI models can then detect the watermark before ingesting the material.
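As a rough illustration of the PII detection example above, the sketch below scans text for two kinds of personal data and tags the file accordingly. It uses simple regular expressions as a stand-in for a detection service such as Amazon Macie and does not call any real service API. The pattern set, the function names, and the sample text are assumptions made for this example, and real PII detection needs far broader coverage.

```python
import re

# Deliberately simple patterns (assumptions for this sketch only).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN format
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # basic email shape
}


def scan_for_pii(text):
    """Return the set of PII types whose pattern appears in the text."""
    return {name for name, pattern in PII_PATTERNS.items() if pattern.search(text)}


def tag_if_pii(tags, text):
    """Return a copy of the tags, adding PII = TRUE when anything is found."""
    updated = dict(tags)
    found = scan_for_pii(text)
    if found:
        updated["PII"] = "TRUE"
        updated["PII types"] = ",".join(sorted(found))
    return updated


sample = "Employee SSN 123-45-6789, contact jane@example.com"
print(tag_if_pii({"Department": "HR"}, sample))
print(tag_if_pii({"Department": "HR"}, "Quarterly parking schedule"))
```

Once files carry a PII tag, a data management system could select them by that tag and route them to secure storage without rescanning their contents.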
Technical Considerations
A metadata augmentation project can get out of hand quickly. If you create too many new tags, you must store and manage them appropriately to avoid performance issues with user access. Most IT organizations will need to implement automation for metadata management, given the volume and variety of metadata today.
It’s best to use software that uses a combination of queries and tags. Queries deliver results for common inquiries such as: “Show me all data owned by this department that has been accessed in the last six months.” Users can create any custom queries based on the available metadata. Tags are not needed to save these queries but are used only to enhance the available metadata information using machine learning or user-driven inputs. This query plus tag approach maximizes efficiency, saves time, and eliminates the issue of tag proliferation.
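A minimal sketch of the query plus tag approach follows. The query filters on system metadata (owning department and last access time) and only optionally narrows the result by tags, so no tag has to be created just to save a common query. The record layout, the `query` function, and the sample files are hypothetical, and the reference date is fixed so the example is reproducible.

```python
from datetime import datetime, timedelta

# Hypothetical index entries: system metadata plus optional user tags.
files = [
    {"path": "/rad/a.dcm", "owner_dept": "radiology",
     "last_accessed": datetime(2024, 5, 1), "tags": {"HIPAA": "TRUE"}},
    {"path": "/rad/b.dcm", "owner_dept": "radiology",
     "last_accessed": datetime(2023, 1, 15), "tags": {}},
    {"path": "/fin/c.xlsx", "owner_dept": "finance",
     "last_accessed": datetime(2024, 6, 1), "tags": {}},
    {"path": "/rad/d.dcm", "owner_dept": "radiology",
     "last_accessed": datetime(2024, 3, 10), "tags": {}},
]


def query(records, dept, accessed_since, required_tags=None):
    """Filter on system metadata first, then on any required tags."""
    required_tags = required_tags or {}
    return [
        r["path"]
        for r in records
        if r["owner_dept"] == dept
        and r["last_accessed"] >= accessed_since
        and all(r["tags"].get(k) == v for k, v in required_tags.items())
    ]


# "All data owned by this department accessed in the last six months",
# measured from a fixed reference date (roughly 182 days).
now = datetime(2024, 6, 30)
six_months_ago = now - timedelta(days=182)

recent = query(files, "radiology", six_months_ago)
recent_hipaa = query(files, "radiology", six_months_ago, {"HIPAA": "TRUE"})
print(recent, recent_hipaa)
```

The first call needs no tags at all, while the second shows how an enriched tag narrows the same query further.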
It’s also wise to be selective about metadata augmentation. Even with the help of machine learning tools and other systems, it takes time and resources to curate the right data for enrichment, monitor the results for accuracy, and safeguard the data from misuse. It also takes work with data stakeholders to ensure that additional metadata serves their needs rather than making an AI project more complex or producing false or inaccurate findings. Yet, by spending time and using the right tools and resources to understand and properly leverage metadata, IT leaders and data stakeholders can lay the groundwork for a stronger, more relevant AI and big data analytics program.