With the flood of data that organizations are experiencing, metadata management is no longer optional; it has become a necessity. Metadata management as a discipline is fairly new: before this rush of data, older metadata services had no significant problems locating data files. Now they do.
At its most basic, metadata is a small amount of data used to identify and describe a larger data package. It serves as an abbreviated description that allows search engines and other tools to find the requested information; its primary purpose is to help find and retrieve data. Metadata can include a file’s title, a person’s name (the author or owner), the organization’s name, the name of the source computer, and so on.
Metadata can be generated automatically, using keywords drawn from the data itself, or it can be written manually, which gives more control over the description.
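As a rough illustration of automatic generation, the sketch below builds a small metadata record for a plain-text file by combining file attributes with a naive keyword heuristic. The field names, the keyword rule, and the example file name are illustrative assumptions, not a standard.

```python
import os
import re
from collections import Counter
from datetime import datetime, timezone

def generate_metadata(path, keyword_count=5):
    """Build a small metadata record from file attributes plus simple keyword extraction."""
    stat = os.stat(path)
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()

    # Naive keyword heuristic: the most frequent words of five or more letters.
    words = re.findall(r"[a-zA-Z]{5,}", text.lower())
    keywords = [word for word, _ in Counter(words).most_common(keyword_count)]

    return {
        "title": os.path.basename(path),
        "size_bytes": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "keywords": keywords,
    }

# Hypothetical usage:
# print(generate_metadata("quarterly_report.txt"))
```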
Forms of Metadata
Metadata can be used to communicate the nature, the structure, and the context of the data. There are several distinct forms of metadata, primarily based on the researcher’s needs. Some of the more common metadata descriptions are:
- Copyright metadata: CMI, or copyright management information, can be listed in the metadata of images, literature, etc.
- Descriptive metadata: It is used for both discovery and identification, and includes information such as the title, an abstract, the author, and useful keywords.
- Reference metadata: Communicates information about the contents and the quality of statistical data.
- Administrative metadata: This information can describe the resource type, the permissions, as well as how and when it was created.
- Statistical metadata: This is used to describe the processes of collecting and/or producing statistical data.
- Structural metadata: Includes information about the types, versions, relationships (for example, how pages are ordered), and other structural features of the digital materials.
- Accessibility metadata: Describes the accessibility of services and resources.
Metadata can be located in a variety of places. When metadata is used for databases, it is traditionally stored within the tables and fields of the database itself. When it is used for files, websites, and images, it is often located in the source code. Sometimes the metadata lives in a dedicated store, such as a metadata repository or a data dictionary.
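For example, relational databases expose their own metadata through built-in catalog structures. The sketch below uses SQLite’s sqlite_master table and table_info pragma to read table- and column-level metadata; the customers table is purely illustrative.

```python
import sqlite3

# In-memory database for illustration; a real system would point at its own catalog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# Table-level metadata lives in the built-in sqlite_master catalog table.
for name, sql in conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name, "->", sql)

# Column-level metadata (names, types, constraints) comes from the table_info pragma.
for column in conn.execute("PRAGMA table_info(customers)"):
    print(column)  # (cid, name, type, notnull, default_value, pk)
```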
Versions of metadata have been used for thousands of years in libraries. (In the movie “Sahara,” from 2005, in the library scene, the library’s ancient scrolls were categorized using major events, such as floods and wars. Tags were attached to the scrolls – the metadata – and listed specific dates.)
Metadata can take a variety of different formats and standards, ranging from free text to structured, standardized, machine-readable formats.
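As a sketch of what the structured end of that range can look like, the record below combines several of the metadata forms listed above into one machine-readable JSON document. The field names loosely echo common descriptive-metadata conventions but are not tied to any particular standard, and the values are invented.

```python
import json

# Illustrative record combining descriptive, administrative, and structural metadata.
record = {
    "descriptive": {
        "title": "2023 Customer Churn Analysis",
        "creator": "Analytics Team",
        "keywords": ["churn", "retention", "customers"],
    },
    "administrative": {
        "resource_type": "dataset",
        "created": "2023-06-01",
        "permissions": "internal-only",
    },
    "structural": {
        "format": "parquet",
        "version": 3,
        "partitioned_by": ["region", "month"],
    },
}

print(json.dumps(record, indent=2))  # structured, machine-readable metadata
```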
New Metadata Management Solutions
As the use of cloud data warehouses, data lakes, data lakehouses, and other cloud storage systems continues to expand, identifying metadata has become more difficult. When data is gathered from outside sources (secondhand data), a whole dataset is often collected to access a single, small piece of information. The entire dataset is then saved for context, and to preserve useful information that might have been missed during the initial scan of the data. The freedom to store ever-increasing amounts of data in the cloud has allowed some very large collections of data to develop.
Until recently, metadata was generally ignored; it was not a problem until the amounts of data being stored and processed became massive. As a consequence, metadata solutions did not keep up with the extreme volumes of data in use, which had an unexpected impact on the ability to locate data on the internet and in cloud storage. Storage problems continue to be an issue.
The modern data stack has evolved significantly in the last decade, but not too surprisingly, in the early stages of its development, metadata was ignored. Data stacks have, however, supported some recent advances in metadata management, such as:
- Shifting from passive metadata to active metadata: Passive (traditional) metadata is stored in a static data catalog. The new concept of “active” metadata, on the other hand, allows metadata to flow quickly through the entire data stack. Enriched context is embedded in every tool within the data stack. Metadata is shared, cross-checked, and associated with other data, automatically interlinking data within the network.
- Third-generation data catalogs: Data catalogs are quite similar to the old-style card catalogs people once used to find books in libraries. Third-generation data catalogs are designed for massive amounts of metadata. Previous generations treated data files as discrete, disconnected units with no relationships; third-generation catalogs are built on top of a knowledge graph that focuses on the connections and relationships between the data (a minimal sketch of this idea follows the list).
- Data fabric: Uses a model that includes active metadata and transforms data into a uniform format before storing it. Data fabric requires the use of metadata to locate, identify, and interlink the desired data files.
- Data mesh: This model relies on a philosophy of only storing and sharing “uniform” data within a community to simplify and streamline its use. Data mesh is increasingly using active metadata to interlink, identify, and locate the desired data files.
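The knowledge-graph idea behind third-generation catalogs can be sketched with nothing more than assets as nodes and relationships as edges. Everything below (the asset names, the relationship types, the lineage rule) is an invented, minimal illustration rather than any vendor’s implementation.

```python
# Data assets are nodes; (relationship, target) pairs are edges.
catalog = {
    "raw.orders": [],
    "staging.orders": [("derived_from", "raw.orders")],
    "marts.revenue": [("derived_from", "staging.orders"),
                      ("reported_in", "dashboards.finance")],
    "dashboards.finance": [],
}

def upstream(asset, graph=catalog):
    """Walk 'derived_from' edges to find everything an asset depends on."""
    lineage = []
    for relation, target in graph.get(asset, []):
        if relation == "derived_from":
            lineage.append(target)
            lineage.extend(upstream(target, graph))
    return lineage

print(upstream("marts.revenue"))  # ['staging.orders', 'raw.orders']
```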
Metadata Visualization
The ability to visualize the associations and relationships created during the storage process can be a remarkably useful tool for providing a big-picture perspective. Listed below are some platforms that provide metadata visualizations through the use of dashboards:
- The Alation Data Catalog supports visualizations, reports, and analytics. The platform is described as using machine learning to index and identify a wide variety of data sources including relational databases, cloud data lakes, and file systems.
- The Precisely Data360 Govern offers customizable dashboards. It uses integrated Data Governance tools that include data cataloging, data lineage, business glossaries, and metadata management.
- The Informatica Metadata Manager supports dashboards and uses knowledge graphs. This platform displays relationships by applying AI and machine learning. Active metadata provides the foundation of this platform.
- The Octopai Platform stores and manages metadata in a central repository. Its smart engine uses hundreds of crawlers to search all of the metadata and present results quickly. Octopai is considered a good fit for business intelligence, Data Governance, and data cataloging.
- The Oracle Enterprise Metadata Management platform uses Kibana dashboards through the Peoplesoft database. This platform can collect and catalog metadata from all sources, and it provides algorithms that list the metadata assets of the data sources.
The Need for Functional Metadata Storage
Metadata storage needs significant improvement and is the weakest link in metadata management. Generally speaking, metadata is never separated from its file: accessing the metadata means going through the file first.
The most basic problem in storing metadata appears to be the lack of an automated system that copies the metadata from its file and stores it separately, so it can be accessed easily at a later date.
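One plausible shape for such a system, assuming a simple file share, is sketched below: scan a directory, copy basic attributes into a standalone SQLite store, and then query that store without opening the original files. The paths, schema, and chosen attributes are assumptions for illustration only.

```python
import json
import os
import sqlite3

def index_directory(directory, db_path="metadata_store.db"):
    """Copy basic file metadata into a separate SQLite database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_metadata "
        "(path TEXT PRIMARY KEY, size_bytes INTEGER, modified REAL, extra TEXT)"
    )
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            stat = os.stat(path)
            extra = json.dumps({"extension": os.path.splitext(name)[1]})
            conn.execute(
                "INSERT OR REPLACE INTO file_metadata VALUES (?, ?, ?, ?)",
                (path, stat.st_size, stat.st_mtime, extra),
            )
    conn.commit()
    return conn

# Hypothetical usage: the metadata can later be searched on its own.
# conn = index_directory("/data/reports")
# print(conn.execute("SELECT path FROM file_metadata WHERE size_bytes > 1000000").fetchall())
```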
The Future of Metadata Management
Steve Todd at Dell Technologies presented the idea of metadata lakes in 2015, for insurance companies. The idea was picked up and expanded upon by Prukalpa Sankar, a cofounder of Atlan, who wrote an article in 2021 describing how a metadata lake might work. She outlined three characteristics a metadata lake should have:
- It would use open application programming interfaces (APIs), making the metadata lake easy to access and promoting its use as a “single source of truth” within modern data stacks.
- It would be powered by a knowledge graph. The potential of metadata is released when the connections and relationships between data files are displayed. The knowledge graph is a very effective tool for showing these interconnections and relationships.
- It would support both humans and machines. The metadata lake should be user-friendly (locating data easily and presenting it in context) and include automation (auto-tuning data pipelines, for example). These features should be built into the fundamental architecture (a toy sketch combining these three ideas follows this list).
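A toy sketch of those three characteristics together might look like the class below: one store with an open, programmatic interface, graph-style links between assets, and output aimed at both machines (raw records) and humans (readable summaries). All names and fields are invented for illustration.

```python
class MetadataLake:
    def __init__(self):
        self.assets = {}   # asset name -> metadata dict
        self.links = []    # (source, relation, target) triples

    def register(self, name, **metadata):
        self.assets[name] = metadata

    def link(self, source, relation, target):
        self.links.append((source, relation, target))

    def get(self, name):
        """Machine-facing: raw metadata plus the asset's relationships."""
        related = [(r, t) for s, r, t in self.links if s == name]
        return {"name": name, "metadata": self.assets.get(name, {}), "links": related}

    def describe(self, name):
        """Human-facing: a readable summary of the same record."""
        record = self.get(name)
        lines = [f"{name}: {record['metadata']}"]
        lines += [f"  -> {relation} {target}" for relation, target in record["links"]]
        return "\n".join(lines)

lake = MetadataLake()
lake.register("sales.orders", owner="data-eng", freshness="daily")
lake.link("sales.orders", "feeds", "dashboards.revenue")
print(lake.describe("sales.orders"))
```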
However, the basic problem of copying the metadata from its file and storing it separately in the metadata lake will still need to be resolved before metadata lakes become a reality.