In the days of building data warehouses and BI systems, all that was captured as true “metadata” was data about the creation and access of the data structures for audits, data about the execution of data loads, and so on; primarily, whatever supported the “operations” of the data warehouse. During the data modeling phase, descriptions of the entities were captured in the modeling tool to help with discussions, but once the model was approved and implemented, nobody really bothered to keep them updated.
Some of these descriptions made it into the physical columns, since most database products support comments on physical structures. Attempts were then made to reverse engineer the physical data structures and publish their details in a portal, both to ease design work for the ETL developers and to help the BI teams query the warehouse for report generation. These were the real foundations of capturing “operational,” “business,” and “technical” metadata, respectively, though the metadata played more a passive role of infrastructure support than anything actionable.
I never realized that one day the same metadata would become as “powerful” and “actionable” as the data it describes. In fact, without this metadata, even the data it describes loses its power, as there is no real “context” for that data. Implementing Data Governance initiatives for my clients over the past five years has only deepened this realization.
In this blog, let me share what some of these real “actions” are, based on those implementation experiences; we will delve into the details of “how” in subsequent blogs.
The list below describes some of the “actions” that various types of metadata can drive.
Data Literacy: With the increasing need for users to “find” and “understand” the data they consume from various data lakes and data stores, the business metadata that describes this data becomes particularly important. Data catalogs provide this literacy by tagging all content with the right glossary elements, captured either through collaboration or by auto-tagging through ML algorithms.
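To make the auto-tagging idea concrete, here is a minimal sketch assuming a simple name-matching approach: a glossary term is suggested for a column when its keywords overlap with tokens in the column name. The glossary entries and column names are hypothetical, and production catalogs use ML models trained on profiled data rather than name matching alone.

```python
# Hypothetical glossary: business term -> keywords that hint at it.
GLOSSARY = {
    "Customer Name": {"customer", "name"},
    "Email Address": {"email", "address"},
    "Order Date": {"order", "date"},
}

def suggest_tags(column_name, min_overlap=1):
    """Suggest glossary terms whose keywords overlap the column name."""
    tokens = set(column_name.lower().replace("_", " ").split())
    matches = []
    for term, keywords in GLOSSARY.items():
        overlap = len(tokens & keywords)
        if overlap >= min_overlap:
            matches.append((term, overlap))
    # Strongest matches first
    return [term for term, _ in sorted(matches, key=lambda m: -m[1])]

print(suggest_tags("cust_email_address"))  # ['Email Address']
```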
Data Protection: Increasingly, organizations face the challenge of complying with data privacy and protection regulations such as the GDPR and CCPA. Data classification algorithms drive the identification of sensitive data and tag the content with business metadata, such as PII tags. Organizations can then use this metadata to drive access policies for the data.
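As an illustration of the classification step, the sketch below tags a column as PII when enough of its sampled values match a regular expression. The patterns, sample values, and threshold are assumptions made for the example; real classifiers combine rules like these with ML models and data profiling.

```python
import re

# Hypothetical rule set: PII tag -> pattern that detects it.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_column(sample_values, threshold=0.5):
    """Tag a column as PII if enough sampled values match a pattern."""
    if not sample_values:
        return set()
    tags = set()
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for value in sample_values if pattern.search(value))
        if hits / len(sample_values) >= threshold:
            tags.add(tag)
    return tags

# The resulting tags can then drive masking or access policies.
print(classify_column(["jane@example.com", "bob@example.org", "n/a"]))
# {'EMAIL'}
```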
Data Redundancy Check: As important as building trust in data is, there is also a need to improve Data Management efficiency so the data infrastructure delivers the agility the business needs. Eliminating data redundancies created by older practices is one way of doing this, and the business metadata tagged to the data is one means of identifying those redundancies.
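One simple way to act on those tags is sketched below, under the assumption that each catalog entry carries a set of business tags: datasets sharing an identical tag set are surfaced as redundancy candidates for a data steward to review. The catalog entries are hypothetical.

```python
from collections import defaultdict

# Hypothetical catalog: dataset -> set of business tags.
catalog = {
    "sales.orders_2023": {"Order", "Revenue", "Customer"},
    "archive.orders_copy": {"Order", "Revenue", "Customer"},
    "hr.employees": {"Employee", "Salary"},
}

# Group datasets by their exact tag set; groups of two or more
# are candidates for consolidation.
groups = defaultdict(list)
for dataset, tags in catalog.items():
    groups[frozenset(tags)].append(dataset)

for datasets in groups.values():
    if len(datasets) > 1:
        print("Possible redundancy:", datasets)
# Possible redundancy: ['sales.orders_2023', 'archive.orders_copy']
```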
Analytics Governance: One key focus area for organizations is ensuring that consumers of analytics see trustworthy content they can use to drive business decisions. Technical metadata captured from the data sources through to the reporting/analytics platforms can be tied together to understand data provenance and, hence, the quality of the output. This can drive concepts like “report certification” that improve trust in the final content consumed.
Data Provenance and Impact Analysis: As described above, the technical metadata from various platforms can be tied together to help platform development teams understand the impact of changes. For example, what happens if the structure of a table in a data source is modified? Which reports are impacted? What is the impact if the data integration platform has to be migrated?
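To illustrate, here is a minimal sketch of impact analysis over stitched-together technical metadata, modeled as a lineage graph whose edges point from each object to the objects built from it. The object names are hypothetical; finding everything downstream of a changed table reduces to a breadth-first traversal.

```python
from collections import deque

# Hypothetical lineage: source object -> objects derived from it.
lineage = {
    "src.orders": ["etl.load_orders"],
    "etl.load_orders": ["dw.fact_sales"],
    "dw.fact_sales": ["report.monthly_revenue", "report.exec_dashboard"],
}

def downstream_impact(changed_object):
    """Breadth-first walk collecting everything downstream of a change."""
    impacted, queue = set(), deque([changed_object])
    while queue:
        for dependent in lineage.get(queue.popleft(), []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

print(sorted(downstream_impact("src.orders")))
# ['dw.fact_sales', 'etl.load_orders',
#  'report.exec_dashboard', 'report.monthly_revenue']
```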
Development Standards: Technical metadata from reporting platforms, such as the SQL embedded in reports, and from data integration platforms can reveal bad programming practices. These findings can then drive the creation of development standards to increase efficiency.
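As a small illustration, the sketch below lints SQL text harvested from report metadata against two common anti-patterns. The rules are assumptions chosen for the example; an actual standards check would be broader and tuned to the platform.

```python
import re

# Hypothetical rule set: pattern -> guidance to surface.
RULES = [
    (re.compile(r"select\s+\*", re.IGNORECASE), "Avoid SELECT *"),
    (re.compile(r"\bnot\s+in\s*\(\s*select", re.IGNORECASE),
     "Prefer NOT EXISTS over NOT IN (subquery)"),
]

def lint_sql(sql):
    """Return the guidance messages triggered by a SQL statement."""
    return [message for pattern, message in RULES if pattern.search(sql)]

print(lint_sql("SELECT * FROM dw.fact_sales"))  # ['Avoid SELECT *']
```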
Fraud Analysis: System access logs are a great source of information for understanding fraudulent transactions. Similarly, metadata around email exchanges between parties provides useful input for understanding their role in fraudulent activity.
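As a hedged sketch of the access-log idea, the example below flags users whose activity falls outside business hours unusually often. The log records, hours window, and threshold are all hypothetical; real fraud analytics would combine many more signals.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access log: (user, ISO timestamp) pairs.
access_log = [
    ("alice", "2024-03-01T10:15:00"),
    ("alice", "2024-03-01T14:02:00"),
    ("mallory", "2024-03-01T02:47:00"),
    ("mallory", "2024-03-02T03:12:00"),
]

# Count accesses outside an assumed 06:00-22:00 business window.
off_hours = Counter()
for user, timestamp in access_log:
    hour = datetime.fromisoformat(timestamp).hour
    if hour < 6 or hour >= 22:
        off_hours[user] += 1

for user, count in off_hours.items():
    if count >= 2:  # arbitrary review threshold
        print(f"Review {user}: {count} off-hours accesses")
# Review mallory: 2 off-hours accesses
```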
Infrastructure Maintenance: Logs that capture the resource utilization of various applications and processes are a key input for infrastructure maintenance. These data points can even be fed into a machine learning algorithm to predict possible failures and drive what is called “preventive maintenance” of those servers.
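For a rough illustration of that idea, the sketch below, which assumes scikit-learn is available, treats unusual CPU/memory readings as early warnings worth investigating. The readings and contamination setting are hypothetical; a real pipeline would stream metrics from monitoring logs and could train on failure-labeled history for true prediction.

```python
from sklearn.ensemble import IsolationForest  # assumes scikit-learn

# Hypothetical [cpu_pct, mem_pct] samples from a server's utilization logs.
readings = [
    [35, 40], [38, 42], [36, 41], [37, 43], [34, 39],
    [95, 97],  # spike that may precede a failure
]

# Unsupervised anomaly detection: -1 marks an outlier reading.
model = IsolationForest(contamination=0.2, random_state=0).fit(readings)
for reading, label in zip(readings, model.predict(readings)):
    if label == -1:
        print("Investigate reading:", reading)
# Investigate reading: [95, 97]
```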
Usage Analysis: As organizations look to modernize their analytics platforms, inputs such as the most-used reports, data sources, etc., help rationalize which objects to migrate from the older platform to the new one, rather than blindly transferring everything.
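A minimal sketch of that rationalization step, assuming report execution counts can be pulled from audit logs: rank reports by usage and migrate only those above a cutoff. The log entries and threshold are hypothetical.

```python
from collections import Counter

# Hypothetical audit log: one entry per report execution.
audit_log = [
    "report.monthly_revenue", "report.monthly_revenue",
    "report.exec_dashboard", "report.legacy_inventory",
    "report.monthly_revenue",
]

MIN_RUNS = 2  # arbitrary cutoff for "actively used"
usage = Counter(audit_log)
migrate = [report for report, runs in usage.items() if runs >= MIN_RUNS]
review = [report for report, runs in usage.items() if runs < MIN_RUNS]

print("Migrate:", migrate)          # ['report.monthly_revenue']
print("Review or retire:", review)  # the rest
```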