Data catalogs and Data Governance work together and intersect in several useful ways. Data catalogs communicate information about an organization’s data assets and where they are located. Data Governance, on the other hand, deals with the overall management of data, including its accuracy, usability, and security, and the processes the organization has established for handling it.
Data Governance programs often include data catalogs as a key part of their overall design.
Data catalogs organize data into a simple format that allows users and researchers to easily recognize and process it. The catalogs use metadata to provide an organized inventory of a business’s data assets, including the data stored in data lakes and data warehouses. (A good analogy is a library catalog, which provides a brief description of the book being sought and its location.)
At the most basic level, Data Governance and data catalogs intersect in their use of data and data sets. (Data sets are packages of data, or “data packages.”) Data Governance dictates the processes, while data catalogs focus on cross-connecting data packages.
Other places where Data Governance and data catalogs intersect include:
- Metadata
- Data lineage
- Machine learning
- Legal compliance
Emily Washington, a senior vice president for product management at Precisely, recently stated,
“Data catalog solutions have gone from a ‘nice to have’ to a ‘must have’ in the arsenal of data integrity capabilities. In order to achieve trusted data, it is imperative that organizations take a close look at how they are deriving trustworthy business intel from their catalog. A centralized, single source of data knowledge aids in delivering business context that helps leaders make confident decisions.”
Improving Data Quality with Data Catalogs
The data catalog improves Data Quality within the Data Governance program.
Using a data catalog helps organizations to manage their data. It also helps to enrich metadata, in turn supporting data discovery and Data Governance. Data catalogs provide researchers with a single source of truth when data questions arise.
The primary purpose of a Data Governance program is to ensure data is secure, safe, and of high quality. It supports developing controls and then enforcing them. A primary goal of Data Governance is to promote good Data Quality (the fitness of the data for its intended use), which includes any activities or processes that help to ensure data is suitable for use.
Data Quality is normally measured using six dimensions: accuracy, validity, completeness, consistency, uniqueness, and timeliness.
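As a rough illustration of how some of these dimensions can be scored, the following sketch computes completeness, uniqueness, validity, and timeliness for a small set of records. The records, field names, and thresholds are hypothetical, and the email rule is deliberately simplistic. Accuracy and consistency are left out because they require comparison against a trusted reference or a second system.

```python
from datetime import date

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"id": 1, "email": "ana@example.com", "updated": date(2024, 5, 1)},
    {"id": 2, "email": None, "updated": date(2024, 5, 3)},
    {"id": 2, "email": "bo@example.com", "updated": date(2023, 1, 15)},
    {"id": 4, "email": "not-an-email", "updated": date(2024, 4, 20)},
]

def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, field):
    """Share of distinct values relative to the number of rows."""
    return len({r[field] for r in rows}) / len(rows)

def validity(rows, field, is_valid):
    """Share of populated values that pass a validation rule."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(is_valid(v) for v in values) / len(values)

def timeliness(rows, field, as_of, max_age_days):
    """Share of rows updated within max_age_days of the as_of date."""
    return sum((as_of - r[field]).days <= max_age_days for r in rows) / len(rows)

scores = {
    "completeness(email)": completeness(records, "email"),
    "uniqueness(id)": uniqueness(records, "id"),
    # Assumed rule: a valid email contains "@". Real checks would be stricter.
    "validity(email)": validity(records, "email", lambda v: "@" in v),
    "timeliness(updated)": timeliness(records, "updated", date(2024, 6, 1), 90),
}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Scores like these could be stored in the catalog as metadata alongside each data package, so that researchers can see quality at a glance before using the data.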
Without a data catalog, researchers have to find data by sorting through various data packages, speaking with colleagues, and using tribal knowledge. (Or they can limit the research by relying on data they are already familiar with.) Data catalogs make research much more efficient.
Data catalogs help researchers find accurate data quickly, and their metadata can be used to identify and eliminate redundant data (which, in the long term, saves on data storage and management costs). They provide useful information about stored data and assist in data analytics. Because the metadata within the catalog must also be governed, it makes sense to coordinate this process through the Data Governance program.
Data Catalogs Require Metadata Governance
A Data Governance program is a mix of software and human behavior and is traditionally set up with a data steward as part of the program. The data steward is responsible for maintaining the Data Governance program and would normally be responsible for supporting the data catalog. With the assistance of other staff, this individual defines which metadata is collected and develops the metadata used for in-house data packages.
Data catalogs use metadata to provide brief descriptions of files or data packages (making the file or package identifiable) and their location. The underlying metadata (per the data steward) can also offer contextual information, which helps researchers find useful information. (Something as simple as including a date in the metadata can help determine whether a data package is useful.)
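A minimal sketch of this idea, assuming a catalog is simply a list of metadata entries, might look like the following. The `CatalogEntry` class, its fields, the `search` helper, and the storage locations are all hypothetical and are not taken from any particular catalog product.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogEntry:
    """Hypothetical metadata record describing one data package."""
    name: str
    description: str
    location: str
    created: date
    tags: set = field(default_factory=set)

# Illustrative entries; the locations are made-up URIs.
catalog = [
    CatalogEntry("sales_2019", "Monthly sales totals by region",
                 "warehouse://finance/sales_2019", date(2020, 1, 8), {"finance"}),
    CatalogEntry("sales_2023", "Monthly sales totals by region",
                 "warehouse://finance/sales_2023", date(2024, 1, 10), {"finance"}),
    CatalogEntry("web_clicks", "Raw clickstream events from the website",
                 "lake://marketing/web_clicks", date(2024, 3, 2), {"marketing"}),
]

def search(entries, keyword, created_after=None):
    """Return entries whose description or tags match the keyword,
    optionally keeping only those created on or after a given date."""
    keyword = keyword.lower()
    hits = []
    for entry in entries:
        matches = keyword in entry.description.lower() or keyword in entry.tags
        recent = created_after is None or entry.created >= created_after
        if matches and recent:
            hits.append(entry)
    return hits

for entry in search(catalog, "sales", created_after=date(2023, 1, 1)):
    print(entry.name, "->", entry.location)
```

Here the creation date in the metadata is what lets a researcher skip the older sales package without opening it.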
The traceability features offered by a data catalog can also promote good Data Quality by providing the ability to track and correct errors in the data.
Data Lineage, Data Governance, and Data Catalogs
The process of observing and understanding data as it flows from its source to its point of consumption is called data lineage. This includes all the transformations the data has undergone: how the data was transformed, what changed, and why it changed. Data lineage tracks the data’s movement from user to user and system to system, providing a trail throughout the data’s lifecycle.
The data lineage process is supported by the data catalog and allows businesses to track and correct errors in the data.
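One simple way to picture lineage is as a graph in which each data set points back to the sources and transformations that produced it. The sketch below assumes a hand-written graph with hypothetical data set names. Real catalogs usually capture this information automatically from pipelines and query logs.

```python
# Each key is a data set; the value lists (source, transformation) pairs
# that produced it. All names here are hypothetical.
lineage = {
    "revenue_dashboard": [("monthly_revenue", "aggregated by region")],
    "monthly_revenue": [("clean_orders", "summed by month")],
    "clean_orders": [("raw_orders", "deduplicated, currency normalized")],
    "raw_orders": [],
}

def trace_upstream(dataset, graph):
    """Walk from a data set back to its origins, returning
    (dataset, source, transformation) steps in the order visited."""
    steps = []
    stack = [dataset]
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and repeated sources
        seen.add(current)
        for source, transformation in graph.get(current, []):
            steps.append((current, source, transformation))
            stack.append(source)
    return steps

for target, source, transformation in trace_upstream("revenue_dashboard", lineage):
    print(f"{target} <- {source} ({transformation})")
```

If a figure on the dashboard looks wrong, a trail like this shows which upstream data sets and transformations to inspect first.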
Because the data catalog is part of the Data Governance program, and data lineage supports the storage of high-quality data, the three intersect here and together provide information that helps managers make informed business decisions. Accurate data can produce meaningful insights and support better decision-making.
Incorporating Machine Learning
Implementing a machine learning (ML) augmented data catalog solution can improve the efficiency of a Data Governance program and of analytics software.
ML-augmented data catalogs support the use of metadata to automate data integration, Data Quality, data preparation, and a variety of other Data Management activities. This, in turn, improves the overall efficiency of a Data Governance program. Because of the improved efficiency, the process of developing business insights is accelerated.
A modern ML-augmented data catalog may be used to establish semantic relationships between data packages using knowledge graphs.
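To make the knowledge graph idea concrete, the sketch below stores relationships as subject, predicate, object triples and uses them to find data packages that describe the same business concept. The triples, predicate names, and package names are hypothetical and written by hand. In an ML-augmented catalog, a model would typically propose such relationships, and that inference step is not shown here.

```python
# A tiny knowledge graph as (subject, predicate, object) triples.
# All package names, predicates, and concepts are illustrative.
triples = [
    ("customer_orders", "derived_from", "raw_orders"),
    ("customer_orders", "describes", "Customer"),
    ("support_tickets", "describes", "Customer"),
    ("inventory_levels", "describes", "Product"),
]

def related_packages(package, graph):
    """Find other packages that describe the same business concept."""
    concepts = {o for s, p, o in graph if s == package and p == "describes"}
    return sorted(
        {s for s, p, o in graph
         if p == "describes" and o in concepts and s != package}
    )

print(related_packages("customer_orders", triples))
```

A researcher looking at `customer_orders` could then be pointed to `support_tickets` as another source of customer information, even though the two packages share no column names.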
The Legal Compliance Intersection
The European Union, California, and Brazil have each passed laws to protect the personal information of their residents. In the European Union, the law is called the General Data Protection Regulation (GDPR). In California, it is called the California Consumer Privacy Act (CCPA). In Brazil, it is the Lei Geral de Proteção de Dados Pessoais (LGPD), or, in English, the General Personal Data Protection Law.
Businesses that handle personal data covered by these laws, wherever they are based, generally must comply with these protections or face harsh penalties.
The Data Governance program is normally designed to protect personal information with policies that prevent the casual and unethical viewing of a person’s data. A data catalog supports the Data Governance program in enforcing these policies and protecting customers’ personal information.
The catalog’s system of metadata descriptions and tagging is used to support the “storage limitation principle.” For regulatory purposes, the expiration date of personal information is normally included with the data package. A data catalog supports the elimination of personal data through the use of metadata tags.
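As a hedged sketch of how such tags might be used, the example below flags data packages whose personal data has passed its retention date. The tag names `contains_personal_data` and `retain_until`, the package names, and the dates are assumptions made for illustration. Actual retention periods depend on the applicable regulation and the organization's policies.

```python
from datetime import date

# Hypothetical catalog metadata; tag names and dates are illustrative.
packages = [
    {"name": "newsletter_signups", "contains_personal_data": True,
     "retain_until": date(2024, 3, 31)},
    {"name": "crm_contacts", "contains_personal_data": True,
     "retain_until": date(2026, 12, 31)},
    {"name": "product_catalog", "contains_personal_data": False,
     "retain_until": None},
]

def due_for_deletion(entries, today):
    """Return names of packages holding personal data past its retention date."""
    return [
        e["name"]
        for e in entries
        if e["contains_personal_data"]
        and e["retain_until"] is not None
        and e["retain_until"] < today
    ]

expired = due_for_deletion(packages, date(2025, 1, 1))
print(expired)
```

A scheduled job could run a check like this against the catalog and hand the resulting list to the data steward for review before anything is deleted.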
Overview
Data users generally trust their organization’s data when it has a Data Governance program that includes a data catalog. Data catalogs support and apply the policies of a business’s Data Governance program, as well as legal and regulatory requirements. Data catalogs can also support the needs of researchers by providing self-service data discovery. Data catalogs that include machine learning and artificial intelligence can accelerate the processes of metadata collection, tagging, and semantic inference.
The most important value of data catalogs, however, is the improved productivity of data teams because data catalogs can promote collaboration. Many organizations have their data stored in silos, with the data essentially hidden from researchers. Data teams can spend huge amounts of time attempting to find the data. Data catalogs eliminate, or at least minimize, the use of data silos, making data more accessible to all.