By Oksana Sokolovsky and Rohit Mahajan.
The Data Management category of products began with a focus on Data Integration, Master Data Management, Data Quality and management of Data Dictionaries. Today, the category has grown in importance and strategic value, with products that enhance discoverability and usability of an organization’s data by its employees. Essentially, Data Management has shifted from a tactical focus on documentation and regulatory compliance to a proactive focus on driving adoption of Analytics and accelerating data-driven thinking. At the center of this change is the modern Data Catalog.
The Importance of the Catalog
Data Catalogs began life as little more than repositories for database schema, sometimes accompanied by business documentation around the database tables and columns. In the present technology environment, Data Catalogs are business-oriented directories that help users find the data they need, quickly. Instead of looking up a table name and reading its description, users can search for business entities, then find data sets related to them, so they can quickly perform analysis and derive insights. That’s a 180-degree turn toward the business and digital transformation.
While this newer, more business-oriented role for Data Catalogs is positive and progressive, it does not come without effort. A Data Catalog is powerful only if its content is comprehensive and authoritative. Conversely, Data Catalogs that are missing key business or technical information will see poor adoption and can hinder an organization’s goals around building a data-driven culture. But how can enterprises, with their vast array of databases, applications and, increasingly, Data Lakes, build a catalog that is accurate and complete?
Begin to Build
One way to build a Data Catalog is to team business domain experts with technologists and have them go through the systems to which their expertise applies. Step by step, table by table and column by column, these experts can build out the knowledge base that is the Data Catalog. The problem with this approach is that it’s slow: slower, in fact, than the rate at which most organizations are adding new databases and data sets to their data landscape. As such, this approach is unsustainable.
Adding to the complexity, it’s increasingly the case that subject matter experts’ knowledge won’t cover databases in their entirety, and “tribal knowledge” is what’s really required to make a Data Catalog comprehensive and trustworthy. This then leads to an approach of “crowdsourcing” catalog information across business units and, indeed, the entire enterprise, to build out the catalog.
While the inclusivity of such an approach can be helpful, relying on crowdsourcing to augment business domain experts and build an authoritative catalog won’t get the job done. Crowdsourcing alone is a wing-and-a-prayer approach to Data Management.
Enter AI and ML
In the modern data arena, Artificial Intelligence and Machine Learning must be used alongside subject matter expertise and crowdsourcing, both to get full value from that human input and to keep up with today’s explosive growth of data. Business domain expertise and crowdsourcing anchor the catalog. Machine Learning scales that knowledge across an enterprise’s data estate to make the catalog comprehensive.
Artificial Intelligence and Machine Learning can be used to discover relationships within a single database or Data Lake, as well as across several of them. While some of these relationships may be recorded in metadata, many will not be. Machine Learning, by analyzing the data itself, can find these hidden relationships; experts can then confirm the discoveries, which makes future discoveries more accurate.
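To make the idea of finding relationships from the data itself more concrete, here is a minimal sketch. It assumes a simple value-overlap heuristic rather than a trained model, and the column names, sample values and the `threshold` parameter are all hypothetical. A column whose distinct values are almost entirely contained in another column is flagged as a candidate relationship for an expert to confirm.

```python
from itertools import permutations

def containment(child, parent):
    """Fraction of distinct child values that also appear in parent."""
    child_set, parent_set = set(child), set(parent)
    if not child_set:
        return 0.0
    return len(child_set & parent_set) / len(child_set)

def discover_relationships(columns, threshold=0.9):
    """Return (child, parent, score) candidates whose overlap meets the threshold."""
    candidates = []
    # Check every ordered pair, since containment is not symmetric.
    for (child_name, child), (parent_name, parent) in permutations(columns.items(), 2):
        score = containment(child, parent)
        if score >= threshold:
            candidates.append((child_name, parent_name, score))
    return sorted(candidates, key=lambda c: -c[2])

# Hypothetical sample data; no metadata links these columns.
columns = {
    "crm.customers.id": [101, 102, 103, 104],
    "sales.orders.cust_ref": [101, 101, 103, 104],
    "hr.employees.badge": [9001, 9002, 9003],
}
candidates = discover_relationships(columns)
for child, parent, score in candidates:
    print(f"{child} -> {parent} (overlap {score:.2f}), pending expert review")
```

In this sample, only the order reference column is proposed as pointing at the customer identifier. A production system would likely sample large columns and weigh data types and distributions as well, but the review step stays the same.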
Leveraging this relationship discovery helps extrapolate expert and crowd-sourced information in the catalog. When business entities are defined and associated with certain data elements, that same knowledge can be applied to related elements without having to be entered again. When business entities are tagged, the tags from related entities can be applied as well, so that discovered relationships can yield discovered tags.
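The propagation described above might look something like the following sketch. The function name, the tag values and the pair-based relationship list are assumptions made for illustration. Tags confirmed on one column are offered as suggestions on related columns, and nothing already confirmed is overwritten.

```python
from collections import defaultdict

def propagate_tags(confirmed_tags, relationships):
    """Suggest tags for related columns; confirmed tags are never changed."""
    suggestions = defaultdict(set)
    for a, b in relationships:
        # A relationship carries tags in both directions.
        for source, target in ((a, b), (b, a)):
            for tag in confirmed_tags.get(source, set()):
                if tag not in confirmed_tags.get(target, set()):
                    suggestions[target].add(tag)
    return dict(suggestions)

# Hypothetical catalog state: one column tagged by an expert,
# one relationship found by automated discovery.
confirmed_tags = {"crm.customers.id": {"Customer", "PII"}}
relationships = [("sales.orders.cust_ref", "crm.customers.id")]

suggested = propagate_tags(confirmed_tags, relationships)
for column, tags in suggested.items():
    print(column, "suggested tags:", sorted(tags))
```

Keeping suggestions separate from confirmed tags means an expert enters the definition once, and the related column inherits it only as a proposal.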
Automate and Verify
This work can be both automatic and governed: Artificial Intelligence and Machine Learning remove the hard work of applying related tags and definitions, but still allow experts to review and confirm such applications. Even columns with similar names can be cataloged in such an automated fashion. Machine Learning ensures this similarity is contextual, and not just based on spellings and patterns.
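As a rough illustration of why spelling alone is not enough, the sketch below compares a name-only score with one that also looks at the values in each column. It is a simplified stand-in for the contextual matching described above, not a Machine Learning model. The column names, sample values and the equal weighting are assumptions.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Spelling-based similarity between two column names, from 0 to 1."""
    return SequenceMatcher(None, a, b).ratio()

def value_overlap(a_values, b_values):
    """Jaccard overlap of the distinct values in two columns."""
    a_set, b_set = set(a_values), set(b_values)
    if not a_set or not b_set:
        return 0.0
    return len(a_set & b_set) / len(a_set | b_set)

def contextual_similarity(a, b, columns):
    """Blend name and value evidence; equal weights are an arbitrary choice."""
    return 0.5 * name_similarity(a, b) + 0.5 * value_overlap(columns[a], columns[b])

# Hypothetical columns: "cost_id" is spelled like "cust_id" but holds unrelated data.
columns = {
    "cust_id": [101, 102, 103],
    "customer_id": [101, 102, 103, 104],
    "cost_id": ["C-1", "C-2", "C-3"],
}

for other in ("customer_id", "cost_id"):
    print(other,
          round(name_similarity("cust_id", other), 2),
          round(contextual_similarity("cust_id", other, columns), 2))
```

By spelling alone, `cost_id` looks closer to `cust_id` than `customer_id` does. Once the values are considered, the ranking reverses.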
Data Quality rules for business entities can be defined by experts, then used as criteria to assess a column’s conformance to the definition. While this is essentially a validation check, Machine Learning can still be used to assess the match, by judging whether any non-conformant data is merely anomalous or a true indicator of an unrelated column. Inspection and confirmation by experts help make such assessments increasingly accurate.
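A conformance check of this kind could be sketched as follows. The email pattern, the two thresholds and the verdict labels are hypothetical. The fixed thresholds stand in for the judgment a Machine Learning model would make about whether failures are mere anomalies or a sign of an unrelated column.

```python
import re

# Hypothetical Data Quality rule for an "Email Address" business entity.
EMAIL_RULE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def conformance(values, rule):
    """Share of non-empty values that fully match the rule."""
    non_null = [v for v in values if v not in (None, "")]
    if not non_null:
        return 0.0
    return sum(1 for v in non_null if rule.fullmatch(str(v))) / len(non_null)

def assess(values, rule, match_at=0.8, reject_below=0.2):
    """Classify a column against an entity rule using illustrative thresholds."""
    score = conformance(values, rule)
    if score >= match_at:
        verdict = "likely match"       # remaining failures are probably anomalies
    elif score < reject_below:
        verdict = "likely unrelated"
    else:
        verdict = "needs expert review"
    return verdict, score

contact_email = ["a@x.com", "b@y.org", "n/a", "c@z.io", "d@w.net"]
phone = ["555-0100", "555-0101", "555-0102"]

print(assess(contact_email, EMAIL_RULE))
print(assess(phone, EMAIL_RULE))
```

The stray `n/a` does not disqualify the first column, while the second is rejected outright. Columns that land between the thresholds are the ones worth an expert’s time.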
Closing the Loop
Enterprises need Data Catalogs to help their team members find the data they need and make better decisions. Data Catalogs enable Self-Service Analytics but, more importantly, raise the efficacy and enthusiasm of information workers using Analytics for business advantage. The Data Catalog is an evangelist of, and confidence builder in, conducting business with the benefit of data insights.
Building that catalog manually, while laudable, is not an approach that will succeed. Human expertise underlies the accuracy and critical mass of a Data Catalog. But it is Machine Learning that will ensure its comprehensiveness, by replicating the expertise across an enterprise’s entire data landscape. Machine Learning lets humans apply their expertise to catalog their data in a smarter fashion. The Data Catalog lets humans apply data insight to business, and enables the business to work smarter, compete harder and win.