For the 72.29% of companies that want to gain business insights through analytics, knowing what data exists across the enterprise and where to find it becomes essential. To meet this need, many organizations implement data catalogs: comprehensive directories that list and describe all available enterprise data. Not every data catalog functions the same way, and catalogs that use generative AI (known as “smart catalogs”) differ from older, more traditional ones.
To better understand smart data catalogs, how they increase productivity over traditional ones, their challenges and workarounds, and their future, we spoke with Dr. Juan Sequeda, principal scientist and head of the AI Lab at data.world.
Data Catalog Fundamentals
Any data catalog has fundamental functionalities that operate independently from generative AI capabilities. Both smart and traditional data catalogs:
- Ingest metadata to identify the datasets available across the enterprise
- Provide an interface where consumers can discover which datasets exist for the topics that interest them and retrieve that data
- Support filtered and drill-down searches that give consumers descriptions, lineage, and context for retrieved datasets
- Point to related datasets for a topic through profiling and tagging
- Show users where to access the data and metadata they find and are authorized to use
Additionally, data catalogs require ongoing work to maintain their quality and usability. Operations include inputting metadata, managing data across its source systems, and retrieving datasets of interest.
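To make these operations concrete, here is a minimal, hypothetical sketch in Python of what a catalog entry and a simple keyword search over it might look like. The `CatalogEntry` fields and the `search` helper are illustrative only, not any particular vendor’s schema or API.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One dataset's metadata record in a hypothetical catalog."""
    name: str
    description: str
    owner: str
    source_system: str
    tags: list[str] = field(default_factory=list)
    lineage: list[str] = field(default_factory=list)  # names of upstream datasets

def search(catalog: list[CatalogEntry], term: str) -> list[CatalogEntry]:
    """Naive keyword search over names, descriptions, and tags."""
    term = term.lower()
    return [
        e for e in catalog
        if term in e.name.lower()
        or term in e.description.lower()
        or any(term in t.lower() for t in e.tags)
    ]

# Example: register a dataset, then look it up by topic.
catalog = [
    CatalogEntry(
        name="quarterly_sales",
        description="Revenue by region and quarter",
        owner="finance-team",
        source_system="warehouse.sales",
        tags=["revenue", "sales"],
        lineage=["raw_orders"],
    )
]
print([e.name for e in search(catalog, "revenue")])  # ['quarterly_sales']
```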
Typical Data Catalog Roles
Data catalog operations involve producers, who create, edit, update, and remove the metadata inputted into the catalog, and consumers, who retrieve and read data to answer questions or explore issues.
Producers are typically data stewards and the data engineers who deliver analytics. Consumers include data analysts and data scientists who want insights from the data.
Smart Data Catalogs Improve Productivity
Sequeda explained how generative AI, which leverages conversational, chat-oriented interfaces to surface results from large language models (LLMs), improves productivity and encourages the adoption of a data catalog. With more traditional data catalogs, administrative tasks require significant manual intervention, time, and some advanced skills and analysis.
Smart catalogs remove these barriers by simplifying and automating some of the administrative workflows. As a result, team members in an organization see faster time to value and find it easier to get started with the catalogs.
On the data producers’ end, Sequeda said, “Generative AI automatically enriches metadata around the inputs and provides descriptions and synonyms” in the data catalog, smoothing catalog record creation and upkeep. Smart data catalogs also give data engineers “code summaries” of catalog queries, reducing the time spent on DataOps, including diagnosing any pipeline malfunctions.
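As a rough illustration of that enrichment step, the sketch below drafts a description and synonyms for a new catalog entry by prompting an LLM. The `call_llm` function is a placeholder for whichever model endpoint an organization uses, and the prompt wording is an assumption, not data.world’s actual implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for whichever LLM endpoint the catalog is wired to."""
    raise NotImplementedError("plug in your provider's client here")

def draft_entry_docs(name: str, columns: list[str], sample_rows: list[dict]) -> str:
    """Ask the LLM to draft a description and business synonyms for a new entry."""
    prompt = (
        "You are documenting an enterprise data catalog.\n"
        f"Dataset: {name}\n"
        f"Columns: {', '.join(columns)}\n"
        f"Sample rows: {sample_rows[:3]}\n"
        "Write a one-paragraph description and a short list of business synonyms."
    )
    return call_llm(prompt)  # a data steward reviews the draft before publishing
```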
Using smart data catalogs, consumers find inspiration when the generative AI suggests alternative queries based on previous searches and patterns in the results. Generative AI also makes it easier to ask questions about the data in natural language, so consumers don’t need to learn a programming language to communicate with a smart data catalog.
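A hedged sketch of that natural-language experience, reusing the `call_llm` placeholder from the previous example: the catalog supplies table documentation as context, and the LLM drafts the query the consumer would otherwise have to write by hand. The function name and prompt are illustrative.

```python
def question_to_sql(question: str, table_docs: str) -> str:
    """Draft a SQL query from a plain-language question, constrained to the
    tables and columns documented in the catalog."""
    prompt = (
        "Using only the tables and columns described below, write one SQL query.\n"
        f"{table_docs}\n"
        f"Question: {question}\n"
        "Return only the SQL."
    )
    return call_llm(prompt)

# question_to_sql("Which region had the highest revenue last quarter?",
#                 "quarterly_sales(region TEXT, quarter TEXT, revenue NUMERIC)")
```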
Customizing the Smart Data Catalog
Smart data catalogs work best by combining internal business context with the external information they already have, which comes from the documents used to train the LLMs. To best capture internal business context, Sequeda advises using data catalog products built on a knowledge graph architecture: a model that moves beyond relational rows and columns to capture the context of relationships between different data elements, or “nodes.”
Sequeda stated, “The knowledge graph gives rich, meaningful context and connections between datasets. Combining the LLM with an organizational knowledge graph is the key to capturing the true richness of context in an organization’s business framework, including relationships between data, metadata, people, processes, and decisions.”
Organizations can also control the quality of their smart data catalogs through knowledge graphs. Sequeda advocates for a Data Governance program to achieve a high-quality data catalog.
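To illustrate what such an organizational knowledge graph could look like, the sketch below uses Python’s rdflib to connect a dataset to its steward, its upstream source, and the dashboard it feeds, then asks a SPARQL question over those relationships. The namespace, predicates, and node names are invented for this example.

```python
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/catalog/")
g = Graph()
g.bind("ex", EX)

# Datasets, people, and governance context as nodes and relationships.
g.add((EX.quarterly_sales, EX.hasSteward, EX.alice))
g.add((EX.quarterly_sales, EX.feedsDashboard, EX.exec_dashboard))
g.add((EX.quarterly_sales, EX.derivedFrom, EX.raw_orders))
g.add((EX.quarterly_sales, EX.description, Literal("Revenue by region and quarter")))

# Which datasets feed the executive dashboard, and who stewards them?
q = """
PREFIX ex: <http://example.org/catalog/>
SELECT ?dataset ?steward WHERE {
  ?dataset ex:feedsDashboard ex:exec_dashboard ;
           ex:hasSteward ?steward .
}
"""
for row in g.query(q):
    print(row.dataset, row.steward)
```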
Challenges and Mitigations of Smart Data Catalogs
Organizations face several challenges when using smart data catalogs:
Fine-tuning the LLMs: Organizations may be tempted to give the smart data catalog the best possible training by fine-tuning the LLMs on their own data. The advantage of fine-tuning an LLM is that it is trained on your organization’s internal knowledge, not just general knowledge. Sequeda advised against this approach at the moment because:
- A fine-tuned model may quickly become out of date because it would not include the latest data. Sequeda said, “Whatever fine-tuning you do today to an LLM may not be valid for tomorrow.”
- The organization may not have enough data to impact the fine-tuning.
- The business needs to assess the ROI of fine-tuning an LLM, which includes not just the infrastructure setup but also an appropriate team.
“Consequently,” Sequeda noted, “unless a company has lots of internal data stored and protected and can support the time and money costs, a business is better off focusing on prompt engineering.”
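A minimal sketch of the prompt-engineering alternative, building on the `CatalogEntry` and `call_llm` placeholders from the earlier examples: instead of retraining a model, relevant catalog entries are retrieved and placed into the prompt at question time, so the model always sees current metadata.

```python
def answer_with_catalog_context(question: str, catalog: list[CatalogEntry]) -> str:
    """Prompt engineering instead of fine-tuning: pull relevant catalog entries
    into the prompt when the question is asked, so no retraining is needed."""
    words = [w.lower() for w in question.split()]
    hits = [
        e for e in catalog
        if any(w in e.name.lower() or w in e.description.lower() for w in words)
    ][:3]  # crude keyword retrieval; production systems typically use embeddings
    context = "\n".join(f"- {e.name}: {e.description} (owner: {e.owner})" for e in hits)
    prompt = (
        "Answer the question using only the catalog entries below. "
        "If they are insufficient, say so.\n"
        f"{context}\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```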
Incorrect results/hallucinations: Smart data catalogs leverage LLMs to provide recommendations, which, according to a recent MIT study, can increase human productivity by up to 59%. That said, LLMs are not the right approach if you expect 100% accuracy and fully automated solutions. To mitigate the accuracy limitations of LLMs, a problem LLM experts describe as “hallucinations,” Sequeda recommended keeping a human in the loop to check the returned results for correctness.
He explained that Data Governance and knowledge graphs will be increasingly critical in validating LLM results. When both are done well, a higher productivity gain and more automation can be attained. He stated, “An organization that gets results from the LLMs should be able to ground them in what they already know, and that construction of knowledge exists in the knowledge graph.”
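One simple way to approximate that grounding, assuming the toy rdflib graph from the earlier sketch: check that every dataset the LLM mentions is actually known to the knowledge graph, and flag anything unknown for a human reviewer. The entity-spotting heuristic here is deliberately crude and purely illustrative.

```python
def ground_answer(answer: str, graph) -> tuple[bool, list[str]]:
    """Flag dataset names in an LLM answer that the knowledge graph does not
    know about, so a human reviewer checks them before the answer is trusted."""
    known = {str(node).rsplit("/", 1)[-1] for node in graph.all_nodes()}
    # Crude entity spotting: treat snake_case tokens as candidate dataset names.
    mentioned = {w.strip(".,()") for w in answer.split() if "_" in w}
    unknown = sorted(m for m in mentioned if m not in known)
    return (not unknown, unknown)

ok, unknown = ground_answer("Join quarterly_sales with raw_orders and web_clicks.", g)
print(ok, unknown)  # False ['web_clicks'] -> route to a data steward for review
```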
Deployment structures: Sequeda explained that customers must choose which LLM to use with their smart data catalog and how to deploy it. Customers take full responsibility for choosing a deployment structure that works best for the business and complies with data protection laws.
Customer options include:
- Selecting a customized LLM
- Hiring a vendor to do the LLM setup
- Going with a vendor-recommended LLM with data that the customer can control completely
- Choosing a customized LLM where another vendor like Microsoft Azure provides the data storage
- Setting up a walled garden with approved external partners and managing that set of LLMs as a group
Whichever option the customer chooses, that business “will be responsible for sharing a key/s to its LLMs with the data catalog vendor.” The vendor then provides the catalog engine that combines metadata inputs with retrieval requests.
Use Cases: A Two-Way Interaction
Challenges and workarounds aside, smart data catalogs benefit users by encouraging natural, chat-based conversations with data. Sequeda sees these two-way interactions between users and the catalog as a self-reinforcing cycle.
Users learn quickly from the smart data catalog’s recommendations and assistance and ask more questions. Meanwhile, the smart catalog gets even more intelligent, providing better guidance to the user.
He provided two examples of how this data conversation works:
- An organization’s knowledge graph contains a node that is a view of the data, say a chart. Many applications and systems use this view, including the executive’s dashboard. The smart catalog identifies that no data steward is assigned to that view, a risk that the chart is not trustworthy or governed. It pings the Data Governance Council and recommends assigning a steward to that view. The user responding to the smart data catalog asks for suggestions, and the catalog replies with the name of an existing steward who has worked with a similar view (a rough sketch of this kind of check follows this list).
- A data producer finds system A in the smart catalog with a specific configuration. That person also oversees migrating data to system B, which uses similar technology. The smart data catalog finds connections between the people working on system A and their skills and recommends these people to the data producer. Through additional dialog with the smart data catalog, this supervisor learns whether the organization has enough people to migrate data to system B. Moreover, the data producer gets information about potential skill gaps and can decide whether to hire other people or invest in additional training.
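Continuing the toy rdflib graph from earlier, the sketch below shows roughly how the first use case could be expressed as graph queries: find views with no assigned steward, then suggest stewards who already cover a view built from the same upstream dataset. The `ex:View`, `ex:derivedFrom`, and `ex:hasSteward` terms are invented for illustration.

```python
from rdflib import RDF

# Extend the earlier toy graph: two chart views derived from the same source,
# only one of which already has a steward.
g.add((EX.revenue_chart, RDF.type, EX.View))
g.add((EX.revenue_chart, EX.derivedFrom, EX.raw_orders))
g.add((EX.sales_chart, RDF.type, EX.View))
g.add((EX.sales_chart, EX.derivedFrom, EX.raw_orders))
g.add((EX.sales_chart, EX.hasSteward, EX.alice))

# Views with no steward assigned.
unstewarded = g.query("""
PREFIX ex: <http://example.org/catalog/>
SELECT ?view WHERE {
  ?view a ex:View .
  FILTER NOT EXISTS { ?view ex:hasSteward ?anyone }
}
""")

# Suggest stewards who already cover a "similar" view
# (here: anything built from the same upstream dataset).
suggest = """
PREFIX ex: <http://example.org/catalog/>
SELECT DISTINCT ?steward WHERE {
  ?view ex:derivedFrom ?src .
  ?other ex:derivedFrom ?src ;
         ex:hasSteward ?steward .
}
"""
for row in unstewarded:
    names = {str(r.steward) for r in g.query(suggest, initBindings={"view": row.view})}
    print(row.view, "-> suggested stewards:", names or "none found")
```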
The Brain of the Organization and a Platform
In a future vision, Sequeda likens smart data catalogs to “the organization’s brain where users ask any question.” For example, a data consumer may ask a technical question, such as showing all the tables about a topic. Another data consumer may ask the smart catalog to return all the participants in an organizational decision that drove a sharp revenue increase.
In the short to medium term, Sequeda thinks a vendor like data.world will gain a greater understanding of the tasks that need the assistance of smart data catalogs by testing hypotheses and co-innovating with customers. In the longer term, smart data catalogs will contain the organization’s knowledge, including the data and all of its interrelationships.
Consequently, smart catalogs and their users will better connect the organization’s people and employees, business processes, decisions, and customers. That sets up the smart data catalog as a platform, with applications geared towards specialized search and discovery, Data Governance, DataOps, operational excellence, and more possibilities for individual teams.
The smart data catalog will continue to improve productivity in new ways, promising businesses better access to data for insights. With this advantage, many companies can make good decisions quickly.