The emergence of artificial intelligence (AI) brings data governance into sharp focus, because grounding large language models (LLMs) in secure, trusted data is essential to producing accurate responses.
So, what exactly is AI data governance?
Let’s define “AI data governance” as the process of managing the data product lifecycle within AI systems. To keep it simple, we can break down AI data governance into two main components.
The first is AI data privacy: any personally identifiable information (PII) or other sensitive data must be protected from unauthorized access and use, made accessible only to the users authorized to see it, and handled in compliance with data protection laws like the California Privacy Rights Act (CPRA), the General Data Protection Regulation (GDPR), and the Health Insurance Portability and Accountability Act (HIPAA).
In addition, bad actors keep trying to manipulate LLMs into giving away sensitive data and PII, for example by impersonating someone else and asking the LLM for that person’s credit card number or Social Security number (SSN). This makes data privacy even more important in the GenAI era.
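To make this concrete, here’s a minimal sketch (in Python, using only the standard library) of an output-side guardrail that scans an LLM response for common PII patterns before it ever reaches the user. The pattern names and the redact() helper are illustrative assumptions, not any particular product’s API.

```python
import re

# Hypothetical output-side guardrail: scan an LLM response for common
# PII patterns before it reaches the user. Patterns are illustrative.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(response: str) -> str:
    """Replace anything that looks like PII with a placeholder token."""
    for label, pattern in PII_PATTERNS.items():
        response = pattern.sub(f"[REDACTED {label.upper()}]", response)
    return response

print(redact("Sure! Your SSN is 123-45-6789."))
# -> "Sure! Your SSN is [REDACTED SSN]."
```

A filter like this is a last line of defense; the stronger control is never retrieving another person’s data in the first place, which is where the governance measures below come in.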
The second component of AI data governance is data quality, in two respects, since using data in AI systems is a two-way street: what goes in and what comes out.
What goes in is the data used for training and augmenting AI models. It needs to be clean, complete, and current so the model can respond to user queries as accurately and responsibly as possible.
What goes out is the data provided to users in those responses. For users to trust the data, not only should all relevant sources be cited (and clickable), but the model should also be able to explain how it arrived at its decision. The data should also be as free of bias as possible to prevent discrimination.
Ensuring data privacy and quality helps organizations manage risk, build trust with customers, and use AI apps responsibly. That is the essence of AI data governance. Now, let’s take a deeper dive into data privacy and data quality, especially in terms of the challenges these two aspects of governance face, and the things enterprises can do to address them.
AI Data Privacy Challenges
We recently surveyed 300 senior professionals who are directly involved in the planning, building, or delivery of GenAI applications and found that 48% of respondents listed data security and privacy as one of the top obstacles to using enterprise data with GenAI apps.
The challenges associated with AI data privacy can be broken into five separate categories, as follows:
1. Data breaches are also breaches of trust
LLMs, the core of most AI systems, are trained on vast amounts of publicly available external data. That said, a new breed of model, the enterprise LLM, can be augmented with your private internal data using frameworks like retrieval-augmented generation (RAG).
But here’s the rub: Your internal data includes PII and other sensitive information that is extremely valuable to bad actors, and thus a prime target for cyberattacks. Think about customer data or patient data. Data breaches can reveal the confidential data you store and expose your company to financial, legal, and reputational damage.
Fortunately, proper AI data governance includes sensitive data discovery tools, dynamic data masking, role-based access controls, and data isolation, all of which safeguard your company from data breaches. The key is having these governance rules in place from the start.
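As a rough illustration of how dynamic masking and role-based access can work together, here’s a minimal Python sketch. The roles, field names, and masking rule are assumptions for the example, not a prescription.

```python
# A minimal sketch of dynamic data masking with role-based access control.
# Roles, field names, and the masking rule are illustrative assumptions.
MASKED_FIELDS = {"ssn", "credit_card", "date_of_birth"}

def mask_value(value: str) -> str:
    """Keep the last 4 characters visible, mask the rest."""
    return "*" * max(len(value) - 4, 0) + value[-4:]

def apply_rbac(record: dict, role: str) -> dict:
    """Return a view of the record appropriate to the caller's role."""
    if role == "privacy_officer":   # trusted role sees cleartext
        return dict(record)
    return {
        field: mask_value(str(value)) if field in MASKED_FIELDS else value
        for field, value in record.items()
    }

customer = {"name": "Ada", "ssn": "123-45-6789", "city": "Austin"}
print(apply_rbac(customer, "support_agent"))
# -> {'name': 'Ada', 'ssn': '*******6789', 'city': 'Austin'}
```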
2. Data privacy has become a very public issue
Adhering to data protection laws is key, because non-compliance can lead to fines and penalties, as well as loss of customer trust.
To stay compliant with the long list of data protection laws around the globe, make sure the AI data governance tools you choose have capabilities like data minimization, data anonymization, and rules-based data access built in.
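Data minimization, for instance, can be as simple as stripping every field a downstream AI app doesn’t need before the data ever reaches a prompt. A hedged sketch, with a hypothetical allowlist:

```python
# A sketch of data minimization: only the fields an AI app actually
# needs ever leave the source system. The allowlist is hypothetical.
ALLOWED_FIELDS = {"first_name", "plan_tier", "open_tickets"}

def minimize(record: dict) -> dict:
    """Drop every field the downstream prompt does not require."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

full_record = {
    "first_name": "Ada", "ssn": "123-45-6789",
    "plan_tier": "gold", "open_tickets": 2,
}
print(minimize(full_record))
# -> {'first_name': 'Ada', 'plan_tier': 'gold', 'open_tickets': 2}
```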
3. Transparency helps explain how AI thinks
LLMs are considered black boxes, making it difficult to understand how they reach decisions. This lack of transparency leads to mistrust, and possible misuse, of generative AI apps, since the accuracy of the LLM responses cannot be verified.
Explaining how your model thinks (chain-of-thought reasoning, for example) and citing authoritative sources enhance transparency and build trust.
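One way to make citations a first-class part of every response is to carry the sources alongside the generated text. Here’s a small sketch of that idea; the GroundedAnswer structure is an assumption for illustration, not a standard API.

```python
from dataclasses import dataclass

# A sketch of a citation-carrying response object: every answer
# travels with the sources that informed it.
@dataclass
class Source:
    title: str
    url: str

@dataclass
class GroundedAnswer:
    text: str
    sources: list  # every retrieved passage that informed the answer

def render(answer: GroundedAnswer) -> str:
    citations = "\n".join(
        f"[{i + 1}] {s.title} ({s.url})" for i, s in enumerate(answer.sources)
    )
    return f"{answer.text}\n\nSources:\n{citations}"

print(render(GroundedAnswer(
    text="Refunds are processed within 5 business days.",
    sources=[Source("Refund policy", "https://example.com/refunds")],
)))
```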
4. Ethical use of AI is a moral imperative
AI can be misused for purposes like surveillance or profiling, which infringe on an individual’s privacy rights.
It’s imperative that organizations appoint supervisors to ensure that any AI apps used within the company align with defined ethical standards, and to prevent such misuse.
5. Algorithmic bias can lead to discrimination
LLMs often learn and pass on biases found in their training data, leading to unfair or discriminatory practices in hiring or lending decisions, for example, potentially violating individual privacy rights.
Using diverse and representative datasets, implementing fairness-aware algorithms, and regularly auditing LLMs can reduce such bias.
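A regular audit can start with something as simple as comparing selection rates across groups. The sketch below applies the well-known four-fifths rule, where a ratio below 0.8 flags possible adverse impact; the group outcomes shown are made-up illustration data.

```python
# A minimal fairness audit sketch: compare selection rates across groups
# using the "four-fifths rule". The outcome lists are illustration data.
def selection_rate(outcomes: list) -> float:
    return sum(outcomes) / len(outcomes)

def disparate_impact(group_a: list, group_b: list) -> float:
    """Ratio of the lower selection rate to the higher one."""
    rate_a, rate_b = selection_rate(group_a), selection_rate(group_b)
    return min(rate_a, rate_b) / max(rate_a, rate_b)

approved_a = [1, 1, 0, 1, 1, 0, 1, 1]   # 75% approved
approved_b = [1, 0, 0, 1, 0, 0, 1, 0]   # 37.5% approved
ratio = disparate_impact(approved_a, approved_b)
print(f"Impact ratio: {ratio:.2f}", "FLAG" if ratio < 0.8 else "OK")
# -> Impact ratio: 0.50 FLAG
```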
Addressing the privacy issues listed above requires a comprehensive approach to AI data governance, combining technical, legal, and ethical strategies to ensure the responsible and secure use of this emerging technology.
AI Data Quality Challenges
Ensuring AI data quality isn’t easy. In the same survey mentioned above, we found that data quality is one of the top concerns associated with building GenAI apps. That is because data quality plays a critical role in building trust in AI apps inside organizations.
Using active retrieval-augmented generation to ground LLMs with trusted private data and knowledge is crucial. But ensuring AI data quality for LLM grounding is tricky due to the following:
- Fragmented data
Enterprise data is often siloed in dozens of systems. Customer data, for example, is typically fragmented across CRM, billing, customer service, interaction management, call recordings, and the list goes on. This fragmentation makes it incredibly difficult to present a real-time, reliable customer view to your LLM in order to power your customer-facing AI apps.
To overcome this challenge, companies need a robust data infrastructure capable of real-time data integration and unification, master data management, data transformation, and validation. The more fragmented the data, the harder it is to achieve AI data quality.
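To illustrate the unification step, here’s a simplified sketch that assembles one customer’s records from several hypothetical source systems into a single view. Real pipelines would also handle identity matching, conflict resolution, and freshness; this just shows the shape of the problem.

```python
# A hedged sketch of unifying one customer's fragmented records into a
# single view for RAG grounding. Source names and fields are assumptions.
def unified_customer_view(customer_id: str, sources: dict) -> dict:
    view = {"customer_id": customer_id}
    for system_name, fetch in sources.items():
        record = fetch(customer_id)        # e.g., CRM, billing, support
        for field, value in record.items():
            view.setdefault(field, value)  # first system wins on conflict
    return view

sources = {
    "crm":     lambda cid: {"name": "Ada Lovelace", "segment": "enterprise"},
    "billing": lambda cid: {"balance_due": 0, "plan": "gold"},
    "support": lambda cid: {"open_tickets": 1},
}
print(unified_customer_view("c-42", sources))
```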
- Poor-quality metadata
Imagine an earth-bound translator trying to give instructions to a Martian. That’s what it feels like when AI apps encounter data with sparse metadata. Metadata is the data that describes your data. It acts as a crucial bridge between your organization’s information and your LLM’s ability to power your AI apps.
Rich metadata provides the context and understanding that an LLM needs to effectively use data to generate accurate and personalized responses. However, if your data catalog is poorly maintained, your metadata will be stale, and your AI initiatives will be ineffective.
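To see why, consider a retrieval chunk that carries its metadata with it. In the hedged sketch below (all field names are illustrative), the metadata makes staleness visible and filterable rather than invisible:

```python
# A sketch showing why rich metadata matters for retrieval: each chunk
# carries context the pipeline can act on. Field names are illustrative.
chunk = {
    "text": "Gold-plan customers get a 30-day refund window.",
    "metadata": {
        "source_system": "policy_db",
        "owner": "finance",
        "last_updated": "2024-05-01",   # staleness is visible, not hidden
        "entity_type": "refund_policy",
    },
}

def is_fresh(c: dict, cutoff: str = "2024-01-01") -> bool:
    """Filter out stale chunks before they reach the prompt."""
    return c["metadata"]["last_updated"] >= cutoff  # ISO dates sort lexically

print(is_fresh(chunk))  # -> True
```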
- The quality vs. privacy tradeoff
AI data quality can be negatively impacted by privacy measures, such as data masking and access controls, which can break your data’s referential consistency.
Referential consistency refers to the accuracy of relationships between different data points. When anonymization techniques, like static or dynamic data masking, disrupt these relationships, your data quality suffers. Masked data is less reliable and meaningful for both your LLM and your user.
Essentially, the very measures designed to protect data privacy can inadvertently undermine the quality of the data itself and prevent generative AI from extracting valuable insights. For this reason, your AI data governance solution should preserve referential consistency.
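One common way to preserve referential consistency is deterministic tokenization: the same input always maps to the same token, so masked identifiers still join across tables. A minimal sketch, assuming an HMAC-based token format chosen purely for illustration:

```python
import hashlib
import hmac

# A sketch of consistency-preserving anonymization: the same input always
# maps to the same token, so joins across tables still line up after
# masking. The key and token format are assumptions for illustration.
SECRET_KEY = b"rotate-me-in-production"

def tokenize(value: str) -> str:
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"tok_{digest[:12]}"

orders   = [{"customer_id": tokenize("c-42"), "order": "A-100"}]
payments = [{"customer_id": tokenize("c-42"), "amount": 99}]

# The masked IDs still match, so this customer's orders and payments
# can be joined without exposing the real identifier.
assert orders[0]["customer_id"] == payments[0]["customer_id"]
print(orders[0]["customer_id"])
```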
- The quality vs. strategy dilemma
Traditionally, data quality initiatives have been isolated efforts, disconnected from core business objectives and strategies. Such isolation makes it difficult to measure the impact of data quality improvements and to secure the investments you seek. As a result, data quality struggles to gain the attention it deserves.
AI apps rely on quality data to minimize AI hallucinations and generate accurate, reliable results. That dependence creates a great opportunity to point out the benefits of AI data governance, in terms of both privacy and quality, and to secure the necessary resources for continued improvement.
The Disconnect Between Data Lakes and AI
Many organizations use Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) pipelines to ingest multi-source enterprise data into centralized data lakes, which are then responsible for enforcing data governance. Early AI adopters used RAG tools and LLM agents to write functions that queried the data lake in response to user prompts. The problem is that the list of possible user prompts is endless, so no fixed set of query functions can cover them all.
So, despite their advantages in scalability, accessibility, and cost, data lakes are a poor fit for AI data governance and RAG, for three reasons. First, sensitive data may accidentally be leaked to the LLM or to an unauthorized user. Second, the cost of cleansing and querying the data at enterprise scale is extremely high. Third, data lakes don’t jibe with generative AI use cases that require clean, compliant, and current data.
Making Data AI-Ready and Governable
We were always taught to think big: big data stored in big data lakes. But the only way to make data AI-ready and governable is to think small – really small.
Imagine a data lake for one that continuously syncs a single entity’s data with your source systems, protects it to comply with your data privacy rules, and transforms it according to your data quality standards.
Now imagine millions of instantly accessible data lakes for one delivering AI personalization, at AI speed and scale, to millions of customers at the same time.
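As a thought experiment, the per-entity pipeline might look something like the following sketch. The function names are hypothetical, not a specific product’s API; the point is that sync, privacy protection, and quality transformation all happen per entity:

```python
# A hedged sketch of the "data lake for one" idea: a tiny, per-entity
# pipeline that syncs, protects, and transforms one customer's data on
# demand. Function names are hypothetical, not a specific product's API.
def build_entity_view(customer_id: str, sync, protect, transform) -> dict:
    raw = sync(customer_id)     # pull fresh data from source systems
    safe = protect(raw)         # apply privacy rules (masking, access)
    return transform(safe)      # enforce quality rules (clean, validate)

view = build_entity_view(
    "c-42",
    sync=lambda cid: {"id": cid, "ssn": "123-45-6789", "plan": " gold "},
    protect=lambda r: {**r, "ssn": "***-**-6789"},
    transform=lambda r: {**r, "plan": r["plan"].strip()},
)
print(view)  # -> {'id': 'c-42', 'ssn': '***-**-6789', 'plan': 'gold'}
```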
Conclusion
It takes a lot to achieve AI data governance – specifically as it relates to AI data privacy and AI data quality. But by understanding the unique challenges that come along with data privacy and quality, we can better understand the solutions as well.