A large language model (LLM) is a type of artificial intelligence (AI) model that can recognize patterns in existing content and generate new text from it. It is estimated that by 2025, 50% of digital work will be automated with the help of such models. At their core, LLMs are trained on large amounts of content and data, and their architecture consists primarily of multiple layers of neural networks, such as embedding layers, attention layers, and feedforward layers. These layers work together to process the input and generate coherent, contextually relevant text. Against this backdrop, the terms large language model (LLM) and generative AI are often used interchangeably. Generative AI (GenAI), however, refers to a broader category of AI models designed to create new content that is not limited to text but also includes images, audio, and video. The AI taxonomy used in the context of this blog post is shown below.
LLMs such as OpenAI’s GPT-4, Google’s Gemini, and Anthropic’s Claude have become very popular with the general internet audience, especially when consumed through easy-to-use interfaces like ChatGPT for getting fast answers to queries such as “Who was the first president of the U.S.?” However, corporate adoption of these models for queries such as “What is the dollar value of the cost of poor data quality in purchase orders issued in 2022?” has been much slower. What are the reasons for this? Broadly, the possible issues fall into two main categories.
1. Data Quality Issues
LLMs such as ChatGPT and Gemini are trained on hundreds of terabytes of data from public sources such as Common Crawl, Reddit forums, Stack Overflow, Wikipedia, and so on. OpenAI’s GPT-3 model, the basis for GPT-3.5, has approximately 175 billion parameters (the exact amount of training data and the number of parameters in GPT-4 have not been officially disclosed by OpenAI). Training corpora of this size are nearly impossible to fully check and curate for accuracy, timeliness, and relevance, which often leads to poor data quality and ultimately to hallucinations, or factually incorrect responses. This is a significant issue in any corporate application: no business wants to be associated with a solution that has even a small probability of giving an incorrect response.
However, poor data quality isn’t necessarily the sole reason for hallucinations. If it were, training the model only on high-quality data (whatever that means) would make hallucinations disappear. Hallucinations are more a result of the stochastic sampling process that LLMs use to generate output: since every token is sampled from a probability distribution, there’s always a chance that something “goes wrong.”
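To make this concrete, here is a toy illustration of sampling a next token from a probability distribution; the vocabulary and scores are invented for the example and do not reflect any real model.

```python
# Toy illustration of stochastic next-token sampling; vocabulary and logits are made up.
import numpy as np

rng = np.random.default_rng(0)
vocabulary = ["18", "19", "20", "unlimited"]
logits = np.array([2.0, 1.2, 1.0, -1.0])  # hypothetical model scores for the next token

probs = np.exp(logits) / np.exp(logits).sum()  # softmax turns scores into probabilities
samples = rng.choice(vocabulary, size=10, p=probs)
print(samples)  # mostly "18", but lower-probability tokens still appear occasionally
```

Because low-probability tokens are never ruled out entirely, a fluent but factually wrong continuation can slip into an answer even if the training data were perfectly clean.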
2. Data Security and Privacy
Data today plays a significant role in strategic decisions, product development, marketing strategies, and customer engagement. Moreover, with stringent regulations such as the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), and the California Consumer Privacy Act (CCPA), companies are legally required to protect personal data, and breaches can lead to severe financial penalties and damage to the company’s reputation and brand. In short, data is a very valuable business asset, and companies want to protect it by keeping it private rather than sharing it openly on the internet.
To tie this back to the example above concerning the use of LLMs in corporate settings, it’s important to note that because of these privacy concerns, much of this valuable data was not included in the LLM training data because it wasn’t publicly available. This exclusion directly impacts the scope and accuracy of the models when applied to specific corporate queries.
Let us look at how these two issues impact a simple HR chatbot powered by an LLM such as GPT-4. Suppose a query such as “How many days of vacation do I have?” (known as a prompt) is typed into the HR chatbot. The HR chatbot is connected to the GPT-4 LLM via an API provided by OpenAI or Microsoft. The LLM interprets the query and generates an answer, and that answer is based on the information the model was trained on. For example, it may have seen a figure of 20 days in a Reddit forum, 19 days on Stack Overflow, 18 days on a labor department website, 17 days on Wikipedia, and so on. Based on this information, the LLM will generate a response, say 18 days, which happens to be incorrect because the website had not been updated for the last 18 months. Responses like this erode users’ trust in the chatbot. Clearly, deploying an LLM in this way is too risky for many companies.
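As a minimal sketch, the chatbot’s call to the LLM might look like the snippet below, using the OpenAI Python SDK; the model name and wiring are illustrative assumptions, and without any company-specific grounding the model can only guess.

```python
# Hypothetical HR chatbot call with no access to the company's own HR policies.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[{"role": "user", "content": "How many days of vacation do I have?"}],
)
# The answer is fluent but grounded only in whatever public data the model saw during training.
print(response.choices[0].message.content)
```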
What are the solutions to these problems? There are a few ways the problem can be addressed. The first is “fine-tuning.” Fine-tuning takes the last few layers of the LLM and retrains them on a specific corpus of data published or exposed by the company; in this example, that would be the company’s HR policy documents, since the vacation policies are maintained there. The earlier layers of the LLM capture general language understanding acquired through the massive pre-training process, while the final layers are responsible for the model’s specific outputs and decisions, and that is where fine-tuning comes into the picture. Retraining these “last” layers allows the LLM to adapt its understanding and responses to the new or specific task or domain, as sketched below.
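A rough sketch of the idea using Hugging Face Transformers: freeze every layer of an open model except the last few blocks and the output head. The model name and the number of unfrozen blocks are assumptions for illustration (a proprietary model such as GPT-4 cannot be fine-tuned this way; only via its vendor’s fine-tuning service).

```python
# Sketch of layer-wise freezing for fine-tuning; gpt2 stands in for a corporate-scale LLM.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative open model

# Freeze everything, then unfreeze only the last two transformer blocks and the output head,
# so the earlier layers keep their general language knowledge.
for param in model.parameters():
    param.requires_grad = False
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

# A standard training loop (e.g., the Trainer API) over the HR policy corpus would follow here.
```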
The main benefit of fine-tuning is that it leverages the pre-trained language knowledge of the LLM while adding domain-specific knowledge on top, so the LLM can generate responses that are more relevant to the company. Further, since fine-tuning only touches the last few layers, the LLM can effectively learn the nuances of the new task, as these layers become specific to it.
However, fine-tuning is a slow, expensive, and risky process. It requires significant computational power and an expert team to carry it out. Additionally, managing the model becomes problematic when the source information or data changes, since the entire expensive and slow process has to be repeated. As a result, fine-tuning is more effective for adjusting the consistent behavior of the model (e.g., answering questions in a chat style, generating code, etc.) than for regularly updating the model’s knowledge.
To address these challenges, researchers from Meta (Facebook) developed the RAG (retrieval-augmented generation) approach. RAG allows for more accurate and context-sensitive responses by integrating retrieval mechanisms with generative models.
So, how does RAG work? RAG starts by processing the content or knowledge corpus through tokenization, similar to how text is preprocessed internally by an LLM. During tokenization, the text is converted into tokens, which are then transformed into numerical vectors (or embeddings) that capture the meaning and relationships of the content. These vectors are stored in a vector database, such as Redis or Pinecone.
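A minimal indexing sketch under the following assumptions: sentence-transformers provides the embeddings, a FAISS index stands in for a managed vector database such as Redis or Pinecone, and the HR policy snippets are invented for the example.

```python
# Step 1 of a RAG sketch: embed the knowledge corpus and store the vectors in an index.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical HR policy snippets standing in for the company's knowledge corpus.
documents = [
    "Full-time employees accrue 18 days of paid vacation per year.",
    "Up to 5 unused vacation days may be carried over into the next calendar year.",
    "Sick leave is tracked separately from vacation leave.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

# Inner product over normalized vectors is equivalent to cosine similarity.
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(np.asarray(doc_vectors, dtype="float32"))
```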
So, when a user submits a query, for example, “How many days of vacation do I have?”, the RAG system leverages this vector (embeddings) database. The system performs a similarity search, comparing the user’s query vector against the stored document vectors, and retrieves the most similar chunks of the relevant documents or data. Essentially, the top-n most relevant documents or pieces of information related to “vacation” are retrieved, where n is a parameter defined in the RAG setup, often set between 5 and 10. The retrieved document chunks are then passed as context to the LLM: the query and the retrieved documents (context) are concatenated and sent to the LLM to generate the final response. A simplified version of the entire RAG process is shown below.
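In code, and continuing the indexing sketch above (reusing its `embedder`, `index`, and `documents`), the retrieval and prompt-assembly step might look like this; the prompt template is an assumption.

```python
# Step 2 of the RAG sketch: retrieve the top-n chunks and build the prompt for the LLM.
query = "How many days of vacation do I have?"
query_vector = embedder.encode([query], normalize_embeddings=True)

top_n = 2  # in practice often set between 5 and 10
scores, ids = index.search(np.asarray(query_vector, dtype="float32"), top_n)
context = "\n".join(documents[i] for i in ids[0])

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)
# `prompt` is then sent to the LLM, which generates the final, grounded response.
```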
RAG is basically designed to let you leverage LLMs on your own content or data: relevant content is retrieved to augment the context available to the model during generation. However, RAG is an evolving technology with both strengths and limitations. RAG integrates information retrieval from a dedicated, custom, and accurate knowledge base, reducing the risk of the LLM offering generic or irrelevant responses. For example, when the knowledge base is tailored to a specific domain (e.g., legal documents for a law firm), RAG equips the LLM with relevant information and terminology, improving the context and accuracy of its responses.
At the same time, there are limitations associated with RAG. It relies heavily on the quality, accuracy, and comprehensiveness of the information stored within the knowledge base: incomplete, inaccurate, or missing information can lead to misleading or irrelevant retrieved data. Overall, the success of RAG hinges on quality data.
So, how are RAG models implemented? RAG has two key components: a retriever model and a generator model. The retriever model identifies the documents in a large knowledge corpus that are most likely to contain information pertinent to a given query or prompt; vectors (or embeddings) generated from this corpus capture the semantic meaning of the content. The generator model then uses the query together with the retrieved documents to produce a coherent and contextually accurate response. While there are multiple commercial and open-source RAG platforms on the market (LangChain, LlamaIndex, Azure AI Search, Amazon Kendra, Abacus AI, and more), a typical implementation of RAG has five key phases, with a minimal code sketch of the assembled pipeline after the list.
- Training the Retriever: The retriever model is trained to encode both queries and documents into the same vector space, so that a query and the documents relevant to it end up with similar vectors.
- Retrieving Documents: For a given query, the retriever model encodes the query into a vector and retrieves the top-k most similar documents from the corpus based on vector similarity.
- Training the Generator: The generator model is fine-tuned using a dataset where the inputs consist of the query and the retrieved documents, and the outputs are the desired responses. This training helps the generator learn to utilize the context provided by the retrieved documents to produce accurate and relevant responses.
- Generating Responses: During inference, for a given query, the retriever first fetches the top-k relevant documents. These documents are then fed into the generator along with the query. The generator produces a response based on the combined input of the query and the retrieved documents.
- Integration and Optimization: The retriever and generator are integrated into a single pipeline where the output of the retriever directly feeds into the generator. In this phase, the retriever and generator can even be trained jointly to optimize overall system performance.
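Putting the phases together, a bare-bones retriever-plus-generator pipeline might look like the sketch below, reusing the `embedder`, `index`, and `documents` from the earlier snippets. The OpenAI model name and prompt wording are assumptions, and a platform such as LangChain or LlamaIndex would handle much of this plumbing (and the joint training) for you.

```python
# Sketch of the integrated RAG pipeline: retriever output feeds directly into the generator.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def retrieve(query: str, top_k: int = 5) -> list[str]:
    """Encode the query and return the top-k most similar document chunks."""
    vec = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(vec, dtype="float32"), min(top_k, len(documents)))
    return [documents[i] for i in ids[0]]

def answer(query: str) -> str:
    """Concatenate the query with the retrieved context and let the generator respond."""
    context = "\n".join(retrieve(query))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

print(answer("How many days of vacation do I have?"))
```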
By effectively combining retrieval-based and generation-based approaches, RAG addresses many of the limitations inherent in standalone LLMs. This hybrid technique enhances the model’s ability to generate more accurate, relevant, and contextually rich responses by leveraging large-scale, diverse datasets during the retrieval phase. At the same time, RAG itself is a highly dynamic field with many promising areas of research. For example, combining RAG with knowledge graphs appears to yield even higher-quality responses, especially on complex enterprise data.
By bridging the gap between retrieval and generation, RAG sets a new standard for intelligent, context-aware insights, enabling more nuanced and impactful applications in the generative AI age.
References
- springsapps.com/knowledge/large-language-model-statistics-and-numbers-2024
- ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/
- medium.com/@shaileydash/rag-or-retrieval-augmented-generation-simplified-5823a9257856
- blog.tobiaszwingmann.com/p/demystifying-ai-practical-guide-key-terminology
- arxiv.org/abs/2404.17723