Building intelligent Q&A chat systems with RAG, LangChain, and ChromaDB: a step-by-step guide

In our previous article on semantic search, we explored creating embeddings to rank documents by their relevance to a given query phrase. However, semantic search has broader applications beyond document ranking. It can serve as a mechanism to connect Large Language Models (LLMs) to knowledge bases, enabling them to analyze and leverage external data sources.

By default, general-purpose LLMs only possess information from their training datasets. However, if we provide them with additional data, they can work with it in two particularly useful ways:

  • Analyze data and find hidden connections: By feeding LLMs external data, we enable them to uncover relationships and patterns between different pieces of information.
  • Summarize large amounts of information: LLMs can summarize large volumes of data, extracting specific and relevant details.

What is RAG?

RAG, short for Retrieval-Augmented Generation, is a powerful framework that combines retrieval models with language generation models. It retrieves documents from a given knowledge source based on their relevance to a specific context, and then uses those documents to augment the generation process, producing enhanced responses in conversational chat applications.

The key idea behind RAG is to use retrieval models to retrieve relevant documents, and to use language generation models to generate responses based on those documents. By incorporating a retrieval mechanism, RAG can provide more accurate and contextually relevant responses in conversational systems.
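To make that flow concrete, here is a minimal sketch of the retrieve-then-generate loop. The retriever and llm arguments are hypothetical stand-ins for whatever retrieval and generation layers you use; later in this guide we implement both with ChromaDB and OpenAI via LangChain.

// A minimal sketch of the RAG flow: retrieve, augment, generate.
// `retriever` and `llm` are hypothetical stand-ins, not real library APIs.
async function answerWithRag(
  question: string,
  retriever: (query: string) => Promise<string[]>, // semantic search over a knowledge base
  llm: (prompt: string) => Promise<string> // any text-generation model
): Promise<string> {
  // 1. Retrieval: find documents semantically related to the question.
  const docs = await retriever(question);

  // 2. Augmentation: inject the retrieved documents into the prompt.
  const prompt = [
    "Answer the question using only the context below.",
    "Context:\n" + docs.join("\n---\n"),
    "Question: " + question,
  ].join("\n\n");

  // 3. Generation: let the LLM produce a grounded answer.
  return llm(prompt);
}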

Some key use cases of RAG include:

  • Relevant book recommendations for students: Students can leverage RAG to obtain a curated list of relevant books for their case studies, research papers, or learning materials.
  • Q&A-based chat for employees: Employees can use a RAG-powered question-and-answer chat system to retrieve specific documents, policies or company standards.

However, it's essential to understand that LLMs themselves cannot access external resources or perform operations beyond generating text. This is where the LangChain library comes in. LangChain enables interaction between LLMs and external systems, allowing for seamless switching of context between the LLM and our software.

What is LangChain?

LangChain is a communication layer between our software and an LLM. It is a set of tools and functions that allow us to:

  • Quickly bootstrap an LLM: LangChain provides an easy way to bootstrap an LLM and use it as the generative model for conversational chat applications.
  • Create a chain of prompts: LangChain allows us to create a chain of prompts that go back and forth between the LLM and our software. Such a chain is a sequence of prompts that take some output from the LLM, perform operations on it, and then pass it back to the LLM along with augmented data.
  • Communicate between software and LLM: LangChain allows us to send information back to our software, process it automatically, and then send it back to the LLM for subsequent processing.

The chain of prompts helps to alleviate some of the weaknesses of LLMs. By adding prompts that can perform specific operations on the input, we can channel the dialog in a more structured way, while also enriching the data being fed to the LLM.

Now let's discuss certain limitations of large language models (LLMs) and understand the techniques we can utilize to overcome those limitations.

Adding memory to LLMs and understanding context windows

LLMs are stateless, which means they do not have the ability to remember previous messages or maintain short-term memory like humans do. To overcome this limitation, we can implement context windows and add memory to LLMs.

Context windows involve providing the LLM with the entire conversation history, including past user inputs and model responses. By feeding this context back in, the LLM can generate more accurate and relevant responses in ongoing conversations.

Context memory limitations

While adding memory to LLMs using context windows is a significant step towards improving their performance, it is essential to take into account the limit on the number of tokens allowed in a single request. Token limits exist to ensure the efficient processing of large language models.

For instance, models like gpt-3.5-turbo and gpt-3.5-turbo-16k have token limits of 4,096 and 16,385 tokens, respectively. On the other hand, gpt-4 has a token limit of 8,192 tokens.

As the amount of context increases, the processing becomes slower and more resource-intensive. Therefore, merely adding an infinite amount of text is not feasible within these limitations.

To address this challenge, it is important to summarize the chat history and include only the most relevant messages. For example, if a chat contains 100 messages, it is pragmatic to include only the last 20 messages to ensure that the token count remains within the allowable limits.
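As a rough illustration, here is a minimal sketch of that trimming step, assuming the chat history is kept as a simple array of { type, content } objects (the same shape used by the formatChatHistory helper later in this guide):

interface ChatMessage {
  type: string; // the message author, e.g. the user or the assistant
  content: string;
}

// Keep only the most recent messages so the prompt stays within the token limit.
function trimHistory(history: ChatMessage[], maxMessages = 20): ChatMessage[] {
  return history.slice(-maxMessages);
}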

Note

Generally, 1 token corresponds to approximately 4 English characters, so 4,096 tokens are roughly equivalent to 16,384 characters.
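If you want a quick sanity check before sending a request, a rough character-based estimate is often good enough (for exact counts you would use a real tokenizer such as tiktoken). A minimal sketch based on the rule of thumb above:

// Very rough token estimate: roughly 4 English characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Example: warn if a prompt is likely to exceed gpt-3.5-turbo's 4,096-token limit.
const prompt = "..."; // the full prompt you are about to send
if (estimateTokens(prompt) > 4096) {
  console.warn("Prompt likely exceeds the context window; trim the chat history.");
}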

Step-by-step guide: building a conversational chat app with RAG and LangChain

In this guide, we will walk through the process of building a chat application where users can ask questions about Greek and Roman myths based on the book "Stories of Old Greece and Rome" by Emilie K. Baker.

Before we can proceed with building the application, there are several steps we need to follow. The first step involves preparing our data. Similar to the approach described in the ChromaDB semantic search article, we need to convert our data into documents and store them in Chroma.

Prerequisites

Before we dive into the code, there are a few packages we need to install. These packages are @langchain/core, @langchain/openai, and @langchain/community.

Here's a brief summary of what each package does:

@langchain/core: This package provides the core building blocks of LangChain, such as prompts, runnables, and output parsers.

@langchain/openai: This package bundles classes related to the OpenAI API, which we will use for automatic embedding generation.

@langchain/community: This package contains community-maintained integrations, including vector store utilities such as the Chroma class that we will use in our application.

To install these packages, you can run the following command in your terminal:

npm install @langchain/core @langchain/community @langchain/openai
# or yarn 
yarn add @langchain/core @langchain/community @langchain/openai

Once the packages are installed, we can proceed to the next step of our application.

Step 1: Preparing the data

To begin with, we need to split the large plain-text file into smaller, more manageable documents. This is necessary because a single document cannot accommodate the entire book. To achieve this, we will use the TextLoader and RecursiveCharacterTextSplitter.

// Import necessary modules for file path handling and text processing.
import path from "path";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Asynchronous function to load and split a text document.
async function getDocuments() {
  // Create the full file path for 'greek_and_roman_myths.txt'.
  const pathToDocument = path.join(
    process.cwd(),
    "src/assets/docs/greek_and_roman_myths.txt"
  );

  // Initialize TextLoader to load the document.
  const loader = new TextLoader(pathToDocument);
  const docs = await loader.load(); // Load the document.

  // Initialize RecursiveCharacterTextSplitter to split the text into chunks.
  const textSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 3000,
    chunkOverlap: 200,
  });

  // Return the split text chunks.
  return textSplitter.splitDocuments(docs);
}

The RecursiveCharacterTextSplitter is a utility class that helps us split plain text files into smaller documents for our chat application. We can adjust two options: chunkSize and chunkOverlap.

To control the size of each document, we use the chunkSize option. By specifying the desired size, we can make the chunks more manageable for further processing.

Preserving context is important when splitting documents. The chunkOverlap option helps with this. It adds a portion from the end and beginning of adjacent chunks to maintain the necessary context within each document.

To find the best configuration, I suggest experimenting with different values for chunkSize and chunkOverlap. This will help you determine the optimal chunk size for your text, based on its formatting and how strongly adjacent sentences depend on each other for context.
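As a starting point for that experimentation, a small script like the sketch below compares a few chunkSize/chunkOverlap combinations and reports how many chunks each one produces for the same document:

import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Compare several chunking configurations for the same document.
async function compareChunking(pathToDocument: string) {
  const docs = await new TextLoader(pathToDocument).load();

  const configs = [
    { chunkSize: 1000, chunkOverlap: 100 },
    { chunkSize: 3000, chunkOverlap: 200 },
    { chunkSize: 5000, chunkOverlap: 500 },
  ];

  for (const { chunkSize, chunkOverlap } of configs) {
    const splitter = new RecursiveCharacterTextSplitter({ chunkSize, chunkOverlap });
    const chunks = await splitter.splitDocuments(docs);
    console.log(`chunkSize=${chunkSize}, chunkOverlap=${chunkOverlap} -> ${chunks.length} chunks`);
  }
}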


Step 2: Loading items into ChromaDB

We have successfully prepared our documents, and now it's time to store them in ChromaDB so that our chat application can access and retrieve them efficiently.

First, we need to establish a connection with ChromaDB and load our documents.

We can achieve this easily using LangChain’s Chroma class from the @langchain/community package, as follows:

import { Chroma } from "@langchain/community/vectorstores/chroma";
import { OpenAIEmbeddings } from "@langchain/openai";

function createVectorStore() {
  const COLLECTION_NAME = "documents";
  const embeddings = new OpenAIEmbeddings();
  const vectorStore = new Chroma(embeddings, {
    url: process.env.CHROMADB_PATH,
    collectionName: COLLECTION_NAME,
  });

  return vectorStore;
}

We pass the OpenAIEmbeddings instance as the first argument to the Chroma constructor. This allows embedding vectors to be created automatically for our documents. The second argument is the connection configuration of our vector store instance.

If you're interested in learning more about embeddings, I recommend referring to the semantic search article, where we delve deeper into the concept. There we use a plain ChromaDB package without LangChain. In a nutshell, embeddings are vector representations that capture the semantic meaning of text data.
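If you are curious what an embedding actually looks like, you can generate one directly with the OpenAIEmbeddings class we just passed to Chroma. A small illustrative sketch:

import { OpenAIEmbeddings } from "@langchain/openai";

// Embed a single phrase and inspect the resulting vector.
const embeddings = new OpenAIEmbeddings();
const vector = await embeddings.embedQuery("Who is Zeus?");

console.log(vector.length); // e.g. 1536 dimensions with OpenAI's default embedding model
console.log(vector.slice(0, 5)); // the first few numbers of the vector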

Here is the full code of the script responsible for preparing our data for further operations in our chat application. We will name this file index-docs.js.

// Load environment variables from a .env file.
import "dotenv/config";

// Import modules for file path handling and text processing.
import path from "path";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Import modules for vector storage and OpenAI embeddings.
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { OpenAIEmbeddings } from "@langchain/openai";

// Asynchronously load and split a document into chunks.
async function getDocuments() {
  // Construct the full path to the document.
  const pathToDocument = path.join(
    process.cwd(),
    "src/assets/docs/greek_and_roman_myths.txt"
  );

  // Load the document using TextLoader.
  const loader = new TextLoader(pathToDocument);
  const docs = await loader.load();

  // Split the document into smaller parts with RecursiveCharacterTextSplitter.
  const textSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 3000,
    chunkOverlap: 200,
  });

  // Split and return the document.
  const splitDocs = await textSplitter.splitDocuments(docs);
  return splitDocs;
}

// Create a vector store for document handling.
function createVectorStore() {
  // Define the collection name for the documents.
  const COLLECTION_NAME = "documents";

  // Initialize OpenAI embeddings.
  const embeddings = new OpenAIEmbeddings();

  // Create and return a new Chroma vector store with specified settings.
  const vectorStore = new Chroma(embeddings, {
    url: process.env.CHROMADB_PATH,
    collectionName: COLLECTION_NAME,
  });
  return vectorStore;
}

async function build() {
  // Create a vector store and load documents.
  const vectorStore = await createVectorStore();
  const docs = await getDocuments();

  // Add the processed documents to the vector store.
  await vectorStore.addDocuments(docs);
}

// Run the build function.
build();

Step 3: Create an API route in the Astro application to handle chat requests

To handle user requests in our Astro application, we need to create an API route that supports the POST method. Since the specific implementation details of the Astro framework are beyond the scope of this article, I recommend referring to the practical guide to the Astro framework for more detailed information.

Here is the code for the API route:

// POST API route with request handling.
import type { APIRoute } from "astro";

export const POST: APIRoute = async ({ request }) => {
  // Parse JSON from the request.
  const body = await request.json();
  // Extract 'question' and 'history', with 'history' defaulting to empty.
  const { question, history = [] } = body;

  // Return a 400 (Bad Request) response if 'question' is missing.
  if (!question) {
    return new Response(JSON.stringify("Please provide query phrase"), {
      status: 400,
    });
  }

  // Get and invoke the runnable sequence with question and history.
  const chain = await getRunnableSequence();
  const result = await chain.invoke({ question, history });

  // Return the processed result as a JSON response.
  return new Response(JSON.stringify({ result }, null, 2));
};

In the provided code, notice the getRunnableSequence function. We will delve deeper into this function and its implementation details in a later section.

Step 4: RAG implementation and runnable sequences

To implement a conversational chat application using the RAG (Retrieval-Augmented Generation) approach, we need to follow these steps:

Step 1: Condense question

In order to handle natural language questions from users and derive information from the chat history, we need to consider a scenario where users may ask questions that can be deduced from the chat history. For example, instead of explicitly asking "Give me the traits of Zeus!" a user might ask "Who is Zeus?" and then follow up with "What are his traits?"

To handle this scenario, we need to pass the chat history as context and create an instruction to rephrase the follow-up question as a standalone question. This enables the model to generate more coherent and relevant responses based on the conversation context.

We will refer to this prompt as condenseQuestionTemplate.

Here is the condenseQuestionTemplate prompt:

const condenseQuestionTemplate = `
    If the user asks about mythology, use the conversation history.
    Given the following conversation and a follow up question,
    rephrase the follow up question to be a standalone question.

    Chat History:
    {chat_history}
    Follow Up Input: {question}
    Standalone question:
  `;

const CONDENSE_QUESTION_PROMPT = PromptTemplate.fromTemplate(
  condenseQuestionTemplate
);

The PromptTemplate is a component provided by @langchain/core. LangChain offers multiple methods for creating prompts, and it generates specific objects in the background that seamlessly interact with other LangChain components, such as chains and agents.

Note

In simple terms, chains are a combination of prompt and LLM models, while agents are responsible for managing the overall chat flow through the different chains and actions. They serve as the main control center of the chat bot.
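For example, the simplest possible chain is just a prompt piped into a model. A minimal sketch using the same classes we rely on later in this guide:

import { PromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";

// The simplest chain: a prompt piped into an LLM.
const prompt = PromptTemplate.fromTemplate("Tell me one fact about {topic}.");
const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo" });
const chain = prompt.pipe(model);

const response = await chain.invoke({ topic: "Zeus" });
console.log(response.content);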

Step 2: Creating an answering prompt to analyze the document and answer the question

In this step, we will create an answering prompt that analyzes the text document we have provided and generates a response to the standalone question we produced in the previous step.

const answerTemplate = `
    You should be nice to the user and provide short, witty but comprehensive answers.

    Answer the question based only on the given context.

    Step 1. Find the relevant answer based on the DOCUMENT.

    Step 2. Format it in a readable, user-friendly markdown format.

    DOCUMENT:
    --------
    {context}

    Question: 
    ---------
    {question}
`;

const ANSWER_PROMPT = PromptTemplate.fromTemplate(answerTemplate);

Note that we use curly braces for {context} and {question} as part of LangChain's template replacement convention. It is particularly important because we will be creating a pipeline of chains that will automatically fill in the necessary inputs.
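To see that substitution in action, you can format the prompt by hand; in the chain, these inputs will be filled in automatically:

// Formatting the prompt manually to inspect the result of the template replacement.
const filledPrompt = await ANSWER_PROMPT.format({
  context: "Zeus is the king of the Olympian gods...",
  question: "Who is Zeus?",
});

console.log(filledPrompt); // the template with {context} and {question} replaced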

Step 3: Combine all RAG components together and give it a try!

A crucial concept within LangChain is the use of Runnables. Runnables allow us to combine multiple LangChain components, such as LLM models, prompts, and input-processing functions, as well as entire chains, in a sequential manner using pipes.

Since this concept is essential to understand and implement successfully, I highly recommend referring to the official LangChain documentation for further details and examples.

So now, let's summarize what we should do.

To achieve the desired results, just follow these simple steps:

  • Pass the input question to the condenseQuestionTemplate.
  • Utilize a retriever to search for the relevant document using semantic search from Chroma.
  • Feed the document and the condensed question into the answerTemplate prompt to obtain the final results.

Now, similar to our index-docs.js script, we need to initialize a vector store using fromExistingCollection, which loads documents directly from our desired collection. In a nutshell, this is the same process as in index-docs.js; LangChain simply provides multiple ways to achieve the same thing. Next, we create a retriever from our database and configure it to retrieve the 20 most relevant documents to include in our context.

It is important to note that it is not recommended to include a large number of documents in the retriever as it may exceed the limits of our context window. It is advised to adjust this based on your specific needs and the nature of your data.

const vectorStore = await Chroma.fromExistingCollection(
    new OpenAIEmbeddings({ openAIApiKey: import.meta.env.OPENAI_API_KEY }),
    {
      collectionName: "documents",
      url: import.meta.env.CHROMADB_PATH,
    }
);

const retriever = vectorStore.asRetriever(20);

Since we have already created a retriever, let's combine everything into a Runnable sequence.

In the code snippet below, our sequence consists of the following steps:

  1. Format the input variables, specifically transforming the history into a string.
  2. Pass the CONDENSE_QUESTION_PROMPT, which is an instance of PromptTemplate.
  3. Use our ChatOpenAI model (gpt-3.5-turbo-1106) with the verbose flag set to true in order to debug and understand how it works behind the scenes.
  4. Lastly, pass StringOutputParser from @langchain/core/output_parsers, which parses the output directly into a string instead of the LangChain output object. Note that it does not process the actual LLM output, but rather converts the object returned by LangChain, so we get a string output.

function formatChatHistory(chatHistory: ChatMessage[]) {
  const formattedDialogueTurns = chatHistory.map((message) => {
    return `${message.type}: ${message.content}`;
  });

  return formattedDialogueTurns.join("\n");
}

const model = new ChatOpenAI({
    modelName: "gpt-3.5-turbo-1106",
    openAIApiKey: import.meta.env.OPENAI_API_KEY,
    verbose: true,
});

const standaloneQuestionChain = RunnableSequence.from([
    {
      question: (input) => input.question,
      chat_history: (input) => formatChatHistory(input.history),
    },
    CONDENSE_QUESTION_PROMPT,
    model,
    new StringOutputParser(),
]);

Furthermore, we need to create a Runnable called answerChain, which retrieves the documents based on the condensed question and passes them to the answer prompt. As you can see, we are using the Runnable interface to retrieve our documents and assign them to the context input variable of our prompt.

const answerChain = RunnableSequence.from([
    {
      context: retriever.pipe(formatDocumentsAsString),
      question: new RunnablePassthrough(),
    },
    ANSWER_PROMPT,
    model,
    new StringOutputParser(),
  ]);

And lastly, to connect both Runnable sequences together, we need to add LangChain's secret sauce: piping. LangChain's piping interface is handy when you need to combine multiple chains together.

const chain = standaloneQuestionChain.pipe(answerChain);

Here is the full implementation of the getRunnableSequence method:

// Imports assumed at the top of the file. The formatChatHistory helper and
// the ChatMessage type are the ones defined earlier in this article.
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { PromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence, RunnablePassthrough } from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { formatDocumentsAsString } from "langchain/util/document";

async function getRunnableSequence() {
  const model = new ChatOpenAI({
    modelName: "gpt-3.5-turbo-1106",
    openAIApiKey: import.meta.env.OPENAI_API_KEY,
    verbose: true,
  });

  const condenseQuestionTemplate = `
    If the user asks about mythology, use the conversation history.
    Given the following conversation and a follow up question, 
    rephrase the follow up question to be a standalone question.

    Chat History:
    {chat_history}
    Follow Up Input: {question}
    Standalone question:
  `;

  const CONDENSE_QUESTION_PROMPT = PromptTemplate.fromTemplate(
    condenseQuestionTemplate
  );

  const answerTemplate = `
    You should be nice to the user and provide short, witty but comprehensive answers.

    Answer the question based only on the given context.

    Step 1. Find the relevant answer based on the DOCUMENT.

    Step 2. Format it in a readable, user-friendly markdown format.

    DOCUMENT:
    --------
    {context}

    Question: 
    ---------
    {question}
  `;

  const ANSWER_PROMPT = PromptTemplate.fromTemplate(answerTemplate);

  const vectorStore = await Chroma.fromExistingCollection(
    new OpenAIEmbeddings({ openAIApiKey: import.meta.env.OPENAI_API_KEY }),
    {
      collectionName: "documents",
      url: import.meta.env.CHROMADB_PATH,
    }
  );

  const retriever = vectorStore.asRetriever(20);

  const standaloneQuestionChain = RunnableSequence.from([
    {
      question: (input) => input.question,
      chat_history: (input) => formatChatHistory(input.history),
    },
    CONDENSE_QUESTION_PROMPT,
    model,
    new StringOutputParser(),
  ]);

  const answerChain = RunnableSequence.from([
    {
      context: retriever.pipe(formatDocumentsAsString),
      question: new RunnablePassthrough(),
    },
    ANSWER_PROMPT,
    model,
    new StringOutputParser(),
  ]);

  const chain = standaloneQuestionChain.pipe(answerChain);

  return chain;
}

Now, let's run it on the application and see how it looks! I've already prepared a nice UI for us to quickly test our application, so we can get started right away.

[Video: demo of the chat application in action]

Moreover, since we set the verbose flag in our ChatOpenAI model, we can observe in the terminal the steps that LangChain takes to produce our response.

[Video: verbose LangChain output in the terminal]

You can check out the demo and also dig into the project on GitHub.

Where can RAG be applied?

The RAG approach can be applied in various areas, enabling users to access and retrieve information efficiently. Here are some examples:

  • Employee benefits and policies:
    • Employees can use the chat Q&A feature to get information about their company's benefits, such as healthcare coverage, retirement plans, or vacation policies.
    • They can also seek information on legal matters, such as employment laws, workplace safety regulations, or discrimination policies.
  • Education and learning:
    • Students can leverage the chat Q&A function to quickly analyze their study materials and extract key points. This can help them grasp important concepts more efficiently.
    • They can also ask questions about specific subjects or topics, seeking clarification or additional explanations from their learning resources.
  • Product information and support:
    • Consumers can utilize the chat Q&A feature to access detailed information about products or services. For example, they can inquire about technical specifications, pricing, or availability.
    • They can also seek support and troubleshooting assistance by asking questions related to common product issues or usage scenarios.
  • Health and wellness:
    • Individuals can use chat Q&A to obtain information about their health, including symptoms, diseases, or available treatment options.
    • They can seek advice on maintaining a healthy lifestyle, such as nutrition tips, exercise routines, or stress management techniques.
  • Travel and tourism:
    • Tourists can utilize the chat Q&A function to gather information about travel destinations, attractions, or local customs and traditions.
    • They can ask questions regarding visa requirements, transportation options, accommodation recommendations, or popular tourist spots.

Summary

Let's summarize what we have learned on today's journey:

  • We have gained an understanding of how the RAG approach relates to semantic search and Large Language Models (LLMs).
  • We have explored the limitations of LLMs, such as memory constraints and context window restrictions, and discovered solutions like incorporating chat history and setting message limits.
  • We have introduced LangChain as a platform that enables seamless context switching between AI and our software, enhancing the conversational chat experience.
  • We have created a step-by-step example showcasing how to ask questions using a mythology book, demonstrating the efficiency of RAG.
  • We have identified various domains where the RAG approach can be applied, including employee benefits and education.

In conclusion, I encourage you to delve further into LangChain, as it combines various methods of interaction with LLMs. This understanding will shed light on how to create software based on generative AI (GenAI) principles, opening up new possibilities and advancements in the field of conversational AI.

LINK TO DEMO
LINK TO GITHUB REPOSITORY