Retrieval-Augmented Generation#

Retrieval-Augmented Generation (RAG) is a method for including (parts of) matching documents as context for questions to a Large Language Model (LLM). This can help reduce hallucinations and wrong answers. A system for RAG has two major parts: a document database with a search index and a large language model. The figure below shows the structure of our RAG program.

RAG flowchart

When the user asks a question, the question is handled in two stages. First, the question is used as a search query for the document database. The search results are then sent together with the question to the LLM. The LLM is prompted to answer the question based on the context in the search results.

We will use LangChain, an open-source library for making applications with LLMs. This chapter was inspired by the article Retrieval-Augmented Generation (RAG) with open-source Hugging Face LLMs using LangChain.

Exercise: Create new notebook

Create a new Jupyter Notebook called RAG by opening the File menu in JupyterLab and choosing New and then Notebook. If you are asked to select a kernel, choose “Python 3”. Give the notebook a name by opening the File menu again and choosing Rename Notebook. Use the name RAG.

Exercise: Stop old kernels

JupyterLab uses a Python kernel to execute the code in each notebook. To free up GPU memory used in the previous chapter, you should stop the kernel for that notebook. In the menu on the left side of JupyterLab, click the dark circle with a white square in it. Then click KERNELS and Shut Down All.

Document location#

We have collected some papers licensed with a Creative Commons license. We will try to load all the documents in the folder defined below. If you prefer, you can change this to a different folder name.

document_folder = '/fp/projects01/ec443/documents'

The Language Model#

We’ll use models from HuggingFace, a website that has tools and models for machine learning. We’ll use the open-weights LLM meta-llama/Llama-3.2-3B-Instruct, because it is small enough that we can use it with the smallest GPUs on Fox. If you run on a GPU with more memory, you can get better results with a larger model, such as mistralai/Ministral-8B-Instruct-2410.

Model Storage Location#

We must download the model we want to use. Because of the requirements mentioned above, we run our program on the Fox high-performance computer at UiO. We must set the location where our program should store the models that we download from HuggingFace:

import os
os.environ['HF_HOME'] = '/fp/projects01/ec443/huggingface/cache/'

Note

If you run the program locally on your own computer, you might not need to set HF_HOME.

The Model#

Now, we are ready to download and use the model. To use the model, we create a pipeline. A pipeline can consist of several processing steps, but in this case, we only need one step. We can use the method HuggingFacePipeline.from_model_id(), which automatically downloads the specified model from HuggingFace.

from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id='meta-llama/Llama-3.2-3B-Instruct',
    task='text-generation',
    device=0,
    pipeline_kwargs={
        'max_new_tokens': 500,
        'do_sample': True,
        'temperature': 0.3,
        'num_beams': 4
    }
)

Pipeline Arguments

We give some arguments to the pipeline:

  • model_id: the name of the model on HuggingFace

  • task: the task you want to use the model for; other alternatives are translation and summarization

  • device: the GPU hardware device to use. If we don’t specify a device, no GPU will be used.

  • pipeline_kwargs: additional parameters that are passed to the model.

    • max_new_tokens: maximum length of the generated text

    • do_sample: by default, the most likely next word is chosen. This makes the output deterministic. We can introduce some randomness by sampling among the most likely words instead.

    • temperature: the temperature controls the shape of the probability distribution over the next word and is usually between 0 and 1. A low temperature concentrates probability on common, likely words; a high temperature flattens the distribution, so rare words become more probable. Model makers often recommend a temperature setting, which we can use as a starting point (see the small example after this list).

    • num_beams: by default the model works with a single sequence of tokens/words. With beam search, the program builds multiple sequences at the same time, and then selects the best one in the end.
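
To build some intuition for what temperature does, here is a small sketch that applies temperature scaling to a set of made-up logits for four candidate words. It is not part of the RAG program, and the numbers are invented for illustration.

import numpy as np

def softmax_with_temperature(logits, temperature):
    '''Turn raw logits into probabilities, scaled by a temperature.'''
    scaled = np.array(logits) / temperature
    exps = np.exp(scaled - scaled.max())  # subtract the maximum for numerical stability
    return exps / exps.sum()

# Made-up logits for four candidate next words
logits = [4.0, 3.0, 2.0, 0.5]

for temperature in (0.3, 1.0, 2.0):
    print(f'temperature={temperature}:', softmax_with_temperature(logits, temperature).round(3))

With a low temperature almost all the probability mass goes to the most likely word; with a high temperature the distribution flattens and rarer words are sampled more often.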

Tip

If you’re working on a computer with less memory, you might need to try a smaller model. You can try for example mistralai/Mistral-7B-Instruct-v0.3 or meta-llama/Llama-3.2-1B-Instruct. The latter has only 1 billion parameters, and might be possible to use on a laptop, depending on how much memory it has.

Using the Language Model#

Now, the language model is ready to use. Let’s try to use only the language model without RAG. We can send it a query:

query = 'What are the major contributions of the Trivandrum Observatory?'
output = llm.invoke(query)
print(output)

This answer was generated based only on the information contained in the language model. To improve the accuracy of the answer, we can provide the language model with additional context for our query. To do that, we must load our document collection.

The Vectorizer#

Text must be converted into vectors (embeddings) before it can be processed. Our HuggingFace pipeline does this automatically for the large language model, but we must create a separate vectorizer for the search index over our document database. For this we use a text embedding model from HuggingFace. Again, the HuggingFace library will download the model automatically.

from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(
    model_name='BAAI/bge-m3',
    model_kwargs={'device': 'cuda:0'},  # or: {'device': 'cpu'}
    encode_kwargs={'normalize_embeddings': True}
)

Embeddings Arguments

These are the arguments to the embedding model:

  • ‘model_name’: the name of the model on HuggingFace

  • ‘device’: the hardware device to use, either a GPU or CPU

  • ‘normalize_embeddings’: embeddings can have different magnitudes. Normalizing scales every embedding to length 1, so that comparing two embeddings with a dot product gives their cosine similarity, as the example below shows.
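
To get a feeling for what the embedding model produces, we can embed two short, made-up sentences and compare them. Because the embeddings are normalized, the dot product of two embeddings is their cosine similarity.

import numpy as np

# Embed two short example sentences with the embedding model defined above
vector_a = huggingface_embeddings.embed_query('The telescope observed the night sky.')
vector_b = huggingface_embeddings.embed_query('Astronomers studied the stars.')

print('Embedding dimension:', len(vector_a))
# With normalized embeddings, the dot product equals the cosine similarity
print('Cosine similarity:', np.dot(vector_a, vector_b))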

Loading the Documents#

We use DirectoryLoader from LangChain to load all the files in document_folder, which we defined above.

from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(document_folder)
documents = loader.load()

The document loader loads each file as a separate document. We can check how long our documents are. For example, we can use the function max() to find the length of the longest document.

print('Number of documents:', len(documents))
print('Maximum document length:', max([len(doc.page_content) for doc in documents]))

We can examine one of the documents:

print(documents[0])
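
Printing a whole document gives a lot of output. To get a quicker overview, we can print only the metadata and the beginning of the text:

# The metadata contains, among other things, the source file name
print(documents[0].metadata)

# Print only the first 300 characters of the text
print(documents[0].page_content[:300])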

Splitting the Documents#

Since we are only using PDFs with quite short pages, we could use them as they are. Longer documents, for example long text documents or webpages, might need to be split into chunks. We can use a text splitter from LangChain to split the documents.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,  # could be larger for bigger models like mistralai/Ministral-8B-Instruct-2410
    chunk_overlap=200,
)
documents = text_splitter.split_documents(documents)

Text Splitter Arguments

These are the arguments to the text splitter:

  • ‘chunk_size’: the maximum size of each chunk. By default, the splitter measures size in characters, not in words or tokens.

  • ‘chunk_overlap’: the number of characters that are repeated in both chunks where the text is split, so that a sentence is less likely to lose its context. The toy example after this list shows how the two settings interact.
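
To see how ‘chunk_size’ and ‘chunk_overlap’ interact, here is a toy example that splits a single sentence with a much smaller chunk size. The values are chosen only for illustration and are far too small for real documents.

# A toy splitter with very small chunks, only to show the overlap between chunks
toy_splitter = RecursiveCharacterTextSplitter(
    chunk_size=25,
    chunk_overlap=10,
)
for chunk in toy_splitter.split_text('Retrieval-Augmented Generation combines search with text generation.'):
    print(repr(chunk))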

We can check if the maximum document length has changed:

print('Number of documents:', len(documents))
print('Maximum document length:', max([len(doc.page_content) for doc in documents]))

The Document Index#

Next, we make a search index for our documents. We will use this index for the retrieval part of ‘Retrieval-Augmented Generation’. We use the open-source library FAISS (Facebook AI Similarity Search) through LangChain.

from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(documents, huggingface_embeddings)

FAISS can find documents that match a search query:

relevant_documents = vectorstore.similarity_search(query)
print(f'Number of documents found: {len(relevant_documents)}')

We can display the first document:

print(relevant_documents[0].page_content)

For our RAG application we need to access the search engine through an interface called a retriever:

retriever = vectorstore.as_retriever(search_kwargs={'k': 3})

Retriever Arguments

These are the arguments to the retriever:

  • ‘k’: the number of documents to return (kNN search)
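
We can try the retriever on its own before building the full chain. In recent versions of LangChain, a retriever is called with invoke():

# The retriever returns the k most similar chunks for a query
retrieved = retriever.invoke(query)
for doc in retrieved:
    print(doc.metadata.get('source'), ':', doc.page_content[:80])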

Making a Prompt#

We can use a prompt to tell the language model how to answer. The prompt should contain a few short, helpful instructions. In addition, we provide placeholders for the context and the question. LangChain replaces these with the actual context and question when we execute a query.

from langchain.prompts import PromptTemplate

prompt_template = '''You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
Context: {context}

Question: {input}

Answer:
'''

prompt = PromptTemplate(template=prompt_template,
                        input_variables=['context', 'input'])
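
To see what the language model will actually receive, we can fill the template with example values. Here the context is just a placeholder string; in the real chain, LangChain inserts the retrieved document chunks.

# Fill the placeholders with example values to inspect the final prompt
print(prompt.format(
    context='(here LangChain will insert the retrieved document chunks)',
    input=query,
))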

Making the «Chatbot»#

Now we can use the function create_retrieval_chain from LangChain to make an agent for answering questions, a «chatbot».

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

combine_documents_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, combine_documents_chain)

Asking the «Chatbot»#

Now, we can send our query to the chatbot.

result = rag_chain.invoke({'input': query})
print(result['answer'])

Hopefully, this answer contains information from the context that wasn’t in the previous answer, when we queried only the language model without RAG.
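
The result also contains the retrieved context, so we can check which document chunks the answer was based on:

# The retrieved chunks are returned under the 'context' key
for doc in result['context']:
    print(doc.metadata.get('source'))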

Exercises#

Exercise: Use your own documents

Change the document location to your own documents folder. You can also upload more documents that you want to try with RAG. Change the query to a question that can be answered based on your documents. Try to run the query and evaluate the answer.

Exercise: Saving the document index

The document index that we created with FAISS is only stored in memory. To avoid having to reindex the documents every time we load the notebook, we can save the index. Try to use the function vectorstore.save_local() to save the index. Then, you can load the index from file using the function FAISS.load_local(). See the documentation of the FAISS module in LangChain for further details.
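
Here is one possible sketch, assuming you store the index in a folder called faiss_index. Depending on your LangChain version, loading may require the allow_dangerous_deserialization flag.

# Save the index to a folder (the folder name is just an example)
vectorstore.save_local('faiss_index')

# Later, load the index again with the same embedding model
vectorstore = FAISS.load_local(
    'faiss_index',
    huggingface_embeddings,
    allow_dangerous_deserialization=True,  # may be required in newer LangChain versions
)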

Exercise: Slurm Jobs

When you have made a program that works, it's more efficient to run it as a batch job than in JupyterLab. This is because a JupyterLab session reserves a GPU the whole time, even when you're not running computations. Therefore, you should save your finished program as a regular Python script that you can schedule as a job.

You can save your code by opening the “File” menu in JupyterLab, clicking “Save and Export Notebook As…” and then “Executable Script”. The result is a Python file, RAG.py, which is downloaded to your local computer. You will also need to download the slurm script LLM.slurm.

Upload both the Python file RAG.py and the slurm script LLM.slurm to Fox. Then, start the job with this command:

sbatch LLM.slurm RAG.py

Slurm creates a log file for each job which is stored with a name like slurm-1358473.out. By default, these log files are stored in the current working directory where you run the sbatch command. If you want to store the log files somewhere else, you can add a line like below to your slurm script. Remember to change the username.

#SBATCH --output=/fp/projects01/ec443/<username>/logs/slurm-%j.out