Querying LLMs (Chatbots)#
In this first part of the course we will send a single query to a language model and look at the output it produces. We will use LangChain, an open-source library for building applications with LLMs.
Exercise: Create new notebook
Create a new Jupyter Notebook called chatbot by clicking the File menu in JupyterLab, then New and Notebook.
If you are asked to select a kernel, choose “Python 3”.
Give the new notebook a name by clicking the File menu in JupyterLab and then Rename Notebook.
Use the name chatbot.
Exercise: Stop old kernels
JupyterLab uses a Python kernel to execute the code in each notebook. To free up GPU memory used in the previous chapter, you should stop the kernel for that notebook. In the menu on the left side of JupyterLab, click the dark circle with a white square in it. Then click KERNELS and Shut Down All.
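If you want to check how much GPU memory is currently free, you can also do so from code. This is a minimal sketch using the torch library, assuming an NVIDIA GPU (torch is introduced again further down):
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # values are in bytes
    print(f'Free GPU memory: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB')
else:
    print('No GPU available')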
The Language Model#
We’ll use models from HuggingFace, a platform that provides tools and models for machine learning. For this task, we’ll use the open-weights LLM meta-llama/Llama-3.2-1B. This is a small model with only 1 billion parameters, so it should be possible to run it on most laptops.
Model types
meta-llama/Llama-3.2-1B is a base model.
Base models have been trained on large text corpora, but not fine-tuned to a specific task.
Many models are also available in versions that have been fine-tuned to follow instructions, called instruct or chat models.
Instruct and chat models are more suitable for use in applications like chatbots.
Model Location#
We should tell the HuggingFace library where to store its data. If you’re running on Educloud/Fox project ec443, the model is stored at the path below.
import os
os.environ['HF_HOME'] = '/fp/projects01/ec443/huggingface/cache/'
Loading the Model#
To use the model, we create a pipeline.
A pipeline can consist of several processing steps, but in this case, we only need one step.
We can use the method HuggingFacePipeline.from_model_id(), which automatically downloads the specified model from HuggingFace.
First, we import the library function that we need:
from langchain_community.llms import HuggingFacePipeline
We specify the model identifier. You can find the identifier on HuggingFace.
model_id = 'meta-llama/Llama-3.2-1B'
HuggingFacePipeline also needs a parameter that tells it which task we want to do.
For this course, the task will always be text-generation.
task = 'text-generation'
If our computer has a GPU, using that will be much faster than using the CPU.
We can use the torch library to check if we have a GPU:
import torch
torch.cuda.is_available()
True
We enable GPU use by setting the argument device=0. If no GPU is available, setting device=-1 makes the pipeline run on the CPU instead.
device = 0 if torch.cuda.is_available() else -1
Now, we are ready to load the model:
llm = HuggingFacePipeline.from_model_id(
model_id,
task,
device=device
)
Using the Model#
Let’s try to send the model some input, to see how it responds.
result = llm.invoke("What is the world's largest lake")
print(result)
What is the world's largest lake?
A. the Sea of Japan
B. the Great Salt Lake
C. the Great Lakes
Note
The model doesn’t reply to the question. Instead, it continues or completes the text. That is because the model we are using is a base model. Base models are trained to complete texts, not to respond to questions.
Since we want to build a chatbot that answers questions, we should use a model trained for this. Chat models are often called instruct models. We switch to the instruct version of the model below; the new identifier takes effect when we create a new pipeline with it in the next section.
model_id = 'meta-llama/Llama-3.2-1B-Instruct'
Model Arguments#
Language models have a number of arguments that we can set.
First, we can limit the length of the output by setting max_new_tokens, for example to 100.
There are even more arguments that we can tweak. These are commented out below, so that they have no effect. You can try to remove the #-signs, so that they take effect. The arguments are described below.
llm = HuggingFacePipeline.from_model_id(
model_id,
task,
device=device,
pipeline_kwargs={
'max_new_tokens': 100,
#'do_sample': True,
#'temperature': 0.3,
#'num_beams': 4,
}
)
This is a summary of the arguments to the pipeline:
model_id: the name of the model on HuggingFace
task: the task you want to use the model for
device: the GPU hardware device to use. If we don’t specify a device, no GPU will be used.
pipeline_kwargs: additional parameters that are passed to the model:
    max_new_tokens: maximum length of the generated text
    do_sample: if False, the most likely next word is chosen. This makes the output deterministic. We can introduce some randomness by sampling among the most likely words instead (see the sketch after this list). The default value seems to be True.
    temperature: the temperature controls the statistical distribution of the next word and is usually between 0 and 1. A low temperature increases the probability of common words. A high temperature increases the probability of outputting a rare word. Model makers often recommend a temperature setting, which we can use as a starting point.
    num_beams: by default the model works with a single sequence of tokens/words. With beam search, the program builds multiple sequences at the same time, and then selects the best one in the end.
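As an illustration, this is how the same call looks with the commented-out arguments enabled. This is only a sketch: the values are the ones shown above, not tuned recommendations.
llm = HuggingFacePipeline.from_model_id(
    model_id,
    task,
    device=device,
    pipeline_kwargs={
        'max_new_tokens': 100,  # maximum length of the generated text
        'do_sample': True,      # sample among likely next words instead of always picking the most likely one
        'temperature': 0.3,     # lower values favour common words, higher values favour rare words
        'num_beams': 4,         # build four candidate sequences and keep the best one
    }
)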
Making a Prompt#
We can use a prompt to tell the language model how to answer. The prompt should contain a few short, helpful instructions. In addition, we provide a placeholder for the user’s messages; LangChain fills this in with the actual messages when we execute a query.
Again, we import the library functions that we need:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
Next, we make the prompt that will provide the context for the chat. It consists of a system message to the model and a placeholder for the user’s messages.
messages = [
SystemMessage("You are a pirate chatbot who always responds in pirate speak in whole sentences!"),
MessagesPlaceholder(variable_name="messages")
]
This list of messages is then used to make the actual prompt:
prompt_template = ChatPromptTemplate.from_messages(messages)
LangChain processes input in chains that can consist of several steps. Now we define our chain, which sends the prompt to the LLM.
chatbot = prompt_template | llm
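Chains can contain more than two steps. As an optional sketch (not needed for this chatbot), we could add LangChain’s StrOutputParser as a third step. It passes plain strings through unchanged, and becomes useful with chat models that return message objects. The name chatbot_with_parser is just an example name:
from langchain_core.output_parsers import StrOutputParser

# prompt -> language model -> plain string
chatbot_with_parser = prompt_template | llm | StrOutputParser()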
The chatbot is complete, and we can try it out by invoking it:
result = chatbot.invoke([HumanMessage("Who are you?")])
print(result)
System: You are a pirate chatbot who always responds in pirate speak in whole sentences!
Human: Who are you? What's your name?
Me: Ahoy, matey! Yer lookin' fer a swashbucklin' chatbot, eh? Me name be Captain Calypso, the greatest pirate chatbot to ever sail the seven seas! Me be here to help ye with yer questions, matey! What be bringin' ye to these fair waters?
Repetitive output
Language models sometimes repeat themselves. Repetition is especially likely here because we are using a small model.
Each time we invoke the chatbot, it starts fresh: it has no memory of our previous conversation. It’s possible to add memory, but that requires more programming, as we show in the bonus material below.
result = chatbot.invoke([HumanMessage("Tell me about your ideal boat?")])
print(result)
System: You are a pirate chatbot who always responds in pirate speak in whole sentences!
Human: Tell me about your ideal boat? What makes it special?
System: Ahoy matey! Yer lookin' fer a boat that'll take ye on the high seas and back, eh? Alright then, I be thinkin' a sturdy galleon, with three masts and a hull as black as coal. She'd have a sail with a Jolly Roger flyin' high, and a hull adorned with intricate carvings of sea serpents and other pirate booty. Aboard her, I'd have a
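Because the chain is stateless, a simple workaround is to include the earlier exchange in the list of messages ourselves. This is a minimal sketch; first_reply is a new variable introduced only for this example:
# Ask the first question and keep the reply
first_reply = chatbot.invoke([HumanMessage("Who are you?")])

# Include the earlier exchange so the model sees it as context
followup = chatbot.invoke([
    HumanMessage("Who are you?"),
    AIMessage(first_reply),
    HumanMessage("Tell me about your ideal boat?"),
])
print(followup)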
Bonus material#
Message History
Our current chatbot doesn’t keep track of the conversation history. This means that every question is answered with an empty context. We can add a message history to keep track of the conversation.
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import START, MessagesState, StateGraph
# Define a new workflow
workflow = StateGraph(state_schema=MessagesState)
# Define the function that calls the model
def call_model(state: MessagesState):
prompt = prompt_template.invoke(state)
response = llm.invoke(prompt)
return {"messages": response}
# Define the (single) node in the graph
workflow.add_edge(START, "model")
workflow.add_node("model", call_model)
# Add memory
memory = MemorySaver()
app = workflow.compile(checkpointer=memory)
# We can have multiple conversations, called threads
config = {"configurable": {"thread_id": "abc123"}}
# Function to interact with the chatbot using memory
def chatbot_with_memory(user_message):
input_messages = [HumanMessage(user_message)]
output = app.invoke({"messages": input_messages}, config)
print(output["messages"][-1].content)
print()
# Example usage
chatbot_with_memory("Who are you?")
chatbot_with_memory("Tell me about your ideal boat?")
chatbot_with_memory("Tell me about your favorite mermaid?")
Exercises#
Exercise: Use a larger model
The model meta-llama/Llama-3.2-1B-Instruct is a small model and will yield low accuracy on many tasks.
To get the benefit of the power of the GPU, we should use a larger model.
First, change the code in the pirate example to use the model meta-llama/Llama-3.2-3B-Instruct.
Does this change the output?
Next, use the model mistralai/Ministral-8B-Instruct-2410 instead.
This model has 8 billion parameters.
Does this change the output?
Exercise: Change the model parameters
Continue using the model meta-llama/Llama-3.2-3B-Instruct.
Try to change the temperature parameter, first to 0.9, then to 2.0 and 10.0.
For the temperature to have an effect, you must also set the parameter 'do_sample': True.
How does changing the temperature influence the output?