Querying LLMs (Chatbots)#
In this first part of the course, we will send a single query to a language model and look at the output it returns. We will use LangChain, an open-source library for building applications with LLMs.
Exercise: Create new notebook
Create a new Jupyter Notebook called chatbot by clicking the File-menu in JupyterLab, and then New and Notebook.
If you are asked to select a kernel, choose “Python 3”.
Give the new notebook a name by clicking the File-menu in JupyterLab and then Rename Notebook.
Use the name chatbot.
Exercise: Stop old kernels
JupyterLab uses a Python kernel to execute the code in each notebook. To free up GPU memory used in the previous chapter, you should stop the kernel for that notebook. In the menu on the left side of JupyterLab, click the dark circle with a white square in it. Then click KERNELS and Shut Down All.
The Language Model#
We’ll use models from HuggingFace, a website that has tools and models for machine learning. For this task, we’ll use the open-weights LLM google/gemma-3-1b-pt. This is a small model with only 1 billion parameters, so it should run on most laptops.
Model types
google/gemma-3-1b-pt is a base model.
Base models have been trained on large text corpora, but not fine-tuned to a specific task.
Many models are also available in versions that have been fine-tuned to follow instructions, called it, instruct or chat models.
Instruct and chat models are more suitable for use in applications like chatbots.
Model Location#
We should tell the HuggingFace library where to store its data. If you’re running on Educloud/Fox project ec443, the model is already stored at the path below.
import os
os.environ['HF_HOME'] = '/fp/projects01/ec443/huggingface/cache/'
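The HuggingFace libraries read HF_HOME when they are imported, so the variable has to be set first. If you also run the notebook on a machine where the project directory does not exist, a small guard can be useful. This is a minimal sketch; the helper name configure_hf_home is our own and not part of any library.

```python
import os
from pathlib import Path
from typing import Optional


def configure_hf_home(shared_cache: str) -> Optional[str]:
    """Point HF_HOME at shared_cache, but only if that directory exists.

    Returns the value of HF_HOME afterwards, or None if it is unset
    (the HuggingFace libraries then fall back to their default cache).
    """
    if Path(shared_cache).is_dir():
        os.environ["HF_HOME"] = shared_cache
    return os.environ.get("HF_HOME")


# Path from the course text; on other machines this directory is missing
print(configure_hf_home('/fp/projects01/ec443/huggingface/cache/'))
```

Run this cell before any cell that imports langchain_huggingface or transformers.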
Loading the Model#
To use the model, we create a pipeline.
A pipeline can consist of several processing steps, but in this case, we only need one step.
We can use the method HuggingFacePipeline.from_model_id(), which automatically downloads the specified model from HuggingFace.
First, we import the library function that we need:
from langchain_huggingface import HuggingFacePipeline
We specify the model identifier. You can find the identifier on HuggingFace.
model_id='google/gemma-3-1b-pt'
HuggingFacePipeline also needs a parameter that tells it which task we want to do.
For this course, the task will always be text-generation.
task = 'text-generation'
If our computer has a GPU, using that will be much faster than using the CPU.
We can use the torch library to check if we have a GPU:
import torch
torch.cuda.is_available()
We enable GPU use by setting the argument device=0, which selects the first GPU. The value -1 means that the CPU is used.
device = 0 if torch.cuda.is_available() else -1
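The same check can be wrapped in a function that also tells us which device was chosen. This is a sketch under the assumption that PyTorch may or may not be installed; the function name pick_device is hypothetical.

```python
import importlib.util


def pick_device() -> int:
    """Return 0 (the first GPU) if PyTorch sees a CUDA GPU, otherwise -1 (CPU)."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so fall back to the CPU value
        return -1
    import torch
    if torch.cuda.is_available():
        print("Using GPU:", torch.cuda.get_device_name(0))
        return 0
    print("No GPU found, using the CPU")
    return -1


device = pick_device()
```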
Now, we are ready to load the model:
llm = HuggingFacePipeline.from_model_id(
    model_id,
    task,
    device=device
)
Using the Model#
Let’s try to send the model some input, to see how it responds.
result = llm.invoke("What is the world's largest lake")
print(result)
Note
The model doesn’t reply to the question. Instead, it continues or completes the text. That is because the model we are using is a base model. Base models are trained to complete texts, not to respond to questions.
Since we want to build a chatbot that answers questions, we should use a model trained for this. Chat models are often called instruct models. We’ll use the open-weights LLM google/gemma-3-1b-it.
model_id = 'google/gemma-3-1b-it'
Model Arguments#
Language models have a number of arguments that we can set.
First, we can limit the length of the output by setting max_new_tokens, for example to 100.
There are even more arguments that we can tweak. These are commented out below, so that they have no effect. You can try to remove the #-signs, so that they take effect. The arguments are described below.
llm = HuggingFacePipeline.from_model_id(
    model_id,
    task,
    device=device,
    pipeline_kwargs={
        'max_new_tokens': 100,
        #'do_sample': True,
        #'temperature': 0.09,
        #'num_beams': 4,
    }
)
This is a summary of the arguments to the pipeline:
model_id: the name of the model on HuggingFace
task: the task you want to use the model for
device: the GPU hardware device to use. If we don’t specify a device, no GPU will be used.
pipeline_kwargs: additional parameters that are passed to the model.
    max_new_tokens: maximum length of the generated text
    do_sample: if False, the most likely next word is chosen. This makes the output deterministic. We can introduce some randomness by sampling among the most likely words instead. The default value seems to be True.
    temperature: the temperature controls the statistical distribution of the next word and is usually between 0 and 1. A low temperature increases the probability of common words. A high temperature increases the probability of outputting a rare word. Model makers often recommend a temperature setting, which we can use as a starting point.
    num_beams: by default the model works with a single sequence of tokens/words. With beam search, the program builds multiple sequences at the same time, and then selects the best one in the end.
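To get a feeling for do_sample and temperature, we can try them on a toy example without any model. The sketch below uses made-up scores for four candidate words. Real models work with tens of thousands of tokens and may apply additional filters, so treat this as an illustration of the idea only.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by the temperature."""
    scaled = [x / temperature for x in logits]
    biggest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - biggest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for four candidate next words
words = ["lake", "sea", "river", "banana"]
logits = [3.0, 2.0, 1.0, -1.0]

# A low temperature sharpens the distribution, a high one flattens it
for temperature in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# do_sample=False: always take the most likely word
probs = softmax_with_temperature(logits, 1.0)
greedy = words[probs.index(max(probs))]

# do_sample=True: draw a word at random, weighted by its probability
random.seed(0)
sampled = random.choices(words, weights=probs, k=1)[0]
print(greedy, sampled)
```

With a temperature of 0.1 almost all probability lands on the top word, while a temperature of 10 makes the four words nearly equally likely.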
Making a Prompt#
We can use a prompt to tell the language model how to answer. The prompt should contain a few short, helpful instructions. In addition, we provide a placeholder for the messages in the conversation. LangChain replaces it with the actual messages when we invoke the chatbot.
Again, we import the library functions that we need:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
Next, we make the list of messages that will be the context for the chat. It consists of a system message to the model and a placeholder for the user’s messages.
messages = [
    SystemMessage("You are a learning assistant at the University of Oslo. Don't answer directly, but provide helpful hints."),
    MessagesPlaceholder(variable_name="messages")
]
This list of messages is then used to make the actual prompt:
prompt_template = ChatPromptTemplate.from_messages(messages)
LangChain processes input in chains that can consist of several steps. Now, we define our chain which sends the prompt into the LLM.
chatbot = prompt_template | llm
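The | operator may look unusual if you have not seen it used this way. The toy example below shows the idea of chaining two steps with it. It is not how LangChain is implemented, and the names Step, render_prompt and fake_llm are made up for this illustration. The text produced by render_prompt only roughly mimics what a plain text-completion LLM receives for a system message and a human message.

```python
class Step:
    """A toy processing step that can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Chaining gives a new step that runs self first, then other
        return Step(lambda value: other.invoke(self.invoke(value)))


def render_prompt(user_text):
    # Combine a fixed system message with the user's text
    system = "You are a learning assistant."
    return f"System: {system}\nHuman: {user_text}"


def fake_llm(prompt_text):
    # Stand-in for a language model: it only reports what it received
    return f"[model saw {len(prompt_text.splitlines())} lines]"


toy_chatbot = Step(render_prompt) | Step(fake_llm)
print(toy_chatbot.invoke("What is friction?"))
```

The output of the first step becomes the input of the second, just as the filled-in prompt becomes the input of the LLM in our chain.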
The chatbot is complete, and we can try it out by invoking it:
result = chatbot.invoke([HumanMessage("What is Newton’s 3rd law?")])
print(result)
Repetitive output
Language models sometimes repeat themselves. Repetition is especially likely here because we are using a small model.
Each time we invoke the chatbot, it starts fresh. It has no memory of our previous conversation. It’s possible to add memory, but that requires more programming.
result = chatbot.invoke([HumanMessage("What is friction?")])
print(result)
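The simplest form of memory is to keep a list of everything that has been said and send the whole list every time. The sketch below shows the principle with a stand-in function instead of a real model; fake_chatbot is a made-up name and only counts the messages it receives.

```python
def fake_chatbot(messages):
    """Stand-in for chatbot.invoke(): reports how many messages it received."""
    return f"I can see {len(messages)} message(s)."


# Without memory: each call only contains the newest question
print(fake_chatbot(["What is Newton's 3rd law?"]))
print(fake_chatbot(["What is friction?"]))

# With manual memory: keep a list and send the whole list every time
history = []
for question in ["What is Newton's 3rd law?", "What is friction?"]:
    history.append(question)
    answer = fake_chatbot(history)
    history.append(answer)
    print(answer)
```

With the real chatbot, the list would hold HumanMessage and AIMessage objects and be passed to chatbot.invoke().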
Bonus material#
Message History
Our current chatbot doesn’t keep track of the conversation history. This means that every question is answered with an empty context. We can add a message history to keep track of the conversation.
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import START, MessagesState, StateGraph
# Define a new workflow
workflow = StateGraph(state_schema=MessagesState)
# Define the function that calls the model
def call_model(state: MessagesState):
    prompt = prompt_template.invoke(state)
    response = llm.invoke(prompt)
    # llm.invoke() returns plain text; wrap it so it is stored as the AI's reply
    return {"messages": [AIMessage(response)]}
# Define the (single) node in the graph
workflow.add_edge(START, "model")
workflow.add_node("model", call_model)
# Add memory
memory = MemorySaver()
app = workflow.compile(checkpointer=memory)
# We can have multiple conversations, called threads
config = {"configurable": {"thread_id": "abc123"}}
# Function to interact with the chatbot using memory
def chatbot_with_memory(user_message):
    input_messages = [HumanMessage(user_message)]
    output = app.invoke({"messages": input_messages}, config)
    print(output["messages"][-1].content)
    print()
# Example usage
chatbot_with_memory("Who are you?")
chatbot_with_memory("Tell me about your ideal boat?")
chatbot_with_memory("Tell me about your favorite mermaid?")
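The thread_id in the configuration decides which conversation a message belongs to. Conceptually, this works like a dictionary with one message list per thread. The sketch below illustrates that idea with plain Python; it is not how MemorySaver is implemented, and the function remember and the thread id "xyz789" are made up for the example.

```python
from collections import defaultdict

# One message list per thread_id, created on first use
threads = defaultdict(list)


def remember(thread_id, message):
    """Store a message in the given thread and return that thread's history."""
    threads[thread_id].append(message)
    return threads[thread_id]


remember("abc123", "Who are you?")
remember("abc123", "Tell me about your ideal boat?")
remember("xyz789", "Who are you?")  # a separate conversation

print(len(threads["abc123"]), len(threads["xyz789"]))
```

In the same way, changing the thread_id in config starts a new conversation with an empty history.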
Exercises#
Exercise: Use a larger model
The model google/gemma-3-1b-it is a small model and will yield low accuracy on many tasks.
To take full advantage of the GPU, we should use a larger model.
Change the code in the example above to use the model google/gemma-3-4b-it.
This model has 4 billion parameters.
Does this change the output?
Exercise: Change the model parameters
Continue using the model google/gemma-3-4b-it.
Try to change the temperature parameter, first to 0.9, then to 2.0 and 10.0.
For the temperature to have an effect, you must also set the parameter 'do_sample': True.
How does changing the temperature influence the output?