Session 3: Understanding Promptfoo Evaluations
Goal: Explore a real evaluation config (MBU Databeskyttelse) and understand how to read, run, and modify it
What We’ll Do Today
- ✅ Locate and open the MBU evaluation config
- ✅ Understand the config structure and each section
- ✅ Learn about the custom OWUI provider (vs. the HTTP provider from Session 2)
- ✅ Walk through different test patterns and assertion types
- ✅ Run the evaluation and interpret results
- ✅ Add your own test case
Part 1: The MBU Evaluation Config
Background: Where to Find It
The MBU Databeskyttelse evaluation is located in the prompfoo-docker GitHub repository:
eval-configs/MBU-Databeskyttelse/promptfooconfig.yaml
This evaluation tests the MBU Databeskyttelse assistant with real questions from users. It checks:
- Answer accuracy
- Document retrieval quality
- Content retrieval quality
- Proper refusals for out-of-scope questions
A shortened version (leaving out many of the test cases) can be found here; it is the version we will use throughout this guide.
Opening the Config
Download the shortened version to a project directory, open that directory in VSCode, and open the config file.
Part 2: Config Structure Walkthrough
Let’s break down each section of the config.
1. Description
description: "Test MBU Databeskyttelses AI"
Simple text describing what this evaluation tests.
2. Prompts
prompts:
- "{{question}}"
The prompt template. Here we send the question directly to the LLM. The {{question}} variable gets replaced with actual test questions from the tests section.
We could have wrapped the variable in surrounding prompt text to build a template, but the "model" (provider) we are evaluating already adds its own system prompt and reference material. By keeping the prompt template as clean as above, we simulate a user in the OpenWebUI chat interface writing their initial message to the "model" (chat assistant).
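For illustration only, a wrapped template (deliberately not used here) could look like this; the exact wording is hypothetical:
prompts:
  - "Du er en hjælpsom assistent med viden om databeskyttelse. Besvar følgende spørgsmål: {{question}}"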
3. Providers
This is where we use our custom OWUI provider instead of the raw HTTP provider from Session 2.
providers:
- id: file:///app/providers/owui.js
label: "DEV - MBU databeskyttelsesassistent"
config:
apiEndpointEnvironmentVariable: "DEV_OWUI_ENDPOINT"
apiKeyEnvironmentVariable: "DEV_OWUI_API_KEY"
model: "databeskyttelse-mbu"
outputSources: true
- id: file:///app/providers/owui.js
label: "STG - MBU databeskyttelsesassistent"
config:
apiEndpointEnvironmentVariable: "STG_OWUI_ENDPOINT"
apiKeyEnvironmentVariable: "STG_OWUI_API_KEY"
model: "databeskyttelse-mbu"
outputSources: true
Key differences from Session 2 (HTTP provider):
| Feature | HTTP Provider (Session 2) | OWUI Provider (Session 3) |
|---|---|---|
| Config | Manual url, headers, body | Simplified config with env variable names |
| API Keys | Exposed in Authorization header | Read securely from .env file |
| Sources | Requires a custom implementation | outputSources: true includes RAG sources |
| Response | Full JSON object (or plain text after transformResponse) | Object with text and sources |
Download the custom provider
The custom provider owui.js is kept up to date in ITK Dev's prompfoo-docker GitHub repository. A copy of the provider file is available here for download (note: this copy might not reflect the current provider in the prompfoo-docker repo, but it will work with the shortened version of the MBU Databeskyttelse config file).
Download the owui.js file to the same directory where short-mbu-databeskyttelses-eval-config.yaml is stored.
Now update the id path in the config so that the providers list looks like this:
providers:
- id: file://owui.js # updated path
label: "DEV - MBU databeskyttelsesassistent"
config:
apiEndpointEnvironmentVariable: "DEV_OWUI_ENDPOINT"
apiKeyEnvironmentVariable: "DEV_OWUI_API_KEY"
model: "databeskyttelse-mbu"
outputSources: true
- id: file://owui.js # updated path
label: "STG - MBU databeskyttelsesassistent"
config:
apiEndpointEnvironmentVariable: "STG_OWUI_ENDPOINT"
apiKeyEnvironmentVariable: "STG_OWUI_API_KEY"
model: "databeskyttelse-mbu"
outputSources: true
Alternative: you could also place the owui.js file in a folder, e.g. providers; the id path would then be file://providers/owui.js
Config options explained:
- apiEndpointEnvironmentVariable: Name of the env variable containing the API endpoint URL
- apiKeyEnvironmentVariable: Name of the env variable containing the API key
- model: The model name in OWUI (e.g., databeskyttelse-mbu)
- outputSources: true: Critical! This tells the provider to include RAG sources in the output
The OWUI provider is a custom provider written specifically for our setup. It handles the complexity of API calls securely. See the OWUI Provider documentation for details.
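For reference, the matching .env file could look like this (the variable names come from the config above; the values are placeholders):
DEV_OWUI_ENDPOINT=https://your-dev-instance.example/api
DEV_OWUI_API_KEY=your-dev-api-key
STG_OWUI_ENDPOINT=https://your-stg-instance.example/api
STG_OWUI_API_KEY=your-stg-api-key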
4. Assertion Templates
assertionTemplates:
refuseToAnswer:
type: regex
value: '.*kontakt(?:er|e)?\s+databeskyttelse@mbu\.aarhus\.dk.*'
transform: "output.text"
metric: refusal
Reusable patterns that can be referenced in multiple tests using $ref. This template checks whether the answer tells the user to contact the data protection team: the regex matches "kontakt", "kontakte", or "kontakter" followed by the address databeskyttelse@mbu.aarhus.dk.
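A test references the template like this (the full pattern is shown in Part 3):
assert:
  - $ref: "#/assertionTemplates/refuseToAnswer"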
Background
The metric key is used to group the different assertions; promptfoo then automatically calculates a pass rate for each metric. Metric names can be chosen freely. In this evaluation config I have chosen the following metrics:
- "refusal": groups the assertions testing whether the answer to the user query is a refusal to answer
- "answer": groups the assertions testing that the answer given by the system corresponds to an ideal answer
- "docRetrival": groups the assertions testing that the correct document(s) are retrieved
- "contentRetrieval": groups the assertions testing that the correct content (from the correct documents) is indeed retrieved by the system
5. Default Test
defaultTest:
options:
rubricPrompt: |
[
{
"role": "system",
"content": "Du vurderer om givne svar er fyldestgørende og korrekte i forhold til ideelle ekspert svar. Svar JSON format: {\"reason\": \"string\", \"pass\": boolean, \"score\": number}. ALLE begrundelser skal være på dansk."
},
{
"role": "user",
"content": "<Givet svar>\n{{ output }}\n</Givet svar>\n\n<Ideelt svar>\n{{ rubric }}\n</Ideelt svar>"
}
]
provider:
id: file:///app/providers/owui.js
label: "judge"
config:
apiEndpointEnvironmentVariable: "DEV_OWUI_ENDPOINT"
apiKeyEnvironmentVariable: "DEV_OWUI_API_KEY"
model: "AarhusAI-default"
outputSources: false
config:
pythonExecutable: .venv/bin/python
Settings that apply to ALL tests:
- rubricPrompt: The prompt used for llm-rubric assertions (in Danish!). This defines how the LLM-as-judge evaluates answers.
- provider: Which LLM to use for judging answers (here: AarhusAI-default).
IMPORTANT
The provider id path also needs to be updated here to reflect where the custom owui.js provider is now located, see Download the custom provider.
IMPORTANT 2
In the config the environment variables are set up with the DEV_ variables. These have not (yet) been configured in this training setup, but the same models are available through the staging site, so change the environment variable references as follows:
- DEV_OWUI_ENDPOINT -> STG_OWUI_ENDPOINT
- DEV_OWUI_API_KEY -> STG_OWUI_API_KEY
- config.pythonExecutable: Path to the Python executable, needed for running custom Python assertions.
TODO:
For the Python assertions to work, we first need to set up a Python environment. We will do this before running the evaluations, see set up python.
These defaults can be overridden in individual tests if needed.
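As a sketch (not part of the MBU config), a single test could override the judge configuration like this; the question and rubric values are hypothetical placeholders:
- vars:
    question: "..."
  options:
    rubricPrompt: "..." # takes precedence over defaultTest.options.rubricPrompt for this test only
  assert:
    - type: llm-rubric
      value: "..."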
Part 3: Test Examples Walkthrough
Let’s examine different test patterns from the MBU config.
Pattern 1: Refusal Test
- vars:
question: Hvornår må vi sende klasselister ud til nye 0-klasser
assert:
- $ref: "#/assertionTemplates/refuseToAnswer"
- type: python
value: file:///app/assertions/refusal_assertion.py
transform: "output.text"
metric: refusal
metadata:
origin: synthetic
refusal: yes
What this tests:
- The question is out of scope (the MBU assistant does not cover when class lists may be sent out)
- The answer should refuse and mention databeskyttelse@mbu.aarhus.dk
- Uses $ref to reuse the regex template
- Also uses the custom refusal_assertion.py to validate the proper refusal format
The Python assertion also contributes to the "refusal" metric introduced in Assertion templates.
Metadata:
Metadata can be used to filter the various tests:
- origin: synthetic = we created this test (not from a real user)
- refusal: yes = this is a refusal test
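Depending on your promptfoo version, metadata can also be used to run only a subset of the tests from the command line; check promptfoo eval --help for the exact flag. A hypothetical invocation:
promptfoo eval --config short-mbu-databeskyttelses-eval-config.yaml --filter-metadata origin=real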
Download the custom python assertions
The custom assertion refusal_assertion.py is kept up to date in ITK Dev's prompfoo-docker GitHub repository together with the other custom assertions. Copies of the custom assertions are available here for download:
(note: these copies might not reflect the current custom assertions in the prompfoo-docker repo, but they will work with the shortened version of the MBU Databeskyttelse config file that we downloaded at the start of this session).
Download all the assertions in the list to the same directory where short-mbu-databeskyttelses-eval-config.yaml is stored.
Now update the value path in the config so that the assertion in the test case looks like this:
type: python
value: file://refusal_assertion.py
transform: "output.text"
metric: refusal
Alternative: you could also place the refusal_assertion.py file (along with the other custom assertions) in a folder, e.g. assertions; the value path would then be file://assertions/refusal_assertion.py
Pattern 2: Answer + Document Retrieval + Content Retrieval
- vars:
question: Hvornår må nye forældre til skolestartere komme i Aula
assert:
- type: icontains-any
value:
- "31. jul"
- "august"
transform: "output.text"
metric: answer
- type: icontains
value: "FAQ om databeskyttelse i Børn og Unge.md"
transform: 'output.sources.map(o => o.reference).join(" ")'
metric: docRetrival
- type: icontains-any
value:
- "Mht. adgang til Aula, må forældrene først komme på til august"
- "styres af Pladsanvisningen som 31/7 fjerner det flueben i elevadministrationssystemet som hindrer overførsel til Aula"
transform: 'output.sources.map(o => o.content).join("\n\n")'
metric: contentRetrieval
metadata:
origin: synthetic
What this tests:
- Answer quality: Must contain “31. jul” or “august”
- Document retrieval: Must retrieve the correct document
- Content retrieval: Must retrieve the correct content chunk
This test exercises the remaining metrics introduced in Assertion templates.
Key concept - Understanding transform
In Session 2 (HTTP provider), we used transformResponse to extract just the text from the API response.
With the OWUI provider’s outputSources: true, the output is an object with two properties:
- output.text: The answer text
- output.sources: Array of retrieved documents, each with reference (title) and content
The transform in assertions processes this output object:
- output.text: just the answer
- output.sources.map(o => o.reference).join(" "): concatenate all document titles
- output.sources.map(o => o.content).join("\n\n"): concatenate all document content
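As a sketch, the output object that a transform receives could look like this (the values are illustrative, based on the test above):
text: "Forældrene får adgang til Aula fra 31. juli ..."
sources:
  - reference: "FAQ om databeskyttelse i Børn og Unge.md"
    content: "Mht. adgang til Aula, må forældrene først komme på til august ..."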
Pattern 3: Custom Python Assertion (Essential claims)
- vars:
question: |-
vedr. klasselister til kommende skolestarter.
Hvor mange oplysninger må der stå på klasselisten?
assert:
- type: icontains
value: "fulde navn"
transform: "output.text"
metric: answer
- type: python
value: file:///app/assertions/essential_claims_assertion.py
transform: "output.text"
config:
essentialClaims:
- "fulde navn"
metric: answer
- type: icontains
value: "FAQ om databeskyttelse i Børn og Unge.md"
transform: 'output.sources.map(o => o.reference).join(" ")'
metric: docRetrival
metadata:
origin: real
What this tests:
- The answer must contain "fulde navn"
- Uses the custom essential_claims_assertion.py to verify that the answer covers the required information
- metadata.origin: real = this is a real user question
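If an ideal answer contains several required pieces of information, the essentialClaims list can simply be extended; the second claim here is a hypothetical example:
config:
  essentialClaims:
    - "fulde navn"
    - "klasse"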
Update the custom python assertions:
You have probably already downloaded the custom assertion essential_claims_assertion.py when going through section Download the custom python assertions.
If the essential_claims_assertion.py file is placed in the same directory where short-mbu-databeskyttelses-eval-config.yaml is stored, the assertion is updated like this:
- type: python
value: file://essential_claims_assertion.py # The path is updated
transform: "output.text"
config:
essentialClaims:
- "fulde navn"
metric: answer
and if essential_claims_assertion.py is placed in e.g. the folder assertions, then the value-path would be file://assertions/essential_claims_assertion.py
Pattern 4: LLM Rubric (Quality Comparison)
- vars:
question: Må vi gerne have en elev, der har svært ved at komme i skole med på fjernundervisning?
assert:
- type: llm-rubric
value: |-
Ja det er tilladt at anvende Google Meet til fjernundervisning. Der må dog intet følsomt, fortroligt, eller stødende, indgå i transmissionen. Hvis eleven er synligt syg anbefales det er slukke for kameraet hos eleven. Der skal være tydeligt for elever i klassen og andre der kommer ind, at der kører et Google Meet. Og i hjemmet hos eleven må der ikke indgå andre i transmissionen end den syge elev. Der må ikke ske optagelse af transmissionen.
weight: 0.1
metric: answer
metadata:
origin: real
What this tests:
- Compares the answer to an ideal response using LLM-as-judge
- weight: 0.1 scales the score: AarhusAI-default scores 0-10, so this normalizes it to 0-1 (a judge score of 8 becomes 0.8, for example)
Pattern 5: Danish Answer Classification (ja/nej assertion)
- vars:
question: Jeg er interesseret i at vide om det er tilladt at sende billeder til forældre via sms?
assert:
- type: python
value: file:///app/assertions/ja_nej_assertion.py
transform: "output.text"
config:
expectedAnswerCategory: Afvisende
metric: answer
metadata:
origin: real
What this tests:
- Classifies the answer as either Bekræftende (affirmative) or Afvisende (refusing)
- Expects an Afvisende answer (this question should be refused)
- Uses custom Python assertion with LLM-as-judge internally
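The same assertion can also test the opposite expectation; for a question that should get an affirmative answer, a hypothetical test would set:
config:
  expectedAnswerCategory: Bekræftende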
Update the custom python assertions:
You have probably already downloaded the custom assertion ja_nej_assertion.py when going through section Download the custom python assertions.
If the ja_nej_assertion.py file is placed in the same directory where short-mbu-databeskyttelses-eval-config.yaml is stored, the assertion is updated like this:
- type: python
value: file://ja_nej_assertion.py # The path is updated
transform: "output.text"
config:
expectedAnswerCategory: Afvisende
metric: answer
and if ja_nej_assertion.py is placed in e.g. the folder assertions, then the value-path would be file://assertions/ja_nej_assertion.py
TIP:
For detailed documentation on all custom assertions, see Custom Assertions.
Reference:
For detailed reference on the various assertions, see Assertions reference.
Part 4: Set up python
A Python environment is needed to run the custom assertions, because they are written in Python.
To set it up, copy the following into the terminal in VSCode (the commands below are for PowerShell on Windows):
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ promptfoo_customizations
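On macOS/Linux the equivalent would be (in that case the pythonExecutable in defaultTest can stay .venv/bin/python, as in the original config):
python -m venv .venv
source .venv/bin/activate
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ promptfoo_customizations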
On Windows, also make sure to update the Python executable path in defaultTest to .venv/Scripts/python.exe, so that the full defaultTest section becomes:
defaultTest:
options:
rubricPrompt: |
[
{
"role": "system",
"content": "Du vurderer om givne svar er fyldestgørende og korrekte i forhold til ideelle ekspert svar. Svar JSON format: {\"reason\": \"string\", \"pass\": boolean, \"score\": number}. ALLE begrundelser skal være på dansk."
},
{
"role": "user",
"content": "<Givet svar>\n\n</Givet svar>\n\n<Ideelt svar>\n\n</Ideelt svar>"
}
]
provider:
id: file://owui.js # updated path
label: "judge"
config:
apiEndpointEnvironmentVariable: "STG_OWUI_ENDPOINT" # updated to STG
apiKeyEnvironmentVariable: "STG_OWUI_API_KEY" # updated to STG
model: "AarhusAI-default"
outputSources: false
config:
pythonExecutable: .venv/Scripts/python.exe # This is updated
Part 5: Running Evaluations
Now we follow the approach from the running evaluation part of the Getting started session.
Step 1: Run the Evaluation
promptfoo eval --config short-mbu-databeskyttelses-eval-config.yaml
Step 2: View Results
To see the latest evaluation:
promptfoo view
This opens the web UI at http://localhost:15500/eval
Step 3: Understanding Results
Each test can have multiple metrics:
- answer: Quality of the actual answer
- docRetrival: Whether the correct documents were found
- contentRetrieval: Whether the correct content chunks were found
- refusal: Whether a proper refusal was given
Small exercises
- Note the overall pass rate and the rates for each metric.
- Click on 3 different tests and examine:
- What question was asked?
- What was the actual answer?
- Which assertions passed/failed?
- What documents were retrieved?
Exercise: Add Your Own Test
- Open short-mbu-databeskyttelses-eval-config.yaml in VSCode
- Find the tests: section (scroll to the end)
- Add a new test case:
- description: 'My first custom test'
vars:
question: 'Hvad er databeskyttelse?'
assert:
- type: icontains
value: 'GDPR'
transform: 'output.text'
metric: answer
metadata:
origin: synthetic
category: 'basic'
- Save the file
- Run the evaluation again:
promptfoo eval --config short-mbu-databeskyttelses-eval-config.yaml
- Check whether your test passes in the web UI
Exercise: Modify the judge LLM prompt
The current judge prompt (called a rubric prompt in promptfoo-lingo) is defined in the defaultTest section
defaultTest:
options:
rubricPrompt: |
[
{
"role": "system",
"content": "Du vurderer om givne svar er fyldestgørende og korrekte i forhold til ideelle ekspert svar. Svar JSON format: {\"reason\": \"string\", \"pass\": boolean, \"score\": number}. ALLE begrundelser skal være på dansk."
},
{
"role": "user",
"content": "<Givet svar>\n{{ output }}\n</Givet svar>\n\n<Ideelt svar>\n{{ rubric }}\n</Ideelt svar>"
}
]
Try to modify it. You could, for example:
- Translate the text into English
- Change the structure of how the LLM output and the ideal answer are presented to the judge-LLM. For instance, the "content" field of the user message could be set to "Givet svar:\n{{ output }}\n\nIdeelt svar:\n{{ rubric }}", or something else, as long as the variables {{ output }} and {{ rubric }} are included. They ensure that the output from the model and the string from the value field (when the rubric assertion is used in a test case, like the example above) are pasted into the prompt.
- Change how the judge is instructed to evaluate the answer. For instance, the "content" field of the system message could be extended to "Du vurderer om givne svar er fyldestgørende og korrekte i forhold til ideelle ekspert svar.\n\nDet er MEGET MEGET VIGTIGT at det givne svar rammer tonen i det ideelle svar.\n\nSvar JSON format: {\"reason\": \"string\", \"pass\": boolean, \"score\": number}. ALLE begrundelser skal være på dansk."
The \n newline character
The \n in the text strings above is called the newline character, and it represents a newline: the "character" usually inserted in a text editor when you hit Enter on your keyboard. It is never displayed directly; instead the cursor and the following text move to the next line in the document.
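As an example, if you chose the English translation, the rubricPrompt might end up like this (a sketch; the exact wording is up to you):
rubricPrompt: |
  [
    {
      "role": "system",
      "content": "You judge whether given answers are complete and correct compared to ideal expert answers. Respond in JSON format: {\"reason\": \"string\", \"pass\": boolean, \"score\": number}. ALL reasons must be in English."
    },
    {
      "role": "user",
      "content": "<Given answer>\n{{ output }}\n</Given answer>\n\n<Ideal answer>\n{{ rubric }}\n</Ideal answer>"
    }
  ]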
After each change to the rubric prompt, make sure to run the evaluations again:
promptfoo eval --config short-mbu-databeskyttelses-eval-config.yaml
and check how (and whether) it affects the scoring and/or the reason provided by the judge-LLM. You can see this in the web UI by finding the corresponding test case, clicking the small magnifying-glass icon, and choosing the second tab.
Bonus: Available Models
To see all available models in our system, visit:
https://stgai.itkdev.dk/api/v1/models/list
This shows all models you can use as:
- Providers: the model being tested. Usually what is called a "model" in OpenWebUI lingo and an "assistent" (assistant) in AarhusAI or os2AI lingo.
- Judges: the model evaluating answers (for llm-rubric and custom Python assertions). This should usually be what is called a base model in OpenWebUI lingo, i.e. an LLM where no additional prompts are added besides what we write.
Next Steps
Now that you understand the evaluation structure:
- ✅ Can read and understand promptfoo configs
- ✅ Know the difference between HTTP and OWUI providers
- ✅ Understand transformResponse vs transform
- ✅ Can run evaluations and interpret results
- ✅ Can add and modify test cases
Ready for Session 4? We’ll create tests for Anne Vibekes Bias-checker from scratch.