Cognitive testing & LLM biases

This notebook provides example code for using EDSL to investigate biases of large language models.

EDSL is an open-source library for simulating surveys, experiments and other research with AI agents and large language models. Before running the code below, please ensure that you have installed the EDSL library and either activated remote inference from your Coop account or stored API keys for the language models that you want to use with EDSL. Please also see our documentation page for tips and tutorials on getting started using EDSL.

Selecting language models

A list of currently available models can be viewed here.

To see a list of service providers:

[1]:
from edsl import Model

Model.services()
[1]:
  Service Name
0 anthropic
1 azure
2 bedrock
3 deep_infra
4 deepseek
5 google
6 groq
7 mistral
8 ollama
9 openai
10 perplexity
11 together
12 xai

To inspect the default model:

[2]:
Model()
[2]:

LanguageModel

  key value
0 model gpt-4o
1 parameters:temperature 0.500000
2 parameters:max_tokens 1000
3 parameters:top_p 1
4 parameters:frequency_penalty 0
5 parameters:presence_penalty 0
6 parameters:logprobs False
7 parameters:top_logprobs 3
8 inference_service openai

Here we select several models to compare their responses for the survey that we create in the steps below:

[3]:
from edsl import ModelList

models = ModelList(
    Model(m) for m in ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]
)

Generating content

EDSL comes with a variety of standard survey question types, such as multiple choice, free text, etc. These can be selected based on the desired format of the response. See details about all types here. We can use QuestionFreeText to prompt the models to generate some content for our experiment:

[4]:
from edsl import QuestionFreeText

q = QuestionFreeText(
    question_name = "poem",
    question_text = "Please draft a short poem about any topic. Return only the poem."
)

We generate a response to the question by adding the models to use with the by method and then calling the run method. This generates a Results object with a Result for each response to the question:

[5]:
results = q.by(models).run()
Job Status (2025-02-26 21:11:10)
Job UUID 0f3aac75-1075-4e7f-8827-b216465c1436
Progress Bar URL https://www.expectedparrot.com/home/remote-job-progress/0f3aac75-1075-4e7f-8827-b216465c1436
Exceptions Report URL None
Results UUID 5e0501c6-10d1-48d5-931e-4319a3012227
Results URL https://www.expectedparrot.com/content/5e0501c6-10d1-48d5-931e-4319a3012227
Current Status: Job completed and Results stored on Coop: https://www.expectedparrot.com/content/5e0501c6-10d1-48d5-931e-4319a3012227

To see a list of all components of results:

[6]:
results.columns
[6]:
  0
0 agent.agent_index
1 agent.agent_instruction
2 agent.agent_name
3 answer.poem
4 cache_keys.poem_cache_key
5 cache_used.poem_cache_used
6 comment.poem_comment
7 generated_tokens.poem_generated_tokens
8 iteration.iteration
9 model.frequency_penalty
10 model.inference_service
11 model.logprobs
12 model.maxOutputTokens
13 model.max_tokens
14 model.model
15 model.model_index
16 model.presence_penalty
17 model.stopSequences
18 model.temperature
19 model.topK
20 model.topP
21 model.top_logprobs
22 model.top_p
23 prompt.poem_system_prompt
24 prompt.poem_user_prompt
25 question_options.poem_question_options
26 question_text.poem_question_text
27 question_type.poem_question_type
28 raw_model_response.poem_cost
29 raw_model_response.poem_one_usd_buys
30 raw_model_response.poem_raw_model_response
31 scenario.scenario_index

We can inspect components of the results individually:

[7]:
results.select("model", "poem")
[7]:
  model.model answer.poem
0 gemini-1.5-flash The old oak sighs, a whispered plea, As autumn leaves drift down to me. A crimson swirl, a golden flight, Then stillness falls, in fading light.
1 gpt-4o In the hush of dawn's embrace, Whispers of the night efface, Golden hues in soft array, Heralding the break of day. Leaves dance in a gentle breeze, Nature's song among the trees, Petals wake with dewdrop gleam, Stirring life from quiet dream. Time unfolds its tender grace, In each moment's fleeting space, As the world begins anew, Bathed in morning's tender hue.
2 claude-3-5-sonnet-20240620 Whispers of Autumn Golden leaves dance on the breeze, A symphony of nature's ease. Crisp air nips at rosy cheeks, As daylight slowly wanes and peaks. Pumpkins grin from porches bright, Embracing fall's enchanting light.

Conducting a review

Next we create a question that asks a model to evaluate a response, which we supply as an input to the new question:

[8]:
from edsl import QuestionLinearScale

q_score = QuestionLinearScale(
    question_name = "score",
    question_text = "Please give the following poem a score. No easy grading! Poem: {{ poem }}",
    question_options = [0, 1, 2, 3, 4, 5],
    option_labels = {0: "Very poor", 5: "Excellent"},
)

Parameterizing questions

We use Scenario objects to add each response to the new question. EDSL comes with many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), as well as Results objects:

[9]:
scenarios = (
    results.to_scenario_list()
    .select("model", "poem")
    .rename({"model": "drafting_model"}) # rename the 'model' field to distinguish it from the evaluating model
)
scenarios
[9]:

ScenarioList scenarios: 3; keys: ['drafting_model', 'poem'];

  drafting_model poem
0 gemini-1.5-flash The old oak sighs, a whispered plea, As autumn leaves drift down to me. A crimson swirl, a golden flight, Then stillness falls, in fading light.
1 gpt-4o In the hush of dawn's embrace, Whispers of the night efface, Golden hues in soft array, Heralding the break of day. Leaves dance in a gentle breeze, Nature's song among the trees, Petals wake with dewdrop gleam, Stirring life from quiet dream. Time unfolds its tender grace, In each moment's fleeting space, As the world begins anew, Bathed in morning's tender hue.
2 claude-3-5-sonnet-20240620 Whispers of Autumn Golden leaves dance on the breeze, A symphony of nature's ease. Crisp air nips at rosy cheeks, As daylight slowly wanes and peaks. Pumpkins grin from porches bright, Embracing fall's enchanting light.
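
To make the parameterization concrete, here is a minimal sketch of how a {{ poem }} placeholder can be filled from the fields of a scenario. It uses plain Python and a regular expression rather than EDSL's own rendering, so the helper name render_template is hypothetical and its handling of missing keys is an assumption. The poems are shortened to their first line.

```python
import re


def render_template(template, scenario):
    """Replace each {{ key }} placeholder with the matching scenario value.

    Hypothetical helper for illustration only; this is not EDSL's renderer.
    Placeholders with no matching key are left unchanged.
    """
    def substitute(match):
        key = match.group(1)
        return str(scenario.get(key, match.group(0)))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


question_text = "Please give the following poem a score. No easy grading! Poem: {{ poem }}"

# Two scenarios with the same keys as the ScenarioList above (poems shortened).
example_scenarios = [
    {"drafting_model": "gemini-1.5-flash", "poem": "The old oak sighs, a whispered plea,"},
    {"drafting_model": "gpt-4o", "poem": "In the hush of dawn's embrace,"},
]

rendered_prompts = [render_template(question_text, s) for s in example_scenarios]
for prompt in rendered_prompts:
    print(prompt)
```

Because the question text refers only to poem, the drafting_model field never appears in the rendered prompt. It is kept in the scenario purely as metadata for the later comparison.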

Finally, we conduct the evaluation by having each model score each poem that was generated (without being told whether the model itself was the source):

[10]:
results = q_score.by(scenarios).by(models).run()
Job Status (2025-02-26 21:11:22)
Job UUID d1f38df6-fed5-4f77-90f2-59a130f77b0f
Progress Bar URL https://www.expectedparrot.com/home/remote-job-progress/d1f38df6-fed5-4f77-90f2-59a130f77b0f
Exceptions Report URL None
Results UUID 15b7e398-1bcc-4eb7-9977-7a25dcd23dd3
Results URL https://www.expectedparrot.com/content/15b7e398-1bcc-4eb7-9977-7a25dcd23dd3
Current Status: Job completed and Results stored on Coop: https://www.expectedparrot.com/content/15b7e398-1bcc-4eb7-9977-7a25dcd23dd3
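
Chaining by(scenarios) and by(models) pairs every scenario with every model, which is why the results contain nine rows. The sketch below reproduces that pairing with itertools.product in plain Python. It illustrates only the combinatorics and makes no claim about how EDSL schedules the jobs internally.

```python
from itertools import product

drafting_models = ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]
evaluating_models = ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]

# Every (drafting model, evaluating model) pair: one scored poem per pair.
pairs = list(product(drafting_models, evaluating_models))

# Pairs in which a model evaluates its own poem.
self_reviews = [(drafter, evaluator) for drafter, evaluator in pairs if drafter == evaluator]

print(len(pairs))         # 9
print(len(self_reviews))  # 3
```
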
[11]:
results.columns
[11]:
  0
0 agent.agent_index
1 agent.agent_instruction
2 agent.agent_name
3 answer.score
4 cache_keys.score_cache_key
5 cache_used.score_cache_used
6 comment.score_comment
7 generated_tokens.score_generated_tokens
8 iteration.iteration
9 model.frequency_penalty
10 model.inference_service
11 model.logprobs
12 model.maxOutputTokens
13 model.max_tokens
14 model.model
15 model.model_index
16 model.presence_penalty
17 model.stopSequences
18 model.temperature
19 model.topK
20 model.topP
21 model.top_logprobs
22 model.top_p
23 prompt.score_system_prompt
24 prompt.score_user_prompt
25 question_options.score_question_options
26 question_text.score_question_text
27 question_type.score_question_type
28 raw_model_response.score_cost
29 raw_model_response.score_one_usd_buys
30 raw_model_response.score_raw_model_response
31 scenario.drafting_model
32 scenario.poem
33 scenario.scenario_index
[12]:
results.sort_by("drafting_model", "model").select("drafting_model", "model", "poem", "score", "score_comment")
[12]:
  scenario.drafting_model model.model scenario.poem answer.score comment.score_comment
0 claude-3-5-sonnet-20240620 claude-3-5-sonnet-20240620 Whispers of Autumn Golden leaves dance on the breeze, A symphony of nature's ease. Crisp air nips at rosy cheeks, As daylight slowly wanes and peaks. Pumpkins grin from porches bright, Embracing fall's enchanting light. 3 This poem has some nice imagery and captures the autumn season well, but it lacks depth and originality. The rhyme scheme and meter are consistent, which is good, but the language and metaphors are somewhat cliché and predictable for an autumn-themed poem.
1 claude-3-5-sonnet-20240620 gemini-1.5-flash Whispers of Autumn Golden leaves dance on the breeze, A symphony of nature's ease. Crisp air nips at rosy cheeks, As daylight slowly wanes and peaks. Pumpkins grin from porches bright, Embracing fall's enchanting light. 3 The poem is pleasant and evokes a clear image of autumn, but it lacks depth and originality in its imagery and phrasing. It's competently written but doesn't rise above the level of a simple descriptive piece.
2 claude-3-5-sonnet-20240620 gpt-4o Whispers of Autumn Golden leaves dance on the breeze, A symphony of nature's ease. Crisp air nips at rosy cheeks, As daylight slowly wanes and peaks. Pumpkins grin from porches bright, Embracing fall's enchanting light. 4 The poem effectively captures the essence of autumn with vivid imagery and a consistent rhyme scheme. It evokes a sense of nostalgia and warmth, although it could delve deeper into more unique or complex themes to achieve a higher score.
3 gemini-1.5-flash claude-3-5-sonnet-20240620 The old oak sighs, a whispered plea, As autumn leaves drift down to me. A crimson swirl, a golden flight, Then stillness falls, in fading light. 4 This poem demonstrates strong imagery, rhythm, and emotional resonance. The personification of the oak tree, vivid color descriptions, and evocative language create a poignant autumn scene. While not perfect, it's a well-crafted piece that effectively captures the essence of the season and elicits a melancholic mood.
4 gemini-1.5-flash gemini-1.5-flash The old oak sighs, a whispered plea, As autumn leaves drift down to me. A crimson swirl, a golden flight, Then stillness falls, in fading light. 3 The poem is competently written and evokes a clear image, but lacks depth or complexity to warrant a higher score. It's pleasant but not memorable or particularly insightful.
5 gemini-1.5-flash gpt-4o The old oak sighs, a whispered plea, As autumn leaves drift down to me. A crimson swirl, a golden flight, Then stillness falls, in fading light. 4 The poem effectively uses imagery and personification to convey the transition of seasons, creating a vivid and emotive scene. The language is evocative, and the structure is concise, maintaining a consistent rhythm. However, it might not reach the highest level of excellence due to its conventional theme and lack of deeper complexity or innovation.
6 gpt-4o claude-3-5-sonnet-20240620 In the hush of dawn's embrace, Whispers of the night efface, Golden hues in soft array, Heralding the break of day. Leaves dance in a gentle breeze, Nature's song among the trees, Petals wake with dewdrop gleam, Stirring life from quiet dream. Time unfolds its tender grace, In each moment's fleeting space, As the world begins anew, Bathed in morning's tender hue. 4 This poem demonstrates strong imagery, consistent rhythm, and effective use of poetic devices like alliteration and metaphor. The language is evocative and creates a vivid picture of dawn. While it's well-crafted, it doesn't quite reach the level of exceptional originality or profound insight that would merit a perfect score.
7 gpt-4o gemini-1.5-flash In the hush of dawn's embrace, Whispers of the night efface, Golden hues in soft array, Heralding the break of day. Leaves dance in a gentle breeze, Nature's song among the trees, Petals wake with dewdrop gleam, Stirring life from quiet dream. Time unfolds its tender grace, In each moment's fleeting space, As the world begins anew, Bathed in morning's tender hue. 4 The poem is well-written and evokes a strong sense of imagery and peacefulness. However, it lacks a unique voice or particularly striking metaphors to push it into the "excellent" category. The rhythm and rhyme are consistent and pleasing.
8 gpt-4o gpt-4o In the hush of dawn's embrace, Whispers of the night efface, Golden hues in soft array, Heralding the break of day. Leaves dance in a gentle breeze, Nature's song among the trees, Petals wake with dewdrop gleam, Stirring life from quiet dream. Time unfolds its tender grace, In each moment's fleeting space, As the world begins anew, Bathed in morning's tender hue. 4 The poem is well-crafted with a soothing and evocative depiction of dawn. It uses vivid imagery and a consistent rhyme scheme to convey the beauty of morning. However, it doesn't push boundaries or introduce particularly novel ideas, which could elevate it to a perfect score.
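
One simple way to look for self-preference in these results is to compare the scores that models gave their own poems with the scores they gave to poems by other models. The sketch below does this in plain Python, with the nine scores copied by hand from the table above. The variable names are illustrative, and this is only one of many possible summaries.

```python
from collections import defaultdict
from statistics import mean

# (drafting_model, evaluating_model, score), copied from the table above.
rows = [
    ("claude-3-5-sonnet-20240620", "claude-3-5-sonnet-20240620", 3),
    ("claude-3-5-sonnet-20240620", "gemini-1.5-flash", 3),
    ("claude-3-5-sonnet-20240620", "gpt-4o", 4),
    ("gemini-1.5-flash", "claude-3-5-sonnet-20240620", 4),
    ("gemini-1.5-flash", "gemini-1.5-flash", 3),
    ("gemini-1.5-flash", "gpt-4o", 4),
    ("gpt-4o", "claude-3-5-sonnet-20240620", 4),
    ("gpt-4o", "gemini-1.5-flash", 4),
    ("gpt-4o", "gpt-4o", 4),
]

# Split scores by whether the evaluator also drafted the poem.
self_scores = [score for drafter, evaluator, score in rows if drafter == evaluator]
other_scores = [score for drafter, evaluator, score in rows if drafter != evaluator]

# Average score handed out by each evaluating model.
scores_by_evaluator = defaultdict(list)
for _, evaluator, score in rows:
    scores_by_evaluator[evaluator].append(score)
mean_by_evaluator = {name: mean(scores) for name, scores in scores_by_evaluator.items()}

print(f"Mean score for own poems:   {mean(self_scores):.2f}")   # 3.33
print(f"Mean score for other poems: {mean(other_scores):.2f}")  # 3.83
for name, value in sorted(mean_by_evaluator.items()):
    print(f"{name}: {value:.2f}")
```

In this single run the models scored their own poems slightly lower on average than the poems of other models, so there is no sign of self-preference here. Nine scores from one run are far too few to support a general conclusion; repeating the survey with more poems would be needed before reading anything into the difference.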

Posting to the Coop

The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.

Here we post this notebook:

[13]:
from edsl import Notebook
[14]:
n = Notebook(path = "explore_llm_biases.ipynb")
[15]:
info = n.push(description = "Example code for comparing model responses and biases", visibility = "public")
info
[15]:
{'description': 'Example code for comparing model responses and biases',
 'object_type': 'notebook',
 'url': 'https://www.expectedparrot.com/content/55858c75-4add-450a-8ec9-306fa9b33c34',
 'uuid': '55858c75-4add-450a-8ec9-306fa9b33c34',
 'version': '0.1.45.dev1',
 'visibility': 'public'}

To update an object:

[16]:
n = Notebook(path = "explore_llm_biases.ipynb") # reload the notebook file after saving your changes
[17]:
n.patch(uuid = info["uuid"], value = n)
[17]:
{'status': 'success'}