Cognitive testing & LLM biases

This notebook provides example code for using EDSL to investigate biases of large language models.

EDSL is an open-source library for simulating surveys, experiments and other research with AI agents and large language models. Before running the code below, please ensure that you have installed the EDSL library and either activated remote inference from your Coop account or stored API keys for the language models that you want to use with EDSL. Please also see our documentation page for tips and tutorials on getting started using EDSL.

Selecting language models

A list of currently available models can be viewed here.

To see a list of service providers:

[1]:
from edsl import Model

Model.services()
[1]:
  Service Name
0 anthropic
1 azure
2 bedrock
3 deep_infra
4 deepseek
5 google
6 groq
7 mistral
8 ollama
9 openai
10 perplexity
11 together
12 xai

To inspect the default model:

[2]:
Model()
[2]:

LanguageModel

  key value
0 model gpt-4o
1 parameters:temperature 0.500000
2 parameters:max_tokens 1000
3 parameters:top_p 1
4 parameters:frequency_penalty 0
5 parameters:presence_penalty 0
6 parameters:logprobs False
7 parameters:top_logprobs 3
8 inference_service openai

Here we select several models so that we can compare their responses to the survey that we create in the steps below:

[3]:
from edsl import ModelList

models = ModelList(
    Model(m) for m in ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]
)

Generating content

EDSL comes with a variety of standard survey question types, such as multiple choice, free text, etc. These can be selected based on the desired format of the response. See details about all types here. We can use QuestionFreeText to prompt the models to generate some content for our experiment:

[4]:
from edsl import QuestionFreeText

q = QuestionFreeText(
    question_name = "poem",
    question_text = "Please draft a short poem about any topic. Return only the poem."
)

We generate a response to the question by adding the models to use with the by method and then calling the run method. This generates a Results object with a Result for each response to the question:

[5]:
results = q.by(models).run()
Job Status (2025-03-03 10:02:10)
Job UUID b4404544-707b-45ac-b760-88972d721faa
Progress Bar URL https://www.expectedparrot.com/home/remote-job-progress/b4404544-707b-45ac-b760-88972d721faa
Exceptions Report URL None
Results UUID 19ff5ead-92e7-4d3d-898c-954b07363f41
Results URL https://www.expectedparrot.com/content/19ff5ead-92e7-4d3d-898c-954b07363f41
Current Status: Job completed and Results stored on Coop: https://www.expectedparrot.com/content/19ff5ead-92e7-4d3d-898c-954b07363f41

To see a list of all components of results:

[6]:
results.columns
[6]:
  0
0 agent.agent_index
1 agent.agent_instruction
2 agent.agent_name
3 answer.poem
4 cache_keys.poem_cache_key
5 cache_used.poem_cache_used
6 comment.poem_comment
7 generated_tokens.poem_generated_tokens
8 iteration.iteration
9 model.frequency_penalty
10 model.inference_service
11 model.logprobs
12 model.maxOutputTokens
13 model.max_tokens
14 model.model
15 model.model_index
16 model.presence_penalty
17 model.stopSequences
18 model.temperature
19 model.topK
20 model.topP
21 model.top_logprobs
22 model.top_p
23 prompt.poem_system_prompt
24 prompt.poem_user_prompt
25 question_options.poem_question_options
26 question_text.poem_question_text
27 question_type.poem_question_type
28 raw_model_response.poem_cost
29 raw_model_response.poem_one_usd_buys
30 raw_model_response.poem_raw_model_response
31 scenario.scenario_index

We can inspect components of the results individually:

[7]:
results.select("model", "poem")
[7]:
  model.model answer.poem
0 gemini-1.5-flash The old oak sighs, a whispered plea, Of sun-drenched days and memory. Its leaves, like coins, fall to the ground, A rustling song, without a sound.
1 gpt-4o In the hush of dawn's embrace, Where whispers dance on morning's face, A gentle breeze begins to weave, Stories that the night must leave. Petals wake with dewdrop dreams, Reflecting light in golden streams. The world, anew, in softest hues, Paints a canvas, fresh and true. Birds compose their morning song, A symphony where hearts belong. Nature's chorus, pure and clear, Fills the air with hope and cheer. In this moment, time stands still, A promise held in every thrill. The day unfolds, a tender grace, In the hush of dawn's embrace.
2 claude-3-5-sonnet-20240620 Whispers of Autumn Golden leaves dance on the breeze, A crisp chill nips at my knees. Pumpkins grin with candlelit faces, As nature dons her russet laces. Harvest moon hangs low and bright, Guiding spirits through the night. Autumn's spell, so bittersweet, Makes time both linger and fleet.

Conducting a review

Next we create a question that asks a model to evaluate a response, which we pass in as an input to the new question:

[8]:
from edsl import QuestionLinearScale

q_score = QuestionLinearScale(
    question_name = "score",
    question_text = "Please give the following poem a score. No easy grading! Poem: {{ scenario.poem }}",
    question_options = [0, 1, 2, 3, 4, 5],
    option_labels = {0: "Very poor", 5: "Excellent"},
)

Parameterizing questions

We use Scenario objects to add each response to the new question. EDSL comes with many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), as well as Results objects:

[9]:
scenarios = (
    results.to_scenario_list()
    .select("model", "poem")
    .rename({"model": "drafting_model"}) # renaming the 'model' field to distinguish the evaluating model
)
scenarios
[9]:

ScenarioList scenarios: 3; keys: ['drafting_model', 'poem'];

  drafting_model poem
0 gemini-1.5-flash The old oak sighs, a whispered plea, Of sun-drenched days and memory. Its leaves, like coins, fall to the ground, A rustling song, without a sound.
1 gpt-4o In the hush of dawn's embrace, Where whispers dance on morning's face, A gentle breeze begins to weave, Stories that the night must leave. Petals wake with dewdrop dreams, Reflecting light in golden streams. The world, anew, in softest hues, Paints a canvas, fresh and true. Birds compose their morning song, A symphony where hearts belong. Nature's chorus, pure and clear, Fills the air with hope and cheer. In this moment, time stands still, A promise held in every thrill. The day unfolds, a tender grace, In the hush of dawn's embrace.
2 claude-3-5-sonnet-20240620 Whispers of Autumn Golden leaves dance on the breeze, A crisp chill nips at my knees. Pumpkins grin with candlelit faces, As nature dons her russet laces. Harvest moon hangs low and bright, Guiding spirits through the night. Autumn's spell, so bittersweet, Makes time both linger and fleet.
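To make the parameterization concrete, here is a minimal sketch of how a scenario value ends up in the question text. EDSL renders the placeholder itself when the job runs; the `render_prompt` helper below is a hypothetical stand-in that uses plain string replacement, and the poem is shortened to its first two lines for the example. Note that `drafting_model` is part of the scenario but does not appear in the question text, so the evaluating model is not told which model wrote the poem.

```python
# Stand-in for the templating step. EDSL performs the real rendering when a job
# runs; this plain string substitution is only an illustration (an assumption,
# not EDSL's implementation).
QUESTION_TEXT = (
    "Please give the following poem a score. No easy grading! "
    "Poem: {{ scenario.poem }}"
)


def render_prompt(template: str, scenario: dict) -> str:
    """Replace each '{{ scenario.<key> }}' placeholder with the scenario's value."""
    rendered = template
    for key, value in scenario.items():
        rendered = rendered.replace("{{ scenario." + key + " }}", str(value))
    return rendered


# One scenario with the same keys as the ScenarioList above (poem shortened).
example_scenario = {
    "drafting_model": "gemini-1.5-flash",
    "poem": "The old oak sighs, a whispered plea, Of sun-drenched days and memory.",
}

prompt = render_prompt(QUESTION_TEXT, example_scenario)
print(prompt)
```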

Finally, we conduct the evaluation by having each model score each poem that was generated (without information about whether the model itself was the source):

[10]:
results = q_score.by(scenarios).by(models).run()
Job Status (2025-03-03 10:02:45)
Job UUID 05143eed-0aeb-4858-a548-68b28603ab7a
Progress Bar URL https://www.expectedparrot.com/home/remote-job-progress/05143eed-0aeb-4858-a548-68b28603ab7a
Exceptions Report URL None
Results UUID 5707dc54-8612-429d-9f2c-7c5b979e4de1
Results URL https://www.expectedparrot.com/content/5707dc54-8612-429d-9f2c-7c5b979e4de1
Current Status: Job completed and Results stored on Coop: https://www.expectedparrot.com/content/5707dc54-8612-429d-9f2c-7c5b979e4de1
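Combining three scenarios with three models produces one result for every (drafting model, evaluating model) pair. The sketch below only counts those pairs with the standard library and does not call EDSL; the variable names are illustrative.

```python
from itertools import product

# The same three models act as poem authors and as evaluators.
drafting_models = ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]
evaluating_models = ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]

# Every scenario is paired with every model.
pairs = list(product(drafting_models, evaluating_models))

# Pairs in which a model scores its own poem.
self_reviews = [(d, e) for d, e in pairs if d == e]

print(len(pairs), len(self_reviews))  # 9 3
```

Three of the nine results are therefore self-evaluations, which are the rows of interest when looking for self-preference.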
[11]:
results.columns
[11]:
  0
0 agent.agent_index
1 agent.agent_instruction
2 agent.agent_name
3 answer.score
4 cache_keys.score_cache_key
5 cache_used.score_cache_used
6 comment.score_comment
7 generated_tokens.score_generated_tokens
8 iteration.iteration
9 model.frequency_penalty
10 model.inference_service
11 model.logprobs
12 model.maxOutputTokens
13 model.max_tokens
14 model.model
15 model.model_index
16 model.presence_penalty
17 model.stopSequences
18 model.temperature
19 model.topK
20 model.topP
21 model.top_logprobs
22 model.top_p
23 prompt.score_system_prompt
24 prompt.score_user_prompt
25 question_options.score_question_options
26 question_text.score_question_text
27 question_type.score_question_type
28 raw_model_response.score_cost
29 raw_model_response.score_one_usd_buys
30 raw_model_response.score_raw_model_response
31 scenario.drafting_model
32 scenario.poem
33 scenario.scenario_index
[12]:
results.sort_by("drafting_model", "model").select("drafting_model", "model", "poem", "score", "score_comment")
[12]:
  scenario.drafting_model model.model scenario.poem answer.score comment.score_comment
0 claude-3-5-sonnet-20240620 claude-3-5-sonnet-20240620 Whispers of Autumn Golden leaves dance on the breeze, A crisp chill nips at my knees. Pumpkins grin with candlelit faces, As nature dons her russet laces. Harvest moon hangs low and bright, Guiding spirits through the night. Autumn's spell, so bittersweet, Makes time both linger and fleet. 4 This poem demonstrates strong imagery and evocative language, capturing the essence of autumn effectively. The rhyme scheme and meter are consistent, and there are some nice poetic devices like alliteration and personification. While it's a well-crafted poem, it doesn't quite reach the level of excellence due to its somewhat conventional approach to the subject matter.
1 claude-3-5-sonnet-20240620 gemini-1.5-flash Whispers of Autumn Golden leaves dance on the breeze, A crisp chill nips at my knees. Pumpkins grin with candlelit faces, As nature dons her russet laces. Harvest moon hangs low and bright, Guiding spirits through the night. Autumn's spell, so bittersweet, Makes time both linger and fleet. 3 The poem is competently written, employing imagery and rhythm effectively to evoke the feeling of autumn. However, it lacks originality or depth; the imagery is fairly standard for autumnal poems, and the "bittersweet" sentiment is somewhat cliché. It's well-crafted but not particularly memorable or innovative.
2 claude-3-5-sonnet-20240620 gpt-4o Whispers of Autumn Golden leaves dance on the breeze, A crisp chill nips at my knees. Pumpkins grin with candlelit faces, As nature dons her russet laces. Harvest moon hangs low and bright, Guiding spirits through the night. Autumn's spell, so bittersweet, Makes time both linger and fleet. 4 The poem captures the essence of autumn beautifully with vivid imagery and a rhythmic flow. It effectively conveys the season's atmosphere and emotions, though it could benefit from more depth or unique perspective to achieve an "excellent" score.
3 gemini-1.5-flash claude-3-5-sonnet-20240620 The old oak sighs, a whispered plea, Of sun-drenched days and memory. Its leaves, like coins, fall to the ground, A rustling song, without a sound. 4 This poem demonstrates strong imagery, effective use of metaphor, and a pleasing rhythm. The personification of the oak tree and the comparison of leaves to coins are evocative. The final line creates a nice paradox. While not perfect, it's a well-crafted short poem.
4 gemini-1.5-flash gemini-1.5-flash The old oak sighs, a whispered plea, Of sun-drenched days and memory. Its leaves, like coins, fall to the ground, A rustling song, without a sound. 4 The poem uses strong imagery ("sun-drenched days," "leaves, like coins," "rustling song, without a sound") and personification ("old oak sighs, a whispered plea") to create a evocative mood. While not groundbreaking, it's well-crafted and emotionally resonant.
5 gemini-1.5-flash gpt-4o The old oak sighs, a whispered plea, Of sun-drenched days and memory. Its leaves, like coins, fall to the ground, A rustling song, without a sound. 4 The poem effectively uses imagery and personification to evoke a sense of nostalgia and the passage of time. The comparison of leaves to coins is particularly vivid, and the phrase "rustling song, without a sound" adds an intriguing paradox. However, the poem is relatively short and could benefit from further development to achieve a higher score.
6 gpt-4o claude-3-5-sonnet-20240620 In the hush of dawn's embrace, Where whispers dance on morning's face, A gentle breeze begins to weave, Stories that the night must leave. Petals wake with dewdrop dreams, Reflecting light in golden streams. The world, anew, in softest hues, Paints a canvas, fresh and true. Birds compose their morning song, A symphony where hearts belong. Nature's chorus, pure and clear, Fills the air with hope and cheer. In this moment, time stands still, A promise held in every thrill. The day unfolds, a tender grace, In the hush of dawn's embrace. 4 This poem demonstrates strong imagery, consistent rhythm, and effective use of literary devices like alliteration and metaphor. The language is evocative and creates a vivid sensory experience. While it's a well-crafted piece, it doesn't quite reach the level of excellence due to its somewhat conventional theme and imagery.
7 gpt-4o gemini-1.5-flash In the hush of dawn's embrace, Where whispers dance on morning's face, A gentle breeze begins to weave, Stories that the night must leave. Petals wake with dewdrop dreams, Reflecting light in golden streams. The world, anew, in softest hues, Paints a canvas, fresh and true. Birds compose their morning song, A symphony where hearts belong. Nature's chorus, pure and clear, Fills the air with hope and cheer. In this moment, time stands still, A promise held in every thrill. The day unfolds, a tender grace, In the hush of dawn's embrace. 4 The poem is well-written and evokes a strong sense of imagery and emotion. The language is beautiful and the structure is consistent. However, it lacks a certain level of originality or complexity to reach a "5". There's a predictability to the imagery and metaphors that prevents it from being truly exceptional.
8 gpt-4o gpt-4o In the hush of dawn's embrace, Where whispers dance on morning's face, A gentle breeze begins to weave, Stories that the night must leave. Petals wake with dewdrop dreams, Reflecting light in golden streams. The world, anew, in softest hues, Paints a canvas, fresh and true. Birds compose their morning song, A symphony where hearts belong. Nature's chorus, pure and clear, Fills the air with hope and cheer. In this moment, time stands still, A promise held in every thrill. The day unfolds, a tender grace, In the hush of dawn's embrace. 5 The poem beautifully captures the serene and hopeful atmosphere of dawn with vivid imagery and a rhythmic flow. Each stanza contributes to a cohesive narrative that evokes a sense of peace and renewal, making it an excellent piece.
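One simple way to look for self-preference in these results is to compare the scores that models gave their own poems with the scores they gave poems by other models. The sketch below works on the scores copied by hand from the table above rather than on the Results object, so it runs without EDSL. With one poem per model and a single run, the sample is far too small to support conclusions; it only illustrates the comparison.

```python
from statistics import mean

# (drafting_model, evaluating_model, score), copied from the results table above.
rows = [
    ("claude-3-5-sonnet-20240620", "claude-3-5-sonnet-20240620", 4),
    ("claude-3-5-sonnet-20240620", "gemini-1.5-flash", 3),
    ("claude-3-5-sonnet-20240620", "gpt-4o", 4),
    ("gemini-1.5-flash", "claude-3-5-sonnet-20240620", 4),
    ("gemini-1.5-flash", "gemini-1.5-flash", 4),
    ("gemini-1.5-flash", "gpt-4o", 4),
    ("gpt-4o", "claude-3-5-sonnet-20240620", 4),
    ("gpt-4o", "gemini-1.5-flash", 4),
    ("gpt-4o", "gpt-4o", 5),
]

# Split the scores by whether the evaluator also wrote the poem.
self_scores = [score for drafter, evaluator, score in rows if drafter == evaluator]
other_scores = [score for drafter, evaluator, score in rows if drafter != evaluator]

print(round(mean(self_scores), 2), round(mean(other_scores), 2))  # 4.33 3.83
```

In this run the average self-assigned score (4.33) is higher than the average score given to other models' poems (3.83), driven by gpt-4o awarding its own poem the only 5 and gemini-1.5-flash awarding the only 3 to another model's poem.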

Posting to the Coop

The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.

Here we post this notebook to the Coop, or update the existing copy if it has already been posted:

[ ]:
from edsl import Notebook

nb = Notebook(path = "explore_llm_biases.ipynb")

if refresh := False:
    nb.push(
        description = "Example code for comparing model responses and biases",
        alias = "explore-llm-biases-notebook",
        visibility = "public"
    )
else:
    nb.patch("https://www.expectedparrot.com/content/RobinHorton/explore-llm-biases-notebook", value = nb)