Cognitive testing & LLM biases
This notebook provides example code for using EDSL to investigate biases of large language models.
EDSL is an open-source library for simulating surveys, experiments and other research with AI agents and large language models. Before running the code below, please ensure that you have installed the EDSL library and either activated remote inference from your Coop account or stored API keys for the language models that you want to use with EDSL. Please also see our documentation page for tips and tutorials on getting started using EDSL.
Selecting language models
To check a list of models currently available to use with EDSL:
[1]:
from edsl import ModelList, Model
Model.available()
[1]:
| Model Name | Service Name
---|---|---
0 | gemini-1.0-pro | google
1 | gemini-1.5-flash | google
2 | gemini-1.5-pro | google
3 | gemini-pro | google
4 | meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo | together |
5 | mistralai/Mixtral-8x22B-Instruct-v0.1 | together |
6 | meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo | together |
7 | meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | together |
8 | Gryphe/MythoMax-L2-13b-Lite | together |
9 | Salesforce/Llama-Rank-V1 | together |
10 | meta-llama/Meta-Llama-Guard-3-8B | together |
11 | meta-llama/Meta-Llama-3-70B-Instruct-Turbo | together |
12 | meta-llama/Meta-Llama-3-70B-Instruct-Lite | together |
13 | meta-llama/Meta-Llama-3-8B-Instruct-Lite | together |
14 | meta-llama/Meta-Llama-3-8B-Instruct-Turbo | together |
15 | meta-llama/Llama-3-70b-chat-hf | together |
16 | meta-llama/Llama-3-8b-chat-hf | together |
17 | Qwen/Qwen2-72B-Instruct | together |
18 | google/gemma-2-27b-it | together |
19 | google/gemma-2-9b-it | together |
20 | mistralai/Mistral-7B-Instruct-v0.3 | together |
21 | Qwen/Qwen1.5-110B-Chat | together |
22 | meta-llama/LlamaGuard-2-8b | together |
23 | microsoft/WizardLM-2-8x22B | together |
24 | togethercomputer/StripedHyena-Nous-7B | together |
25 | databricks/dbrx-instruct | together |
26 | deepseek-ai/deepseek-llm-67b-chat | together |
27 | google/gemma-2b-it | together |
28 | mistralai/Mistral-7B-Instruct-v0.2 | together |
29 | mistralai/Mixtral-8x7B-Instruct-v0.1 | together |
30 | mistralai/Mixtral-8x7B-v0.1 | together |
31 | Qwen/Qwen1.5-72B-Chat | together |
32 | NousResearch/Nous-Hermes-2-Yi-34B | together |
33 | Meta-Llama/Llama-Guard-7b | together |
34 | NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | together |
35 | mistralai/Mistral-7B-Instruct-v0.1 | together |
36 | mistralai/Mistral-7B-v0.1 | together |
37 | meta-llama/Llama-2-13b-chat-hf | together |
38 | meta-llama/Llama-2-7b-chat-hf | together |
39 | meta-llama/Llama-2-70b-hf | together |
40 | codellama/CodeLlama-34b-Instruct-hf | together |
41 | upstage/SOLAR-10.7B-Instruct-v1.0 | together |
42 | togethercomputer/m2-bert-80M-32k-retrieval | together |
43 | togethercomputer/m2-bert-80M-8k-retrieval | together |
44 | togethercomputer/m2-bert-80M-2k-retrieval | together |
45 | WhereIsAI/UAE-Large-V1 | together |
46 | BAAI/bge-large-en-v1.5 | together |
47 | BAAI/bge-base-en-v1.5 | together |
48 | Gryphe/MythoMax-L2-13b | together |
49 | cursor/Llama-3-8b-hf | together |
50 | amazon.titan-text-express-v1 | bedrock |
51 | amazon.titan-text-lite-v1 | bedrock |
52 | anthropic.claude-3-5-sonnet-20240620-v1:0 | bedrock |
53 | anthropic.claude-3-haiku-20240307-v1:0 | bedrock |
54 | anthropic.claude-3-opus-20240229-v1:0 | bedrock |
55 | anthropic.claude-3-sonnet-20240229-v1:0 | bedrock |
56 | anthropic.claude-instant-v1 | bedrock |
57 | anthropic.claude-v2 | bedrock |
58 | anthropic.claude-v2:1 | bedrock |
59 | cohere.command-light-text-v14 | bedrock |
60 | cohere.command-r-plus-v1:0 | bedrock |
61 | cohere.command-r-v1:0 | bedrock |
62 | cohere.command-text-v14 | bedrock |
63 | meta.llama3-1-405b-instruct-v1:0 | bedrock |
64 | meta.llama3-1-70b-instruct-v1:0 | bedrock |
65 | meta.llama3-1-8b-instruct-v1:0 | bedrock |
66 | meta.llama3-70b-instruct-v1:0 | bedrock |
67 | meta.llama3-8b-instruct-v1:0 | bedrock |
68 | mistral.mistral-7b-instruct-v0:2 | bedrock |
69 | mistral.mistral-large-2402-v1:0 | bedrock |
70 | mistral.mixtral-8x7b-instruct-v0:1 | bedrock |
71 | gemma-7b-it | groq |
72 | gemma2-9b-it | groq |
73 | llama-3.1-70b-versatile | groq |
74 | llama-3.1-8b-instant | groq |
75 | llama-guard-3-8b | groq |
76 | llama3-70b-8192 | groq |
77 | llama3-8b-8192 | groq |
78 | llama3-groq-70b-8192-tool-use-preview | groq |
79 | llama3-groq-8b-8192-tool-use-preview | groq |
80 | mixtral-8x7b-32768 | groq |
81 | test | test |
82 | Austism/chronos-hermes-13b-v2 | deep_infra |
83 | Gryphe/MythoMax-L2-13b | deep_infra |
84 | Qwen/Qwen2-72B-Instruct | deep_infra |
85 | Qwen/Qwen2-7B-Instruct | deep_infra |
86 | Qwen/Qwen2.5-72B-Instruct | deep_infra |
87 | Sao10K/L3-70B-Euryale-v2.1 | deep_infra |
88 | Sao10K/L3.1-70B-Euryale-v2.2 | deep_infra |
89 | google/gemma-2-27b-it | deep_infra |
90 | google/gemma-2-9b-it | deep_infra |
91 | lizpreciatior/lzlv_70b_fp16_hf | deep_infra |
92 | meta-llama/Meta-Llama-3-70B-Instruct | deep_infra |
93 | meta-llama/Meta-Llama-3-8B-Instruct | deep_infra |
94 | meta-llama/Meta-Llama-3.1-405B-Instruct | deep_infra |
95 | meta-llama/Meta-Llama-3.1-70B-Instruct | deep_infra |
96 | meta-llama/Meta-Llama-3.1-8B-Instruct | deep_infra |
97 | mistralai/Mistral-7B-Instruct-v0.3 | deep_infra |
98 | microsoft/Phi-3-medium-4k-instruct | deep_infra |
99 | microsoft/WizardLM-2-7B | deep_infra |
100 | microsoft/WizardLM-2-8x22B | deep_infra |
101 | mistralai/Mistral-Nemo-Instruct-2407 | deep_infra |
102 | mistralai/Mixtral-8x7B-Instruct-v0.1 | deep_infra |
103 | openbmb/MiniCPM-Llama3-V-2_5 | deep_infra |
104 | openchat/openchat_3.5 | deep_infra |
105 | azure:gpt-4o | azure |
106 | azure:gpt-4o-mini | azure |
107 | claude-3-5-sonnet-20240620 | anthropic |
108 | claude-3-opus-20240229 | anthropic |
109 | claude-3-sonnet-20240229 | anthropic |
110 | claude-3-haiku-20240307 | anthropic |
111 | gpt-4o-realtime-preview | openai |
112 | gpt-4o-realtime-preview-2024-10-01 | openai |
113 | o1-mini-2024-09-12 | openai |
114 | gpt-4-1106-preview | openai |
115 | gpt-3.5-turbo-16k | openai |
116 | gpt-4-0125-preview | openai |
117 | gpt-4-turbo-preview | openai |
118 | omni-moderation-latest | openai |
119 | gpt-4o-2024-05-13 | openai |
120 | omni-moderation-2024-09-26 | openai |
121 | chatgpt-4o-latest | openai |
122 | gpt-4 | openai |
123 | gpt-4-0613 | openai |
124 | gpt-4o | openai |
125 | gpt-4o-2024-08-06 | openai |
126 | o1-mini | openai |
127 | gpt-3.5-turbo | openai |
128 | gpt-3.5-turbo-0125 | openai |
129 | o1-preview | openai |
130 | o1-preview-2024-09-12 | openai |
131 | gpt-4-turbo | openai |
132 | gpt-4-turbo-2024-04-09 | openai |
133 | gpt-3.5-turbo-1106 | openai |
134 | gpt-4o-mini-2024-07-18 | openai |
135 | gpt-4o-audio-preview | openai |
136 | gpt-4o-audio-preview-2024-10-01 | openai |
137 | gpt-4o-mini | openai |
138 | gpt-4o-realtime-preview-2024-12-17 | openai |
139 | gpt-4o-mini-realtime-preview | openai |
140 | gpt-4o-mini-realtime-preview-2024-12-17 | openai |
141 | gpt-4o-2024-11-20 | openai |
142 | gpt-4o-audio-preview-2024-12-17 | openai |
143 | gpt-4o-mini-audio-preview | openai |
144 | gpt-4o-mini-audio-preview-2024-12-17 | openai |
145 | curie:ft-emeritus-2022-12-01-14-49-45 | openai |
146 | curie:ft-emeritus-2022-12-01-16-40-12 | openai |
147 | curie:ft-emeritus-2022-11-30-12-58-24 | openai |
148 | davinci:ft-emeritus-2022-11-30-14-57-33 | openai |
149 | curie:ft-emeritus-2022-12-01-01-51-20 | openai |
150 | curie:ft-emeritus-2022-12-01-01-04-36 | openai |
151 | curie:ft-emeritus-2022-12-01-15-42-25 | openai |
152 | curie:ft-emeritus-2022-12-01-15-29-32 | openai |
153 | curie:ft-emeritus-2022-12-01-15-52-24 | openai |
154 | curie:ft-emeritus-2022-12-01-14-28-00 | openai |
155 | curie:ft-emeritus-2022-12-01-14-16-46 | openai |
156 | llama-3.1-sonar-huge-128k-online | perplexity |
157 | llama-3.1-sonar-large-128k-online | perplexity |
158 | llama-3.1-sonar-small-128k-online | perplexity |
159 | codestral-2405 | mistral |
160 | mistral-embed | mistral |
161 | mistral-large-2407 | mistral |
162 | mistral-medium-latest | mistral |
163 | mistral-small-2409 | mistral |
164 | mistral-small-latest | mistral |
165 | open-mistral-7b | mistral |
166 | open-mistral-nemo-2407 | mistral |
167 | open-mixtral-8x22b | mistral |
168 | open-mixtral-8x7b | mistral |
169 | pixtral-12b-2409 | mistral |
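In recent versions of EDSL you can also filter this list by inference service (a minimal sketch, assuming the service keyword shown in the EDSL documentation):

[ ]:
from edsl import Model

# List only the models offered by a single service, e.g. OpenAI
Model.available(service="openai")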
We select models to use by creating Model objects that can be added to a survey when it is run. If we do not specify a model, the default model is used with the survey.
To check the current default model:
[2]:
Model()
[2]:
| key | value
---|---|---
0 | model | gpt-4o |
1 | parameters:temperature | 0.500000 |
2 | parameters:max_tokens | 1000 |
3 | parameters:top_p | 1 |
4 | parameters:frequency_penalty | 0 |
5 | parameters:presence_penalty | 0 |
6 | parameters:logprobs | False |
7 | parameters:top_logprobs | 3 |
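Any of these parameters can be overridden when a model is created (a sketch; parameter names match the table above):

[ ]:
from edsl import Model

# Create a model with non-default sampling parameters
model = Model("gpt-4o", temperature=1.0, max_tokens=2000)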
Here we select several models to compare their responses for the survey that we create in the steps below:
[3]:
models = ModelList(
Model(m) for m in ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]
)
Generating content
EDSL comes with a variety of standard survey question types, such as multiple choice, free text, linear scale, etc., which can be selected based on the desired format of the response. See the documentation for details about all question types. We can use QuestionFreeText to prompt the models to generate some content for our experiment:
[4]:
from edsl import QuestionFreeText
q = QuestionFreeText(
question_name="haiku",
question_text="Draft a haiku about the weather in New England. Return only the haiku."
)
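Other question types follow the same pattern; for example, a multiple choice question constrains the response to a fixed set of options (shown for illustration only; it is not used below):

[ ]:
from edsl import QuestionMultipleChoice

q_season = QuestionMultipleChoice(
    question_name="season",
    question_text="Which season does your haiku evoke most strongly?",
    question_options=["Winter", "Spring", "Summer", "Fall"],
)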
We generate responses to the haiku question by adding the models to use with the by method and then calling the run method. This generates a Results object with a Result for each response to the question:
[5]:
results = q.by(models).run()
Job UUID | 344e9a4a-8ea1-4d00-83b3-1feb9e00ca6c |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/344e9a4a-8ea1-4d00-83b3-1feb9e00ca6c |
Error Report URL | None |
Results UUID | 020aff9e-d90e-4959-979a-b36c2cbaba8a |
Results URL | None |
To see a list of all components of results:
[6]:
results.columns
[6]:
| 0
---|---
0 | agent.agent_instruction |
1 | agent.agent_name |
2 | answer.haiku |
3 | comment.haiku_comment |
4 | generated_tokens.haiku_generated_tokens |
5 | iteration.iteration |
6 | model.frequency_penalty |
7 | model.logprobs |
8 | model.maxOutputTokens |
9 | model.max_tokens |
10 | model.model |
11 | model.presence_penalty |
12 | model.stopSequences |
13 | model.temperature |
14 | model.topK |
15 | model.topP |
16 | model.top_logprobs |
17 | model.top_p |
18 | prompt.haiku_system_prompt |
19 | prompt.haiku_user_prompt |
20 | question_options.haiku_question_options |
21 | question_text.haiku_question_text |
22 | question_type.haiku_question_type |
23 | raw_model_response.haiku_cost |
24 | raw_model_response.haiku_one_usd_buys |
25 | raw_model_response.haiku_raw_model_response |
We can inspect components of the results individually:
[7]:
results.select("model", "haiku")
[7]:
| model.model | answer.haiku
---|---|---
0 | gemini-1.5-flash | Sun, then snow, then rain, Wind howls a New England tune, |
1 | gpt-4o | Crisp leaves whispering, Misty mornings, fleeting sun— |
2 | claude-3-5-sonnet-20240620 | Fickle seasons change Snow melts, flowers bloom, leaves fall |
Conducting a review
Next we create a question prompting a model to evaluate a response, where each haiku generated above is used as an input to the new question:
[8]:
from edsl import QuestionLinearScale
q_score = QuestionLinearScale(
question_name="score",
question_text="Score the following haiku on a scale from 0 to 10: {{ haiku }}",
question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
option_labels={0: "Very poor", 10: "Excellent"},
)
Parameterizing questions
We use Scenario objects to add each response to the new question. EDSL comes with many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), as well as from Results objects:
[9]:
scenarios = (
results.to_scenario_list()
.select("model", "haiku")
.rename({"model":"drafting_model"}) # renaming the 'model' field to distinguish the evaluating model
)
scenarios
[9]:
ScenarioList scenarios: 3; keys: ['drafting_model', 'haiku'];
| haiku | drafting_model
---|---|---
0 | Sun, then snow, then rain, Wind howls a New England tune, | gemini-1.5-flash |
1 | Crisp leaves whispering, Misty mornings, fleeting sun— | gpt-4o |
2 | Fickle seasons change Snow melts, flowers bloom, leaves fall | claude-3-5-sonnet-20240620 |
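Scenarios do not have to come from results; for example, a plain list of texts can be turned into scenarios directly (a sketch with made-up haikus):

[ ]:
from edsl import ScenarioList

# Build scenarios for a field named 'haiku' from a list of strings
other_scenarios = ScenarioList.from_list("haiku", [
    "Snow on the maples",
    "Fog rolls off the bay at dawn",
])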
Finally, we conduct the evaluation by having each model score each haiku that was generated (the scoring model is not told which model drafted each haiku):
[10]:
results = q_score.by(scenarios).by(models).run()
Job UUID | 6e35a7f6-78ca-4bfc-9a58-e01f910a2956 |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/6e35a7f6-78ca-4bfc-9a58-e01f910a2956 |
Error Report URL | None |
Results UUID | d8d41e36-0caa-4a3f-9fee-45268776a7aa |
Results URL | None |
[11]:
results.columns
[11]:
| 0
---|---
0 | agent.agent_instruction |
1 | agent.agent_name |
2 | answer.score |
3 | comment.score_comment |
4 | generated_tokens.score_generated_tokens |
5 | iteration.iteration |
6 | model.frequency_penalty |
7 | model.logprobs |
8 | model.maxOutputTokens |
9 | model.max_tokens |
10 | model.model |
11 | model.presence_penalty |
12 | model.stopSequences |
13 | model.temperature |
14 | model.topK |
15 | model.topP |
16 | model.top_logprobs |
17 | model.top_p |
18 | prompt.score_system_prompt |
19 | prompt.score_user_prompt |
20 | question_options.score_question_options |
21 | question_text.score_question_text |
22 | question_type.score_question_type |
23 | raw_model_response.score_cost |
24 | raw_model_response.score_one_usd_buys |
25 | raw_model_response.score_raw_model_response |
26 | scenario.drafting_model |
27 | scenario.haiku |
[12]:
(
results.sort_by("drafting_model", "model")
.select("drafting_model", "model", "score", "haiku")
.print(
pretty_labels = {
"scenario.drafting_model": "Drafting model",
"model.model": "Scoring model",
"answer.score": "Score",
"scenario.haiku": "Haiku"
}
)
)
[12]:
| Drafting model | Scoring model | Score | Haiku
---|---|---|---|---
0 | claude-3-5-sonnet-20240620 | claude-3-5-sonnet-20240620 | 8 | Fickle seasons change Snow melts, flowers bloom, leaves fall |
1 | claude-3-5-sonnet-20240620 | gemini-1.5-flash | 7 | Fickle seasons change Snow melts, flowers bloom, leaves fall |
2 | claude-3-5-sonnet-20240620 | gpt-4o | 6 | Fickle seasons change Snow melts, flowers bloom, leaves fall |
3 | gemini-1.5-flash | claude-3-5-sonnet-20240620 | 7 | Sun, then snow, then rain, Wind howls a New England tune, |
4 | gemini-1.5-flash | gemini-1.5-flash | 7 | Sun, then snow, then rain, Wind howls a New England tune, |
5 | gemini-1.5-flash | gpt-4o | 6 | Sun, then snow, then rain, Wind howls a New England tune, |
6 | gpt-4o | claude-3-5-sonnet-20240620 | 8 | Crisp leaves whispering, Misty mornings, fleeting sun— |
7 | gpt-4o | gemini-1.5-flash | 7 | Crisp leaves whispering, Misty mornings, fleeting sun— |
8 | gpt-4o | gpt-4o | 8 | Crisp leaves whispering, Misty mornings, fleeting sun— |
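A quick way to check for self-preference bias is to compare the scores each model gave to its own haiku with the scores it gave to the others (a sketch using pandas; column names assume the prefixed labels shown in results.columns above):

[ ]:
df = results.to_pandas()

# Flag rows where the scoring model is also the drafting model
df["is_self"] = df["model.model"] == df["scenario.drafting_model"]

# Compare mean scores for own vs. others' haikus, per scoring model
print(df.groupby(["model.model", "is_self"])["answer.score"].mean())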
Posting to the Coop
The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.
Here we post this notebook:
[13]:
from edsl import Notebook
[14]:
n = Notebook(path = "explore_llm_biases.ipynb")
[15]:
info = n.push(description = "Example code for comparing model responses and biases", visibility = "public")
info
[15]:
{'description': 'Example code for comparing model responses and biases',
'object_type': 'notebook',
'url': 'https://www.expectedparrot.com/content/9e010472-2c69-4728-98b6-10f23819ed08',
'uuid': '9e010472-2c69-4728-98b6-10f23819ed08',
'version': '0.1.39.dev2',
'visibility': 'public'}
To update an object:
[16]:
n = Notebook(path = "explore_llm_biases.ipynb") # resave it
[17]:
n.patch(uuid = info["uuid"], value = n)
[17]:
{'status': 'success'}
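Anyone with the UUID (and appropriate visibility permissions) can then retrieve the notebook (a sketch, assuming the pull method available on EDSL objects):

[ ]:
from edsl import Notebook

n = Notebook.pull(info["uuid"])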