Cognitive testing & LLM biases
This notebook provides example code for using EDSL to investigate biases of large language models.
EDSL is an open-source library for simulating surveys, experiments and other research with AI agents and large language models. Before running the code below, please ensure that you have installed the EDSL library and either activated remote inference from your Coop account or stored API keys for the language models that you want to use with EDSL. Please also see our documentation page for tips and tutorials on getting started using EDSL.
Selecting language models
A list of currently available models can be viewed here.
To see a list of service providers:
[1]:
from edsl import Model
Model.services()
[1]:
Service Name | |
---|---|
0 | anthropic |
1 | azure |
2 | bedrock |
3 | deep_infra |
4 | deepseek |
5 | |
6 | groq |
7 | mistral |
8 | ollama |
9 | openai |
10 | perplexity |
11 | together |
12 | xai |
To inspect the default model:
[2]:
Model()
[2]:
key | value | |
---|---|---|
0 | model | gpt-4o |
1 | parameters:temperature | 0.500000 |
2 | parameters:max_tokens | 1000 |
3 | parameters:top_p | 1 |
4 | parameters:frequency_penalty | 0 |
5 | parameters:presence_penalty | 0 |
6 | parameters:logprobs | False |
7 | parameters:top_logprobs | 3 |
8 | inference_service | openai |
Here we select several models to compare their responses for the survey that we create in the steps below:
[3]:
from edsl import ModelList
models = ModelList(
    Model(m) for m in ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]
)
Generating content
EDSL comes with a variety of standard survey question types, such as multiple choice, free text, etc. These can be selected based on the desired format of the response. See details about all types here. We can use QuestionFreeText to prompt the models to generate some content for our experiment:
[4]:
from edsl import QuestionFreeText
q = QuestionFreeText(
    question_name = "poem",
    question_text = "Please draft a short poem about any topic. Return only the poem."
)
We generate a response to the question by adding the models with the by method and then calling the run method. This generates a Results object with a Result for each response to the question:
[5]:
results = q.by(models).run()
Job UUID | 0f3aac75-1075-4e7f-8827-b216465c1436 |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/0f3aac75-1075-4e7f-8827-b216465c1436 |
Exceptions Report URL | None |
Results UUID | 5e0501c6-10d1-48d5-931e-4319a3012227 |
Results URL | https://www.expectedparrot.com/content/5e0501c6-10d1-48d5-931e-4319a3012227 |
To see a list of all components of results:
[6]:
results.columns
[6]:
0 | |
---|---|
0 | agent.agent_index |
1 | agent.agent_instruction |
2 | agent.agent_name |
3 | answer.poem |
4 | cache_keys.poem_cache_key |
5 | cache_used.poem_cache_used |
6 | comment.poem_comment |
7 | generated_tokens.poem_generated_tokens |
8 | iteration.iteration |
9 | model.frequency_penalty |
10 | model.inference_service |
11 | model.logprobs |
12 | model.maxOutputTokens |
13 | model.max_tokens |
14 | model.model |
15 | model.model_index |
16 | model.presence_penalty |
17 | model.stopSequences |
18 | model.temperature |
19 | model.topK |
20 | model.topP |
21 | model.top_logprobs |
22 | model.top_p |
23 | prompt.poem_system_prompt |
24 | prompt.poem_user_prompt |
25 | question_options.poem_question_options |
26 | question_text.poem_question_text |
27 | question_type.poem_question_type |
28 | raw_model_response.poem_cost |
29 | raw_model_response.poem_one_usd_buys |
30 | raw_model_response.poem_raw_model_response |
31 | scenario.scenario_index |
We can inspect components of the results individually:
[7]:
results.select("model", "poem")
[7]:
model.model | answer.poem | |
---|---|---|
0 | gemini-1.5-flash | The old oak sighs, a whispered plea, As autumn leaves drift down to me. A crimson swirl, a golden flight, Then stillness falls, in fading light. |
1 | gpt-4o | In the hush of dawn's embrace, Whispers of the night efface, Golden hues in soft array, Heralding the break of day. Leaves dance in a gentle breeze, Nature's song among the trees, Petals wake with dewdrop gleam, Stirring life from quiet dream. Time unfolds its tender grace, In each moment's fleeting space, As the world begins anew, Bathed in morning's tender hue. |
2 | claude-3-5-sonnet-20240620 | Whispers of Autumn Golden leaves dance on the breeze, A symphony of nature's ease. Crisp air nips at rosy cheeks, As daylight slowly wanes and peaks. Pumpkins grin from porches bright, Embracing fall's enchanting light. |
Conducting a review
Next we create a question that asks a model to evaluate a response, which we pass in as an input to the new question:
[8]:
from edsl import QuestionLinearScale
q_score = QuestionLinearScale(
    question_name = "score",
    question_text = "Please give the following poem a score. No easy grading! Poem: {{ poem }}",
    question_options = [0, 1, 2, 3, 4, 5],
    option_labels = {0: "Very poor", 5: "Excellent"},
)
Parameterizing questions
We use Scenario objects to add each response to the new question. EDSL comes with many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), as well as from Results objects:
[9]:
scenarios = (
    results.to_scenario_list()
    .select("model", "poem")
    .rename({"model": "drafting_model"}) # renaming the 'model' field to distinguish the evaluating model
)
scenarios
[9]:
ScenarioList scenarios: 3; keys: ['drafting_model', 'poem'];
drafting_model | poem | |
---|---|---|
0 | gemini-1.5-flash | The old oak sighs, a whispered plea, As autumn leaves drift down to me. A crimson swirl, a golden flight, Then stillness falls, in fading light. |
1 | gpt-4o | In the hush of dawn's embrace, Whispers of the night efface, Golden hues in soft array, Heralding the break of day. Leaves dance in a gentle breeze, Nature's song among the trees, Petals wake with dewdrop gleam, Stirring life from quiet dream. Time unfolds its tender grace, In each moment's fleeting space, As the world begins anew, Bathed in morning's tender hue. |
2 | claude-3-5-sonnet-20240620 | Whispers of Autumn Golden leaves dance on the breeze, A symphony of nature's ease. Crisp air nips at rosy cheeks, As daylight slowly wanes and peaks. Pumpkins grin from porches bright, Embracing fall's enchanting light. |
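To make the role of the {{ poem }} placeholder concrete, here is a minimal sketch of how each scenario could fill the question text. It uses plain string replacement from the standard library as a stand-in and does not call EDSL, so EDSL's own rendering may differ in detail. The names example_scenarios and fill are hypothetical, and each poem is cut to its opening line for brevity.

```python
# Stand-in for template rendering: replace `{{ key }}` with the scenario value.
# This is an illustration only, not EDSL's implementation.
question_text = (
    "Please give the following poem a score. No easy grading! Poem: {{ poem }}"
)

# Hypothetical records with the same keys as the ScenarioList above;
# only the opening line of each poem is kept.
example_scenarios = [
    {"drafting_model": "gemini-1.5-flash",
     "poem": "The old oak sighs, a whispered plea,"},
    {"drafting_model": "gpt-4o",
     "poem": "In the hush of dawn's embrace,"},
    {"drafting_model": "claude-3-5-sonnet-20240620",
     "poem": "Golden leaves dance on the breeze,"},
]


def fill(template: str, scenario: dict) -> str:
    """Return the template with every `{{ key }}` replaced by its value."""
    text = template
    for key, value in scenario.items():
        text = text.replace("{{ " + key + " }}", str(value))
    return text


# One prompt per scenario. The drafting_model key has no placeholder in the
# question text, so it is carried along without appearing in the prompt.
prompts = [fill(question_text, s) for s in example_scenarios]
for prompt in prompts:
    print(prompt)
```

Because drafting_model is never referenced in the question text, the evaluating model is not told which model wrote the poem.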
Finally, we conduct the evaluation by having each model score each poem that was generated (without information about whether the model itself was the source):
[10]:
results = q_score.by(scenarios).by(models).run()
Job UUID | d1f38df6-fed5-4f77-90f2-59a130f77b0f |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/d1f38df6-fed5-4f77-90f2-59a130f77b0f |
Exceptions Report URL | None |
Results UUID | 15b7e398-1bcc-4eb7-9977-7a25dcd23dd3 |
Results URL | https://www.expectedparrot.com/content/15b7e398-1bcc-4eb7-9977-7a25dcd23dd3 |
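Combining three scenarios with three models yields one result per (poem, evaluator) pair. The sketch below enumerates those pairs with the standard library only, without calling EDSL. The variable names are illustrative.

```python
from itertools import product

model_names = ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]

# Each pair is (drafting_model, evaluating_model).
pairs = list(product(model_names, model_names))

# Pairs where a model scores its own poem.
self_reviews = [(drafter, evaluator) for drafter, evaluator in pairs
                if drafter == evaluator]

print(len(pairs), len(self_reviews))  # prints: 9 3
```

Three of the nine evaluations are self-reviews, which are the cases of interest when looking for self-preference.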
[11]:
results.columns
[11]:
0 | |
---|---|
0 | agent.agent_index |
1 | agent.agent_instruction |
2 | agent.agent_name |
3 | answer.score |
4 | cache_keys.score_cache_key |
5 | cache_used.score_cache_used |
6 | comment.score_comment |
7 | generated_tokens.score_generated_tokens |
8 | iteration.iteration |
9 | model.frequency_penalty |
10 | model.inference_service |
11 | model.logprobs |
12 | model.maxOutputTokens |
13 | model.max_tokens |
14 | model.model |
15 | model.model_index |
16 | model.presence_penalty |
17 | model.stopSequences |
18 | model.temperature |
19 | model.topK |
20 | model.topP |
21 | model.top_logprobs |
22 | model.top_p |
23 | prompt.score_system_prompt |
24 | prompt.score_user_prompt |
25 | question_options.score_question_options |
26 | question_text.score_question_text |
27 | question_type.score_question_type |
28 | raw_model_response.score_cost |
29 | raw_model_response.score_one_usd_buys |
30 | raw_model_response.score_raw_model_response |
31 | scenario.drafting_model |
32 | scenario.poem |
33 | scenario.scenario_index |
[12]:
results.sort_by("drafting_model", "model").select("drafting_model", "model", "poem", "score", "score_comment")
[12]:
scenario.drafting_model | model.model | scenario.poem | answer.score | comment.score_comment | |
---|---|---|---|---|---|
0 | claude-3-5-sonnet-20240620 | claude-3-5-sonnet-20240620 | Whispers of Autumn Golden leaves dance on the breeze, A symphony of nature's ease. Crisp air nips at rosy cheeks, As daylight slowly wanes and peaks. Pumpkins grin from porches bright, Embracing fall's enchanting light. | 3 | This poem has some nice imagery and captures the autumn season well, but it lacks depth and originality. The rhyme scheme and meter are consistent, which is good, but the language and metaphors are somewhat cliché and predictable for an autumn-themed poem. |
1 | claude-3-5-sonnet-20240620 | gemini-1.5-flash | Whispers of Autumn Golden leaves dance on the breeze, A symphony of nature's ease. Crisp air nips at rosy cheeks, As daylight slowly wanes and peaks. Pumpkins grin from porches bright, Embracing fall's enchanting light. | 3 | The poem is pleasant and evokes a clear image of autumn, but it lacks depth and originality in its imagery and phrasing. It's competently written but doesn't rise above the level of a simple descriptive piece. |
2 | claude-3-5-sonnet-20240620 | gpt-4o | Whispers of Autumn Golden leaves dance on the breeze, A symphony of nature's ease. Crisp air nips at rosy cheeks, As daylight slowly wanes and peaks. Pumpkins grin from porches bright, Embracing fall's enchanting light. | 4 | The poem effectively captures the essence of autumn with vivid imagery and a consistent rhyme scheme. It evokes a sense of nostalgia and warmth, although it could delve deeper into more unique or complex themes to achieve a higher score. |
3 | gemini-1.5-flash | claude-3-5-sonnet-20240620 | The old oak sighs, a whispered plea, As autumn leaves drift down to me. A crimson swirl, a golden flight, Then stillness falls, in fading light. | 4 | This poem demonstrates strong imagery, rhythm, and emotional resonance. The personification of the oak tree, vivid color descriptions, and evocative language create a poignant autumn scene. While not perfect, it's a well-crafted piece that effectively captures the essence of the season and elicits a melancholic mood. |
4 | gemini-1.5-flash | gemini-1.5-flash | The old oak sighs, a whispered plea, As autumn leaves drift down to me. A crimson swirl, a golden flight, Then stillness falls, in fading light. | 3 | The poem is competently written and evokes a clear image, but lacks depth or complexity to warrant a higher score. It's pleasant but not memorable or particularly insightful. |
5 | gemini-1.5-flash | gpt-4o | The old oak sighs, a whispered plea, As autumn leaves drift down to me. A crimson swirl, a golden flight, Then stillness falls, in fading light. | 4 | The poem effectively uses imagery and personification to convey the transition of seasons, creating a vivid and emotive scene. The language is evocative, and the structure is concise, maintaining a consistent rhythm. However, it might not reach the highest level of excellence due to its conventional theme and lack of deeper complexity or innovation. |
6 | gpt-4o | claude-3-5-sonnet-20240620 | In the hush of dawn's embrace, Whispers of the night efface, Golden hues in soft array, Heralding the break of day. Leaves dance in a gentle breeze, Nature's song among the trees, Petals wake with dewdrop gleam, Stirring life from quiet dream. Time unfolds its tender grace, In each moment's fleeting space, As the world begins anew, Bathed in morning's tender hue. | 4 | This poem demonstrates strong imagery, consistent rhythm, and effective use of poetic devices like alliteration and metaphor. The language is evocative and creates a vivid picture of dawn. While it's well-crafted, it doesn't quite reach the level of exceptional originality or profound insight that would merit a perfect score. |
7 | gpt-4o | gemini-1.5-flash | In the hush of dawn's embrace, Whispers of the night efface, Golden hues in soft array, Heralding the break of day. Leaves dance in a gentle breeze, Nature's song among the trees, Petals wake with dewdrop gleam, Stirring life from quiet dream. Time unfolds its tender grace, In each moment's fleeting space, As the world begins anew, Bathed in morning's tender hue. | 4 | The poem is well-written and evokes a strong sense of imagery and peacefulness. However, it lacks a unique voice or particularly striking metaphors to push it into the "excellent" category. The rhythm and rhyme are consistent and pleasing. |
8 | gpt-4o | gpt-4o | In the hush of dawn's embrace, Whispers of the night efface, Golden hues in soft array, Heralding the break of day. Leaves dance in a gentle breeze, Nature's song among the trees, Petals wake with dewdrop gleam, Stirring life from quiet dream. Time unfolds its tender grace, In each moment's fleeting space, As the world begins anew, Bathed in morning's tender hue. | 4 | The poem is well-crafted with a soothing and evocative depiction of dawn. It uses vivid imagery and a consistent rhyme scheme to convey the beauty of morning. However, it doesn't push boundaries or introduce particularly novel ideas, which could elevate it to a perfect score. |
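One simple way to look for self-preference in these results is to compare the scores models gave their own poems with the scores they gave poems by other models. The sketch below copies the nine scores from the table above into plain Python tuples, so it runs without EDSL. The variable names are illustrative.

```python
from statistics import mean

# (drafting_model, evaluating_model, score), copied from the table above.
rows = [
    ("claude-3-5-sonnet-20240620", "claude-3-5-sonnet-20240620", 3),
    ("claude-3-5-sonnet-20240620", "gemini-1.5-flash", 3),
    ("claude-3-5-sonnet-20240620", "gpt-4o", 4),
    ("gemini-1.5-flash", "claude-3-5-sonnet-20240620", 4),
    ("gemini-1.5-flash", "gemini-1.5-flash", 3),
    ("gemini-1.5-flash", "gpt-4o", 4),
    ("gpt-4o", "claude-3-5-sonnet-20240620", 4),
    ("gpt-4o", "gemini-1.5-flash", 4),
    ("gpt-4o", "gpt-4o", 4),
]

# Scores a model gave to its own poem versus to other models' poems.
self_scores = [s for drafter, evaluator, s in rows if drafter == evaluator]
other_scores = [s for drafter, evaluator, s in rows if drafter != evaluator]

print(round(mean(self_scores), 2))   # prints: 3.33
print(round(mean(other_scores), 2))  # prints: 3.83
```

In this single run the self-assigned scores are not higher than the others, but nine scores from one iteration are too few to draw a conclusion either way.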
Posting to the Coop
The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.
Here we post this notebook:
[13]:
from edsl import Notebook
[14]:
n = Notebook(path = "explore_llm_biases.ipynb")
[15]:
info = n.push(description = "Example code for comparing model responses and biases", visibility = "public")
info
[15]:
{'description': 'Example code for comparing model responses and biases',
'object_type': 'notebook',
'url': 'https://www.expectedparrot.com/content/55858c75-4add-450a-8ec9-306fa9b33c34',
'uuid': '55858c75-4add-450a-8ec9-306fa9b33c34',
'version': '0.1.45.dev1',
'visibility': 'public'}
To update an object:
[16]:
n = Notebook(path = "explore_llm_biases.ipynb") # resave it
[17]:
n.patch(uuid = info["uuid"], value = n)
[17]:
{'status': 'success'}