Cognitive testing & LLM biases
This notebook provides example code for using EDSL to investigate biases of large language models.
EDSL is an open-source library for simulating surveys, experiments and other research with AI agents and large language models. Before running the code below, please ensure that you have installed the EDSL library and either activated remote inference from your Coop account or stored API keys for the language models that you want to use with EDSL. Please also see our documentation page for tips and tutorials on getting started using EDSL.
Selecting language models
To check a list of models currently available to use with EDSL:
[1]:
from edsl import ModelList, Model
# Model.available # uncomment and run this code
We select models to use by creating Model objects that can be added to a survey when it is run. If we do not specify a model, the default model is used with the survey.
To check the current default model:
[2]:
# Model() # uncomment and run this code
Here we select several models to compare their responses for the survey that we create in the steps below:
[3]:
models = ModelList(
Model(m) for m in ["gemini-pro", "gpt-4o", "claude-3-5-sonnet-20240620"]
)
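Model objects can also take optional inference parameters such as temperature. A minimal sketch (the value below is illustrative, not a recommendation):
[ ]:
# Optional: set inference parameters when constructing a model
# (the temperature value here is illustrative)
creative_model = Model("gpt-4o", temperature=0.9)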
Generating content
EDSL comes with a variety of standard survey question types, such as multiple choice, free text, etc. These can be selected based on the desired format of the response. See details about all question types here. We can use QuestionFreeText to prompt the models to generate some content for our experiment:
[4]:
from edsl import QuestionFreeText
q = QuestionFreeText(
question_name="haiku",
question_text="Draft a haiku about the weather in New England. Return only the haiku."
)
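Other question types are constructed the same way. For example, a multiple choice version of a related question (this sketch is illustrative and is not used in the rest of the notebook):
[ ]:
from edsl import QuestionMultipleChoice

# Illustrative multiple choice question (not used in the steps below)
q_season = QuestionMultipleChoice(
    question_name="season",
    question_text="Which season best captures the weather in New England?",
    question_options=["Winter", "Spring", "Summer", "Fall"],
)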
We generate responses to the question by adding the models with the by method and then calling the run method. This generates a Results object containing a Result for each response to the question:
[5]:
results = q.by(models).run()
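A Results object behaves like a list of Result objects, so a quick sanity check (assuming all three models above responded) is to confirm that we have one result per model:
[ ]:
len(results) # expected: 3, one Result per model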
To see a list of all components of results:
[6]:
# results.columns # uncomment and run this code
We can inspect components of the results individually:
[7]:
results.select("model", "haiku").print(format="rich")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ model                      ┃ answer                             ┃
┃ .model                     ┃ .haiku                             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ gpt-4o                     │ Maple leaves flutter,              │
│                            │ Mist dances on cool breeze,        │
├────────────────────────────┼────────────────────────────────────┤
│ gemini-pro                 │ Snow falls soft and white,         │
│                            │ Spring brings rain, summer's heat, │
├────────────────────────────┼────────────────────────────────────┤
│ claude-3-5-sonnet-20240620 │ Fickle winds whisper               │
│                            │ Maple leaves dance, snow then sun  │
└────────────────────────────┴────────────────────────────────────┘
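Results can also be filtered with a logical expression on any of their components. For example, a sketch that inspects the haiku from a single model:
[ ]:
# Inspect the haiku from one model only
results.filter("model.model == 'gpt-4o'").select("haiku").print(format="rich")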
Conducting a review
Next we create a question prompting a model to evaluate a response, using each haiku generated above as an input to the new question:
[8]:
from edsl import QuestionLinearScale
q_score = QuestionLinearScale(
question_name="score",
question_text="Score the following haiku on a scale from 0 to 10: {{ haiku }}",
question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
option_labels={0: "Very poor", 10: "Excellent"},
)
Parameterizing questions
We use Scenario objects to add each response to the new question. EDSL comes with many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), as well as from Results objects:
[9]:
scenarios = (
results.to_scenario_list()
.select("model", "haiku")
.rename({"model":"drafting_model"}) # renaming the 'model' field to distinguish the evaluating model
)
scenarios
[9]:
{
"scenarios": [
{
"drafting_model": "gpt-4o",
"haiku": "Maple leaves flutter,\nMist dances on cool breeze,"
},
{
"drafting_model": "gemini-pro",
"haiku": "Snow falls soft and white,\nSpring brings rain, summer's heat,"
},
{
"drafting_model": "claude-3-5-sonnet-20240620",
"haiku": "Fickle winds whisper\nMaple leaves dance, snow then sun"
}
]
}
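For comparison, a scenario list can also be constructed directly from a plain list of values; a minimal sketch with illustrative texts:
[ ]:
from edsl import ScenarioList

# Equivalent construction from a list of texts (illustrative values)
example_scenarios = ScenarioList.from_list("haiku", ["First example haiku", "Second example haiku"])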
Finally, we conduct the evaluation by having each model score each haiku that was generated (without information about whether the model itself was the source):
[10]:
results = q_score.by(scenarios).by(models).run()
[11]:
results.columns
[11]:
['agent.agent_instruction',
'agent.agent_name',
'answer.score',
'comment.score_comment',
'generated_tokens.score_generated_tokens',
'iteration.iteration',
'model.frequency_penalty',
'model.logprobs',
'model.maxOutputTokens',
'model.max_tokens',
'model.model',
'model.presence_penalty',
'model.stopSequences',
'model.temperature',
'model.topK',
'model.topP',
'model.top_logprobs',
'model.top_p',
'prompt.score_system_prompt',
'prompt.score_user_prompt',
'question_options.score_question_options',
'question_text.score_question_text',
'question_type.score_question_type',
'raw_model_response.score_cost',
'raw_model_response.score_one_usd_buys',
'raw_model_response.score_raw_model_response',
'scenario.drafting_model',
'scenario.haiku']
[12]:
(
results.sort_by("drafting_model", "model")
.select("drafting_model", "model", "score", "haiku")
.print(
pretty_labels = {
"scenario.drafting_model": "Drafting model",
"model.model": "Scoring model",
"answer.score": "Score",
"scenario.haiku": "Haiku"
},
format="rich"
)
)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Drafting model             ┃ Scoring model              ┃ Score ┃ Haiku                              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ claude-3-5-sonnet-20240620 │ claude-3-5-sonnet-20240620 │ 7     │ Fickle winds whisper               │
│                            │                            │       │ Maple leaves dance, snow then sun  │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ claude-3-5-sonnet-20240620 │ gemini-pro                 │ 8     │ Fickle winds whisper               │
│                            │                            │       │ Maple leaves dance, snow then sun  │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ claude-3-5-sonnet-20240620 │ gpt-4o                     │ 7     │ Fickle winds whisper               │
│                            │                            │       │ Maple leaves dance, snow then sun  │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gemini-pro                 │ claude-3-5-sonnet-20240620 │ 6     │ Snow falls soft and white,         │
│                            │                            │       │ Spring brings rain, summer's heat, │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gemini-pro                 │ gemini-pro                 │ 5     │ Snow falls soft and white,         │
│                            │                            │       │ Spring brings rain, summer's heat, │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gemini-pro                 │ gpt-4o                     │ 4     │ Snow falls soft and white,         │
│                            │                            │       │ Spring brings rain, summer's heat, │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gpt-4o                     │ claude-3-5-sonnet-20240620 │ 6     │ Maple leaves flutter,              │
│                            │                            │       │ Mist dances on cool breeze,        │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gpt-4o                     │ gemini-pro                 │ 5     │ Maple leaves flutter,              │
│                            │                            │       │ Mist dances on cool breeze,        │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gpt-4o                     │ gpt-4o                     │ 9     │ Maple leaves flutter,              │
│                            │                            │       │ Mist dances on cool breeze,        │
└────────────────────────────┴────────────────────────────┴───────┴────────────────────────────────────┘
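To look for a self-preference bias, we can export the results to a pandas DataFrame and compare the score each model gave its own haiku with the scores it received from the other models. A sketch, assuming the column names shown by results.columns above:
[ ]:
df = results.to_pandas()

# Flag rows where the scoring model is also the drafting model
df["self_score"] = df["model.model"] == df["scenario.drafting_model"]

# Average score by drafting model, split by self- vs. other-scoring
df["answer.score"] = df["answer.score"].astype(float)
df.groupby(["scenario.drafting_model", "self_score"])["answer.score"].mean()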
Posting to the Coop
The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.
Here we post this notebook:
[13]:
from edsl import Notebook
[14]:
n = Notebook(path = "explore_llm_biases.ipynb")
[15]:
n.push(description = "Example code for comparing model responses and biases", visibility = "public")
[15]:
{'description': 'Example code for comparing model responses and biases',
'object_type': 'notebook',
'url': 'https://www.expectedparrot.com/content/07ec8176-c07e-4f83-acd5-791e3d9324d2',
'uuid': '07ec8176-c07e-4f83-acd5-791e3d9324d2',
'version': '0.1.33.dev1',
'visibility': 'public'}
To update an object:
[16]:
n = Notebook(path = "explore_llm_biases.ipynb") # resave the notebook to capture any changes
[17]:
n.patch(uuid = "07ec8176-c07e-4f83-acd5-791e3d9324d2", value = n)
[17]:
{'status': 'success'}