Comparing model performance
This notebook shows how to use EDSL to run the same survey with several models at once and compare their responses. It also shows how to prompt the models to evaluate content they have generated, first their own and then each other's.
[1]:
from edsl import Model, ModelList, ScenarioList, QuestionFreeText, QuestionLinearScale, Survey
[2]:
m = ModelList([
    Model("claude-3-7-sonnet-20250219", service_name = "anthropic"),
    Model("gemini-1.5-flash", service_name = "google"),
    Model("gpt-4o", service_name = "openai")
])
[3]:
s = ScenarioList.from_list("topic", ["winter", "language models"])
[4]:
q1 = QuestionFreeText(
    question_name = "haiku",
    question_text = "Please draft a haiku about {{ scenario.topic }}."
)

q2 = QuestionLinearScale(
    question_name = "originality",
    question_text = "On a scale from 1 to 5, please rate the originality of this haiku: {{ haiku.answer }}.",
    question_options = [1,2,3,4,5],
    option_labels = {1:"Totally unoriginal", 5:"Highly original"}
)

survey = Survey(questions = [q1, q2])
[5]:
results = survey.by(s).by(m).run()
Job Status (2025-03-03 15:13:46)
Job UUID | 07cd9fdf-06fe-4710-a341-bccb5a1fb6d2 |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/07cd9fdf-06fe-4710-a341-bccb5a1fb6d2 |
Exceptions Report URL | None |
Results UUID | e452fa95-dc8c-4092-a8c0-2265316827f3 |
Results URL | https://www.expectedparrot.com/content/e452fa95-dc8c-4092-a8c0-2265316827f3 |
✓ Current Status: Job completed and Results stored on Coop: https://www.expectedparrot.com/content/e452fa95-dc8c-4092-a8c0-2265316827f3
[6]:
results.select("model", "topic", "haiku", "originality")
[6]:
 | model.model | scenario.topic | answer.haiku | answer.originality |
---|---|---|---|---|
0 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2 |
1 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2 |
2 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2 |
3 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4 |
4 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 2 |
5 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4 |
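The six rows above are one per pairing of a scenario with a model: two topics times three models. The sketch below reproduces that pairing in plain Python with `itertools.product`. It only illustrates the combinatorics of `survey.by(s).by(m)`; it is not a description of how EDSL schedules jobs internally.

```python
from itertools import product

# Values copied from the cells above.
topics = ["winter", "language models"]
models = ["claude-3-7-sonnet-20250219", "gemini-1.5-flash", "gpt-4o"]

# One survey run per (topic, model) pair.
pairs = list(product(topics, models))

for topic, model in pairs:
    print(f"{model} -> {topic}")

print(len(pairs))
```

The same reasoning applies when the scenarios are haikus rather than topics: six haikus rated by three models give eighteen pairings.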
Next we prompt each model to rate every haiku
We modify the second question to use a scenario for each haiku instead of piping the answer from the first question (i.e., {{ haiku.answer }} is changed to {{ scenario.haiku }}):
[7]:
new_q = QuestionLinearScale(
    question_name = "originality",
    question_text = "On a scale from 1 to 5, please rate the originality of this haiku: {{ scenario.haiku }}.",
    question_options = [1,2,3,4,5],
    option_labels = {1:"Totally unoriginal", 5:"Highly original"}
)
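To make the `{{ scenario.haiku }}` placeholder concrete, here is a toy rendering in plain Python. The `render` helper is hypothetical and written only for this illustration; it is a stand-in for the templating step, not EDSL's own implementation, and it handles nothing beyond `{{ scenario.<key> }}` placeholders.

```python
import re

# Question text copied from the cell above.
template = (
    "On a scale from 1 to 5, please rate the originality of this haiku: "
    "{{ scenario.haiku }}."
)

def render(text, scenario):
    """Fill {{ scenario.<key> }} placeholders from a dict (hypothetical helper)."""
    def fill(match):
        return str(scenario[match.group(1)])
    return re.sub(r"\{\{\s*scenario\.(\w+)\s*\}\}", fill, text)

# One haiku from the first set of results, used as a scenario value.
scenario = {
    "haiku": "Snow blankets the earth, Silent whispers fill the air, Cold breath of winter."
}

prompt = render(template, scenario)
print(prompt)
```

Each model then receives one such prompt per haiku, regardless of which model drafted it.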
[8]:
haikus = results.select("model", "topic", "haiku").to_scenario_list().rename({"model":"drafting_model"})
haikus
[8]:
ScenarioList scenarios: 6; keys: ['drafting_model', 'haiku', 'topic'];
 | drafting_model | topic | haiku |
---|---|---|---|
0 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces |
1 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. |
2 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. |
3 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. |
4 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. |
5 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. |
[9]:
new_results = new_q.by(haikus).by(m).run()
Job Status (2025-03-03 15:13:56)
Job UUID | 98737773-b1ed-4c8b-a490-8ee5f845de4b |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/98737773-b1ed-4c8b-a490-8ee5f845de4b |
Exceptions Report URL | None |
Results UUID | 237f5692-92d8-4295-af6b-d23e03aa7ef7 |
Results URL | https://www.expectedparrot.com/content/237f5692-92d8-4295-af6b-d23e03aa7ef7 |
✓ Current Status: Job completed and Results stored on Coop: https://www.expectedparrot.com/content/237f5692-92d8-4295-af6b-d23e03aa7ef7
[10]:
(
    new_results
    .sort_by("topic", "drafting_model", "model")
    .select("model", "drafting_model", "topic", "haiku", "originality")
)
[10]:
 | model.model | scenario.drafting_model | scenario.topic | scenario.haiku | answer.originality |
---|---|---|---|---|---|
0 | claude-3-7-sonnet-20250219 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4 |
1 | gemini-1.5-flash | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 3 |
2 | gpt-4o | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4 |
3 | claude-3-7-sonnet-20250219 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 3 |
4 | gemini-1.5-flash | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 2 |
5 | gpt-4o | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 3 |
6 | claude-3-7-sonnet-20250219 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4 |
7 | gemini-1.5-flash | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 3 |
8 | gpt-4o | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4 |
9 | claude-3-7-sonnet-20250219 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2 |
10 | gemini-1.5-flash | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2 |
11 | gpt-4o | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2 |
12 | claude-3-7-sonnet-20250219 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 3 |
13 | gemini-1.5-flash | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2 |
14 | gpt-4o | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2 |
15 | claude-3-7-sonnet-20250219 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2 |
16 | gemini-1.5-flash | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2 |
17 | gpt-4o | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2 |
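One way to compare the models is to average the ratings in the table above, once by the model that gave the rating and once by the model that drafted the haiku. The sketch below does this in plain Python on values copied by hand from the table; it does not use the EDSL results object, and the `mean_by` helper is introduced here only for illustration.

```python
from collections import defaultdict
from statistics import mean

C, G, O = "claude-3-7-sonnet-20250219", "gemini-1.5-flash", "gpt-4o"

# (rating model, drafting model, topic, originality), copied from the table above.
rows = [
    (C, C, "language models", 4), (G, C, "language models", 3), (O, C, "language models", 4),
    (C, G, "language models", 3), (G, G, "language models", 2), (O, G, "language models", 3),
    (C, O, "language models", 4), (G, O, "language models", 3), (O, O, "language models", 4),
    (C, C, "winter", 2), (G, C, "winter", 2), (O, C, "winter", 2),
    (C, G, "winter", 3), (G, G, "winter", 2), (O, G, "winter", 2),
    (C, O, "winter", 2), (G, O, "winter", 2), (O, O, "winter", 2),
]

def mean_by(rows, index):
    """Average the originality score, grouping on the column at `index`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row[3])
    return {key: mean(scores) for key, scores in groups.items()}

by_rater = mean_by(rows, 0)    # how generous each model is as a judge
by_drafter = mean_by(rows, 1)  # how each model's haikus were received

for name, table in [("rating model", by_rater), ("drafting model", by_drafter)]:
    print(name)
    for model, score in table.items():
        print(f"  {model}: {score:.2f}")
```

With only two topics and one haiku per model per topic, these averages describe this run only and should not be read as a general ranking.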
Posting this notebook to Coop
[11]:
from edsl import Notebook

nb = Notebook(path = "models_scoring_models.ipynb")

if refresh := False:
    # Post the notebook to Coop as a new object
    nb.push(
        description = "Models scoring models",
        alias = "models-scoring-models-notebook",
        visibility = "public"
    )
else:
    # Update the notebook that is already posted at this URL
    nb.patch("https://www.expectedparrot.com/content/RobinHorton/models-scoring-models-notebook", value = nb)