Comparing model performance

In this notebook we show how to use EDSL to prompt several models to answer the same survey in a single run and compare their responses. We also demonstrate how to prompt models to evaluate content that they and the other models have generated.

[1]:
from edsl import Model, ModelList, ScenarioList, QuestionFreeText, QuestionLinearScale, Survey
[2]:
m = ModelList([
    Model("claude-3-7-sonnet-20250219", service_name = "anthropic"),
    Model("gemini-1.5-flash", service_name = "google"),
    Model("gpt-4o", service_name = "openai")
])
[3]:
s = ScenarioList.from_list("topic", ["winter", "language models"])
[4]:
q1 = QuestionFreeText(
    question_name = "haiku",
    question_text = "Please draft a haiku about {{ scenario.topic }}."
)

q2 = QuestionLinearScale(
    question_name = "originality",
    question_text = "On a scale from 1 to 5, please rate the originality of this haiku: {{ haiku.answer }}.",
    question_options = [1,2,3,4,5],
    option_labels = {1:"Totally unoriginal", 5:"Highly original"}
)

survey = Survey(questions = [q1, q2])
[5]:
results = survey.by(s).by(m).run()
Job Status (2025-03-03 15:13:46)
Job UUID 07cd9fdf-06fe-4710-a341-bccb5a1fb6d2
Progress Bar URL https://www.expectedparrot.com/home/remote-job-progress/07cd9fdf-06fe-4710-a341-bccb5a1fb6d2
Exceptions Report URL None
Results UUID e452fa95-dc8c-4092-a8c0-2265316827f3
Results URL https://www.expectedparrot.com/content/e452fa95-dc8c-4092-a8c0-2265316827f3
Current Status: Job completed and Results stored on Coop: https://www.expectedparrot.com/content/e452fa95-dc8c-4092-a8c0-2265316827f3
[6]:
results.select("model", "topic", "haiku", "originality")
[6]:
  | model.model | scenario.topic | answer.haiku | answer.originality
0 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2
1 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2
2 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2
3 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4
4 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 2
5 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4
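The self-assigned scores in the table above can be summarized per model and per topic. The following is a minimal sketch in plain Python with no EDSL calls: the rows are copied by hand from the table, and the names `self_scores` and `mean_by` are illustrative, not part of EDSL. With only six rows the averages are descriptive only.

```python
from collections import defaultdict
from statistics import mean

# (model, topic, self-assigned originality), copied by hand from the table above.
self_scores = [
    ("claude-3-7-sonnet-20250219", "winter", 2),
    ("gemini-1.5-flash", "winter", 2),
    ("gpt-4o", "winter", 2),
    ("claude-3-7-sonnet-20250219", "language models", 4),
    ("gemini-1.5-flash", "language models", 2),
    ("gpt-4o", "language models", 4),
]

def mean_by(rows, key_index):
    """Average the score (last field) of rows grouped by the field at key_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[-1])
    return {key: mean(scores) for key, scores in groups.items()}

mean_by_model = mean_by(self_scores, 0)
mean_by_topic = mean_by(self_scores, 1)

for label, value in {**mean_by_model, **mean_by_topic}.items():
    print(f"{label}: {value:.2f}")
```

Grouping by topic shows that every winter haiku received a 2 from its own author, while the language-model haikus average about 3.33.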

Next we prompt each model to rate every haiku

We modify the second question to use a scenario for each haiku instead of piping the answer from the first question (i.e., {{ haiku.answer }} is changed to {{ scenario.haiku }}):
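Turning each haiku into a scenario means every model is paired with every haiku, not just its own. The sketch below counts the resulting combinations in plain Python; it makes no EDSL calls, and the names `model_names`, `haiku_keys`, and `rating_jobs` are illustrative assumptions rather than EDSL objects.

```python
from itertools import product

# Model names and topics as used in this notebook.
model_names = ["claude-3-7-sonnet-20250219", "gemini-1.5-flash", "gpt-4o"]
topics = ["winter", "language models"]

# The first run produced one haiku per (topic, drafting model) pair: 2 x 3 = 6.
haiku_keys = [
    {"drafting_model": name, "topic": topic}
    for topic, name in product(topics, model_names)
]

# In the second run every model rates every haiku: 6 x 3 = 18 ratings.
rating_jobs = [
    {"model": rater, **haiku}
    for haiku, rater in product(haiku_keys, model_names)
]

# Ratings in which a model scores its own haiku.
own_haiku_jobs = [job for job in rating_jobs if job["model"] == job["drafting_model"]]

print(len(haiku_keys), len(rating_jobs), len(own_haiku_jobs))  # prints: 6 18 6
```

Of the 18 ratings, 6 are self-ratings and 12 come from a different model, which is what makes a comparison between the two groups possible.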

[7]:
new_q = QuestionLinearScale(
    question_name = "originality",
    question_text = "On a scale from 1 to 5, please rate the originality of this haiku: {{ scenario.haiku }}.",
    question_options = [1,2,3,4,5],
    option_labels = {1:"Totally unoriginal", 5:"Highly original"}
)
[8]:
haikus = results.select("model", "topic", "haiku").to_scenario_list().rename({"model":"drafting_model"})
haikus
[8]:

ScenarioList scenarios: 6; keys: ['drafting_model', 'haiku', 'topic'];

  | drafting_model | topic | haiku
0 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces
1 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all.
2 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter.
3 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts.
4 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form.
5 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak.
[9]:
new_results = new_q.by(haikus).by(m).run()
Job Status (2025-03-03 15:13:56)
Job UUID 98737773-b1ed-4c8b-a490-8ee5f845de4b
Progress Bar URL https://www.expectedparrot.com/home/remote-job-progress/98737773-b1ed-4c8b-a490-8ee5f845de4b
Exceptions Report URL None
Results UUID 237f5692-92d8-4295-af6b-d23e03aa7ef7
Results URL https://www.expectedparrot.com/content/237f5692-92d8-4295-af6b-d23e03aa7ef7
Current Status: Job completed and Results stored on Coop: https://www.expectedparrot.com/content/237f5692-92d8-4295-af6b-d23e03aa7ef7
[10]:
(
    new_results
    .sort_by("topic", "drafting_model", "model")
    .select("model", "drafting_model", "topic", "haiku", "originality")
)
[10]:
   | model.model | scenario.drafting_model | scenario.topic | scenario.haiku | answer.originality
0  | claude-3-7-sonnet-20250219 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4
1  | gemini-1.5-flash | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 3
2  | gpt-4o | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4
3  | claude-3-7-sonnet-20250219 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 3
4  | gemini-1.5-flash | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 2
5  | gpt-4o | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 3
6  | claude-3-7-sonnet-20250219 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4
7  | gemini-1.5-flash | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 3
8  | gpt-4o | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4
9  | claude-3-7-sonnet-20250219 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2
10 | gemini-1.5-flash | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2
11 | gpt-4o | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2
12 | claude-3-7-sonnet-20250219 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 3
13 | gemini-1.5-flash | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2
14 | gpt-4o | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2
15 | claude-3-7-sonnet-20250219 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2
16 | gemini-1.5-flash | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2
17 | gpt-4o | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2
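One way to read the table above is to compare the score each model gives its own haikus with the score those haikus receive from the other two models. The following is a minimal sketch in plain Python: the ratings are copied by hand from the table, and the names `cross_ratings` and `self_vs_peer` are illustrative, not part of EDSL. With two haikus per model, the numbers are a descriptive summary rather than evidence of bias.

```python
from statistics import mean

CLAUDE = "claude-3-7-sonnet-20250219"
GEMINI = "gemini-1.5-flash"
GPT4O = "gpt-4o"

# (rating model, drafting model, topic, originality), copied from the table above.
cross_ratings = [
    (CLAUDE, CLAUDE, "language models", 4), (GEMINI, CLAUDE, "language models", 3),
    (GPT4O, CLAUDE, "language models", 4), (CLAUDE, GEMINI, "language models", 3),
    (GEMINI, GEMINI, "language models", 2), (GPT4O, GEMINI, "language models", 3),
    (CLAUDE, GPT4O, "language models", 4), (GEMINI, GPT4O, "language models", 3),
    (GPT4O, GPT4O, "language models", 4),
    (CLAUDE, CLAUDE, "winter", 2), (GEMINI, CLAUDE, "winter", 2),
    (GPT4O, CLAUDE, "winter", 2), (CLAUDE, GEMINI, "winter", 3),
    (GEMINI, GEMINI, "winter", 2), (GPT4O, GEMINI, "winter", 2),
    (CLAUDE, GPT4O, "winter", 2), (GEMINI, GPT4O, "winter", 2),
    (GPT4O, GPT4O, "winter", 2),
]

# For each drafting model: (mean self-rating, mean rating from the other models).
self_vs_peer = {}
for drafter in (CLAUDE, GEMINI, GPT4O):
    own = [score for rater, d, _, score in cross_ratings if d == drafter and rater == drafter]
    peers = [score for rater, d, _, score in cross_ratings if d == drafter and rater != drafter]
    self_vs_peer[drafter] = (mean(own), mean(peers))

for drafter, (own_mean, peer_mean) in self_vs_peer.items():
    print(f"{drafter}: self={own_mean:.2f} peers={peer_mean:.2f}")
```

On this data all three models receive a mean of 2.75 from their peers. claude-3-7-sonnet-20250219 and gpt-4o rate their own haikus slightly higher (3.00), while gemini-1.5-flash rates its own lower (2.00).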

Posting this notebook to Coop

[11]:
from edsl import Notebook

nb = Notebook(path = "models_scoring_models.ipynb")

# Set to True to post the notebook as a new Coop object;
# leave as False to update the existing object in place.
refresh = False

if refresh:
    nb.push(
        description = "Models scoring models",
        alias = "models-scoring-models-notebook",
        visibility = "public"
    )
else:
    nb.patch("https://www.expectedparrot.com/content/RobinHorton/models-scoring-models-notebook", value = nb)