Comparing model performance

In this notebook we show how to use EDSL to prompt a set of models to answer the same survey at once and compare their responses. We also demonstrate how to prompt models to evaluate the content they have generated.

[1]:
from edsl import Model, ModelList, ScenarioList, QuestionFreeText, QuestionLinearScale, Survey
[2]:
m = ModelList([
    Model("claude-3-7-sonnet-20250219", service_name = "anthropic"),
    Model("gemini-1.5-flash", service_name = "google"),
    Model("gpt-4o", service_name = "openai")
])
[3]:
s = ScenarioList.from_source("list", "topic", ["winter", "language models"])
[4]:
q1 = QuestionFreeText(
    question_name = "haiku",
    question_text = "Please draft a haiku about {{ scenario.topic }}."
)

q2 = QuestionLinearScale(
    question_name = "originality",
    question_text = "On a scale from 1 to 5, please rate the originality of this haiku: {{ haiku.answer }}.",
    question_options = [1,2,3,4,5],
    option_labels = {1:"Totally unoriginal", 5:"Highly original"}
)

survey = Survey(questions = [q1, q2])

survey
[4]:

Survey # questions: 2; question_name list: ['haiku', 'originality'];

# | option_labels | question_text | question_name | question_options | question_type
0 | nan | Please draft a haiku about {{ scenario.topic }}. | haiku | nan | free_text
1 | {1: 'Totally unoriginal', 5: 'Highly original'} | On a scale from 1 to 5, please rate the originality of this haiku: {{ haiku.answer }}. | originality | [1, 2, 3, 4, 5] | linear_scale
[5]:
results = survey.by(s).by(m).run()
Job Status 🦜
Completed (6 completed, 0 failed)
Identifiers
Results UUID: 2990fe19...8aa5
Use Results.pull(uuid) to fetch results.
Job UUID: d5494839...0c4c
Use Jobs.pull(uuid) to fetch job.
Status: Completed
Last updated: 2025-06-07 17:51:47
17:51:47 Job completed and Results stored on Coop. View Results
17:51:42 Job status: running - last update: 2025-06-07 05:51:42 PM
17:51:38 Job status: queued - last update: 2025-06-07 05:51:38 PM
17:51:37 View job progress here
17:51:37 Job details are available at your Coop account. Go to Remote Inference page
17:51:37 Job sent to server. (Job uuid=d5494839-8cfe-41cb-97fa-4fbe39040c4c).
17:51:37 Your survey is running at the Expected Parrot server...
17:51:36 Remote inference activated. Sending job to server...
Model Costs ($0.0057 / 0.43 credits total)
Service | Model | Input Tokens | Input Cost | Output Tokens | Output Cost | Total Cost | Total Credits
anthropic | claude-3-7-sonnet-20250219 | 311 | $0.0010 | 159 | $0.0024 | $0.0034 | 0.26
google | gemini-1.5-flash | 248 | $0.0001 | 100 | $0.0001 | $0.0002 | 0.02
openai | gpt-4o | 265 | $0.0008 | 124 | $0.0013 | $0.0021 | 0.15
Totals | | 824 | $0.0019 | 383 | $0.0038 | $0.0057 | 0.43

The total credit cost is normally the total USD cost multiplied by 100. A lower credit total, as in the table above ($0.0057 would be 0.57 credits, but only 0.43 were charged), indicates that you saved money by retrieving some responses from the universal remote cache.
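The relationship between USD cost and credits can be checked by hand. The following is a minimal sketch that uses only the figures in the Model Costs table; the helper name `credits_saved` is hypothetical, and attributing the whole difference to the cache is an assumption taken from the note above. Because the table values are rounded, the per-service results are approximate.

```python
CREDITS_PER_USD = 100  # conversion stated in the note above

# (Total Cost in USD, Total Credits) copied from the Model Costs table
rows = {
    "anthropic": (0.0034, 0.26),
    "google": (0.0002, 0.02),
    "openai": (0.0021, 0.15),
}

def credits_saved(usd_cost, credits_charged):
    """Credits not charged, relative to the full USD-equivalent price.

    Assumption: the whole gap is due to cached responses.
    """
    return round(usd_cost * CREDITS_PER_USD - credits_charged, 2)

# Per-service savings, then the savings for the Totals row
savings = {
    service: credits_saved(usd, charged)
    for service, (usd, charged) in rows.items()
}
total_saved = credits_saved(0.0057, 0.43)

print(savings)
print(total_saved)
```

A result of zero for a service means that its credit charge matches its USD cost, so nothing in the table suggests a cache hit for that service.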

[6]:
results.select("model", "topic", "haiku", "originality")
[6]:
# | model.model | scenario.topic | answer.haiku | answer.originality
0 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2
1 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2
2 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2
3 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4
4 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 2
5 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4

Next we prompt each model to rate every haiku

We modify the second question to use a scenario for each haiku instead of piping the answer from the first question (i.e., {{ haiku.answer }} is changed to {{ scenario.haiku }}):

[7]:
new_q = QuestionLinearScale(
    question_name = "originality",
    question_text = "On a scale from 1 to 5, please rate the originality of this haiku: {{ scenario.haiku }}.",
    question_options = [1,2,3,4,5],
    option_labels = {1:"Totally unoriginal", 5:"Highly original"}
)
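The change from a piped answer to a scenario field in the question above can be pictured as a placeholder lookup. The sketch below is a toy stand-in written with the standard library only; it is not EDSL's renderer, and the `render` helper is hypothetical. It only illustrates that `{{ scenario.haiku }}` is filled from whatever scenario the question is paired with, rather than from an earlier answer in the same survey.

```python
import re

def render(template, context):
    """Fill {{ a.b }} placeholders from a nested dict (illustrative only)."""
    def lookup(match):
        value = context
        # Walk the dotted path, e.g. "scenario.haiku" -> context["scenario"]["haiku"]
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)

question_text = (
    "On a scale from 1 to 5, please rate the originality of this haiku: "
    "{{ scenario.haiku }}."
)

# One of the haikus from the results table, supplied as a scenario value
scenario = {
    "haiku": "Data flows like streams, Words bloom, a digital flower, Meaning takes its form."
}

prompt = render(question_text, {"scenario": scenario})
print(prompt)
```

With this framing, every model can be shown every haiku, because the haiku text travels with the scenario instead of with the model that drafted it.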
[8]:
haikus = results.select("model", "topic", "haiku").to_scenario_list().rename({"model":"drafting_model"})
haikus
[8]:

ScenarioList scenarios: 6; keys: ['haiku', 'drafting_model', 'topic'];

# | drafting_model | topic | haiku
0 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces
1 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all.
2 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter.
3 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts.
4 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form.
5 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak.
[9]:
new_results = new_q.by(haikus).by(m).run()
Job Status 🦜
Completed (18 completed, 0 failed)
Identifiers
Results UUID: 6c2fdd89...fd76
Use Results.pull(uuid) to fetch results.
Job UUID: 31c3f45d...7bfe
Use Jobs.pull(uuid) to fetch job.
Status: Completed
Last updated: 2025-06-07 17:51:54
17:51:54 Job completed and Results stored on Coop. View Results
17:51:49 Job status: queued - last update: 2025-06-07 05:51:49 PM
17:51:49 View job progress here
17:51:49 Job details are available at your Coop account. Go to Remote Inference page
17:51:49 Job sent to server. (Job uuid=31c3f45d-a916-47ac-a1b9-9804f8ee7bfe).
17:51:49 Your survey is running at the Expected Parrot server...
17:51:48 Remote inference activated. Sending job to server...
Model Costs ($0.0125 / 0.00 credits total)
Service | Model | Input Tokens | Input Cost | Output Tokens | Output Cost | Total Cost | Total Credits
anthropic | claude-3-7-sonnet-20250219 | 839 | $0.0026 | 350 | $0.0053 | $0.0079 | 0.00
google | gemini-1.5-flash | 699 | $0.0001 | 229 | $0.0001 | $0.0002 | 0.00
openai | gpt-4o | 697 | $0.0018 | 255 | $0.0026 | $0.0044 | 0.00
Totals | | 2,235 | $0.0045 | 834 | $0.0080 | $0.0125 | 0.00

The total credit cost is normally the total USD cost multiplied by 100. Here the total is 0.00 credits against a USD cost of $0.0125 (1.25 credits), which indicates that the responses were retrieved from the universal remote cache.
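The job counts reported in the two status panels (6 and 18 completed) follow from how scenarios and models are combined: each scenario is paired with each model. A minimal sketch of that arithmetic, using `itertools.product` as a stand-in for the pairing (this is an illustration, not a description of EDSL internals):

```python
from itertools import product

models = ["claude-3-7-sonnet-20250219", "gemini-1.5-flash", "gpt-4o"]
topics = ["winter", "language models"]

# First run: every model drafts (and rates) a haiku for every topic.
drafting_jobs = list(product(topics, models))

# Second run: every model rates every haiku produced in the first run.
rating_jobs = list(product(drafting_jobs, models))

print(len(drafting_jobs))  # 2 topics x 3 models
print(len(rating_jobs))    # 6 haikus x 3 models
```

Adding a model or a topic grows the second run faster than the first, which is worth keeping in mind when estimating costs.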

[10]:
(
    new_results
    .sort_by("topic", "drafting_model", "model")
    .select("model", "drafting_model", "topic", "haiku", "originality")
)
[10]:
# | model.model | scenario.drafting_model | scenario.topic | scenario.haiku | answer.originality
0 | claude-3-7-sonnet-20250219 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4
1 | gemini-1.5-flash | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 3
2 | gpt-4o | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4
3 | claude-3-7-sonnet-20250219 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 3
4 | gemini-1.5-flash | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 2
5 | gpt-4o | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 3
6 | claude-3-7-sonnet-20250219 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4
7 | gemini-1.5-flash | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 3
8 | gpt-4o | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4
9 | claude-3-7-sonnet-20250219 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2
10 | gemini-1.5-flash | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2
11 | gpt-4o | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2
12 | claude-3-7-sonnet-20250219 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 3
13 | gemini-1.5-flash | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2
14 | gpt-4o | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2
15 | claude-3-7-sonnet-20250219 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2
16 | gemini-1.5-flash | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2
17 | gpt-4o | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2
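To compare the models, the 18 ratings above can be summarized by rating model and by drafting model. The following is a rough sketch in plain Python with the scores copied by hand from the table; the helper `mean_by` and the variable names are hypothetical, and with only 18 ratings from a single run the averages should be read as illustrative rather than conclusive.

```python
from collections import defaultdict
from statistics import mean

C, G, O = "claude-3-7-sonnet-20250219", "gemini-1.5-flash", "gpt-4o"

# (rating model, drafting model, topic, originality), copied from the table above
ratings = [
    (C, C, "language models", 4), (G, C, "language models", 3), (O, C, "language models", 4),
    (C, G, "language models", 3), (G, G, "language models", 2), (O, G, "language models", 3),
    (C, O, "language models", 4), (G, O, "language models", 3), (O, O, "language models", 4),
    (C, C, "winter", 2), (G, C, "winter", 2), (O, C, "winter", 2),
    (C, G, "winter", 3), (G, G, "winter", 2), (O, G, "winter", 2),
    (C, O, "winter", 2), (G, O, "winter", 2), (O, O, "winter", 2),
]

def mean_by(rows, index):
    """Average originality grouped by the tuple field at `index`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row[3])
    return {key: round(mean(scores), 2) for key, scores in groups.items()}

by_rater = mean_by(ratings, 0)    # how generously each model scores
by_drafter = mean_by(ratings, 1)  # how each model's haikus were scored
by_topic = mean_by(ratings, 2)

# Compare scores a model gave its own haikus with scores it gave to others
self_mean = round(mean(r[3] for r in ratings if r[0] == r[1]), 2)
other_mean = round(mean(r[3] for r in ratings if r[0] != r[1]), 2)

print(by_rater, by_drafter, by_topic, self_mean, other_mean, sep="\n")
```

Grouping by both rater and drafter separates two questions that a single average would blur: whether a model writes haikus that others find original, and whether a model tends to score everything higher or lower than its peers.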

Posting this notebook to Coop

[11]:
# from edsl import Notebook

# nb = Notebook(path = "models_scoring_models.ipynb")

# nb.push(
#     description = "Models scoring models",
#     alias = "models-scoring-models-notebook",
#     visibility = "public"
# )

Updating an object at Coop:

[ ]:
from edsl import Notebook

nb = Notebook(path = "models_scoring_models.ipynb") # resave

nb.patch("https://www.expectedparrot.com/content/RobinHorton/models-scoring-models-notebook", value = nb)