Comparing model performance
In this notebook we show how to use EDSL to prompt a set of models to answer the same survey at once and compare their responses. We also demonstrate how to prompt models to evaluate the content they have generated.
[1]:
from edsl import Model, ModelList, ScenarioList, QuestionFreeText, QuestionLinearScale, Survey
[2]:
m = ModelList([
    Model("claude-3-7-sonnet-20250219", service_name = "anthropic"),
    Model("gemini-1.5-flash", service_name = "google"),
    Model("gpt-4o", service_name = "openai")
])
[3]:
s = ScenarioList.from_source("list", "topic", ["winter", "language models"])
[4]:
q1 = QuestionFreeText(
    question_name = "haiku",
    question_text = "Please draft a haiku about {{ scenario.topic }}."
)
q2 = QuestionLinearScale(
    question_name = "originality",
    question_text = "On a scale from 1 to 5, please rate the originality of this haiku: {{ haiku.answer }}.",
    question_options = [1, 2, 3, 4, 5],
    option_labels = {1: "Totally unoriginal", 5: "Highly original"}
)
survey = Survey(questions = [q1, q2])
survey
[4]:
Survey # questions: 2; question_name list: ['haiku', 'originality'];
| | option_labels | question_text | question_name | question_options | question_type |
|---|---|---|---|---|---|
| 0 | nan | Please draft a haiku about {{ scenario.topic }}. | haiku | nan | free_text |
| 1 | {1: 'Totally unoriginal', 5: 'Highly original'} | On a scale from 1 to 5, please rate the originality of this haiku: {{ haiku.answer }}. | originality | [1, 2, 3, 4, 5] | linear_scale |
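The second question pipes the first question's answer into its text via the {{ haiku.answer }} placeholder. As a simplified illustration of how that substitution works (a sketch only, not EDSL's actual templating engine, which uses Jinja-style templates):

```python
# Simplified illustration of how the {{ haiku.answer }} placeholder in the
# second question is filled with the first question's answer. This is a sketch,
# not EDSL's actual templating engine (EDSL uses Jinja-style templates).
import re

def render(template: str, context: dict) -> str:
    # Replace each {{ dotted.name }} placeholder with its value from context,
    # leaving unknown placeholders untouched.
    return re.sub(
        r"\{\{\s*([\w.]+)\s*\}\}",
        lambda m: str(context.get(m.group(1), m.group(0))),
        template,
    )

text = render(
    "On a scale from 1 to 5, please rate the originality of this haiku: {{ haiku.answer }}.",
    {"haiku.answer": "Snowflakes drift downward"},
)
# text now contains the haiku in place of the placeholder
```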
[5]:
results = survey.by(s).by(m).run()
| Service | Model | Input Tokens | Input Cost | Output Tokens | Output Cost | Total Cost | Total Credits |
|---|---|---|---|---|---|---|---|
| anthropic | claude-3-7-sonnet-20250219 | 311 | $0.0010 | 159 | $0.0024 | $0.0034 | 0.26 |
| google | gemini-1.5-flash | 248 | $0.0001 | 100 | $0.0001 | $0.0002 | 0.02 |
| openai | gpt-4o | 265 | $0.0008 | 124 | $0.0013 | $0.0021 | 0.15 |
| Totals | | 824 | $0.0019 | 383 | $0.0038 | $0.0057 | 0.43 |
The credit cost is the total USD cost multiplied by 100 (i.e., 1 credit = $0.01). A credit total lower than this implied amount indicates that some responses were retrieved from the universal remote cache at no charge, saving you money.
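A quick sanity check of this arithmetic, using the totals from the run table above (the 1 credit = $0.01 conversion is inferred from the multiply-by-100 rule):

```python
# Credits arithmetic sketch: credits = USD cost * 100 (i.e., 1 credit = $0.01).
# Figures are copied from the Totals row of the run table above.
total_usd = 0.0057       # Total Cost column
credits_charged = 0.43   # Total Credits column

# What the run would have cost with no cache hits:
credits_without_cache = round(total_usd * 100, 2)

# The gap is what the universal remote cache saved:
cache_savings = round(credits_without_cache - credits_charged, 2)
```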
[6]:
results.select("model", "topic", "haiku", "originality")
[6]:
| | model.model | scenario.topic | answer.haiku | answer.originality |
|---|---|---|---|---|
| 0 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2 |
| 1 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2 |
| 2 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2 |
| 3 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4 |
| 4 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 2 |
| 5 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4 |
Next we prompt each model to rate every haiku
We modify the second question to use a scenario for each haiku instead of piping the answer from the first question (i.e., {{ haiku.answer }} becomes {{ scenario.haiku }}):
[7]:
new_q = QuestionLinearScale(
    question_name = "originality",
    question_text = "On a scale from 1 to 5, please rate the originality of this haiku: {{ scenario.haiku }}.",
    question_options = [1, 2, 3, 4, 5],
    option_labels = {1: "Totally unoriginal", 5: "Highly original"}
)
[8]:
haikus = results.select("model", "topic", "haiku").to_scenario_list().rename({"model":"drafting_model"})
haikus
[8]:
ScenarioList scenarios: 6; keys: ['haiku', 'drafting_model', 'topic'];
| | drafting_model | topic | haiku |
|---|---|---|---|
| 0 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces |
| 1 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. |
| 2 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. |
| 3 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. |
| 4 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. |
| 5 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. |
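Renaming the "model" key to "drafting_model" lets us distinguish the model that wrote each haiku from the model that will score it. A plain-Python sketch of what the rename does (each scenario treated as a simple key/value mapping; not EDSL's internal implementation):

```python
# Plain-Python sketch of what .rename({"model": "drafting_model"}) does to a
# ScenarioList: each scenario is a key/value mapping, and rename swaps a key
# while leaving the values untouched. Not EDSL's internal implementation.
def rename_key(scenarios, mapping):
    return [
        {mapping.get(key, key): value for key, value in scenario.items()}
        for scenario in scenarios
    ]

scenarios = [{"model": "gpt-4o", "topic": "winter", "haiku": "Snow blankets the earth..."}]
renamed = rename_key(scenarios, {"model": "drafting_model"})
# renamed[0] has a "drafting_model" key and no "model" key
```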
[9]:
new_results = new_q.by(haikus).by(m).run()
| Service | Model | Input Tokens | Input Cost | Output Tokens | Output Cost | Total Cost | Total Credits |
|---|---|---|---|---|---|---|---|
| anthropic | claude-3-7-sonnet-20250219 | 839 | $0.0026 | 350 | $0.0053 | $0.0079 | 0.00 |
| google | gemini-1.5-flash | 699 | $0.0001 | 229 | $0.0001 | $0.0002 | 0.00 |
| openai | gpt-4o | 697 | $0.0018 | 255 | $0.0026 | $0.0044 | 0.00 |
| Totals | | 2,235 | $0.0045 | 834 | $0.0080 | $0.0125 | 0.00 |
[10]:
(
    new_results
    .sort_by("topic", "drafting_model", "model")
    .select("model", "drafting_model", "topic", "haiku", "originality")
)
[10]:
| | model.model | scenario.drafting_model | scenario.topic | scenario.haiku | answer.originality |
|---|---|---|---|---|---|
| 0 | claude-3-7-sonnet-20250219 | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4 |
| 1 | gemini-1.5-flash | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 3 |
| 2 | gpt-4o | claude-3-7-sonnet-20250219 | language models | Words dance in code, Patterns weave through silicon— Echoes of our thoughts. | 4 |
| 3 | claude-3-7-sonnet-20250219 | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 3 |
| 4 | gemini-1.5-flash | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 2 |
| 5 | gpt-4o | gemini-1.5-flash | language models | Data flows like streams, Words bloom, a digital flower, Meaning takes its form. | 3 |
| 6 | claude-3-7-sonnet-20250219 | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4 |
| 7 | gemini-1.5-flash | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 3 |
| 8 | gpt-4o | gpt-4o | language models | Words dance in silence, Patterns weave through vast data— Machines learn to speak. | 4 |
| 9 | claude-3-7-sonnet-20250219 | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2 |
| 10 | gemini-1.5-flash | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2 |
| 11 | gpt-4o | claude-3-7-sonnet-20250219 | winter | Snowflakes drift downward Blanket of white hides the earth Silence embraces | 2 |
| 12 | claude-3-7-sonnet-20250219 | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 3 |
| 13 | gemini-1.5-flash | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2 |
| 14 | gpt-4o | gemini-1.5-flash | winter | White breath in the air, Frozen ground crunches below, Silence blankets all. | 2 |
| 15 | claude-3-7-sonnet-20250219 | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2 |
| 16 | gemini-1.5-flash | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2 |
| 17 | gpt-4o | gpt-4o | winter | Snow blankets the earth, Silent whispers fill the air, Cold breath of winter. | 2 |
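With all 18 ratings in hand, a natural summary is the average originality score each drafting model received across its raters. A plain-Python sketch (the (drafting_model, originality) pairs below are copied from the table above, with model names abbreviated for brevity):

```python
# Average originality per drafting model, computed from the ratings shown in
# the table above. Model names are abbreviated; the numbers are copied from
# the answer.originality column.
from collections import defaultdict
from statistics import mean

ratings = [
    # language models topic
    ("claude", 4), ("claude", 3), ("claude", 4),
    ("gemini", 3), ("gemini", 2), ("gemini", 3),
    ("gpt-4o", 4), ("gpt-4o", 3), ("gpt-4o", 4),
    # winter topic
    ("claude", 2), ("claude", 2), ("claude", 2),
    ("gemini", 3), ("gemini", 2), ("gemini", 2),
    ("gpt-4o", 2), ("gpt-4o", 2), ("gpt-4o", 2),
]

by_model = defaultdict(list)
for model, score in ratings:
    by_model[model].append(score)

averages = {model: round(mean(scores), 2) for model, scores in by_model.items()}
```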
Posting this notebook to Coop
[11]:
# from edsl import Notebook
# nb = Notebook(path = "models_scoring_models.ipynb")
# nb.push(
# description = "Models scoring models",
# alias = "models-scoring-models-notebook",
# visibility = "public"
# )
Updating an object at Coop:
[ ]:
from edsl import Notebook
nb = Notebook(path = "models_scoring_models.ipynb") # resave
nb.patch("https://www.expectedparrot.com/content/RobinHorton/models-scoring-models-notebook", value = nb)