Comparing model responses
This notebook provides sample EDSL code for comparing content created by different language models and examining how models rate their own content versus content created by other models.
In a series of steps we select some models, prompt them to generate content, prompt each model to evaluate every piece of content that was generated, and then analyze the results as datasets.
EDSL is an open-source library for simulating surveys, experiments and other research with AI agents and large language models. Before running the code below, please ensure that you have installed the EDSL library and either activated remote inference from your Coop account or stored API keys for the language models that you want to use with EDSL. Please also see our documentation page for tips and tutorials on getting started using EDSL.
Selecting language models
EDSL works with many popular language models. (Please send us a request for a model you like that’s missing!) We can check a current list of available models:
[1]:
from edsl import Model
# Model.available() # uncomment this code and run it to see a current list of available models
We select models to use by creating Model objects that we will add to our survey when we run it. If we do not specify a model, the default model is used. To check the current default model:
[2]:
# Model()
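A Model can also be created with optional parameters. Here is a minimal sketch (the temperature value is an arbitrary illustration; available parameters vary by model service):
[ ]:
from edsl import Model

# Create a model with a specific temperature setting (illustrative value)
m = Model("gpt-4o", temperature=0.5)
# m  # uncomment to inspect the model and its parameters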
Here we select several models and store them as a list in order to use them all together with our survey:
[3]:
from edsl import ModelList
models = ModelList(Model(m) for m in ["gpt-4o", "gemini-pro", "claude-3-opus-20240229"])
Generating content
EDSL comes with a variety of standard survey question types that we can select to use based on the desired format of the response (multiple choice, free text, etc.). See examples of all question types.
Here we use QuestionList in order to prompt a model to provide its response in the form of a list:
[4]:
from edsl import QuestionList
q_content = QuestionList(
question_name = "content",
question_text = "What are recommended steps for conducting research with large language models?",
)
We generate a response by passing the question to a Survey object, adding the models, and then calling the run method. This will generate a Results object with a Result for each survey response:
[5]:
from edsl import Survey
# Pass a list of one or more questions to be administered together in the survey
survey = Survey([q_content])
# Run the survey with the models
results = survey.by(models).run()
We can inspect components of the results individually:
[6]:
results.select("model", "content").print(format="rich")
model.model | answer.content
---|---
gpt-4o | ['Define research objectives', 'Select the appropriate model', 'Prepare and preprocess data', 'Design experiments', 'Evaluate model performance', 'Analyze results', 'Document findings', 'Ensure ethical considerations']
gemini-pro | ['Define research objectives', 'Gather and prepare data', 'Select and train model', 'Evaluate model performance', 'Interpret and communicate results']
claude-3-opus-20240229 | ['Define research questions and hypotheses', 'Select appropriate language model(s)', 'Collect and preprocess data', 'Fine-tune and evaluate models', 'Analyze results and draw conclusions', 'Document methodology and findings']
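We can also filter results with a logical expression on any of the columns. A small sketch (assuming we only want to inspect the response from a single model):
[ ]:
# Filter to the response generated by one model (sketch)
results.filter("model.model == 'gpt-4o'").select("model", "content").print(format="rich")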
To see a list of all components of results we can call the columns method:
[7]:
results.columns
[7]:
['agent.agent_instruction',
'agent.agent_name',
'answer.content',
'comment.content_comment',
'generated_tokens.content_generated_tokens',
'iteration.iteration',
'model.frequency_penalty',
'model.logprobs',
'model.maxOutputTokens',
'model.max_tokens',
'model.model',
'model.presence_penalty',
'model.stopSequences',
'model.temperature',
'model.topK',
'model.topP',
'model.top_logprobs',
'model.top_p',
'prompt.content_system_prompt',
'prompt.content_user_prompt',
'question_options.content_question_options',
'question_text.content_question_text',
'question_type.content_question_type',
'raw_model_response.content_cost',
'raw_model_response.content_one_usd_buys',
'raw_model_response.content_raw_model_response']
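The results can also be converted to a pandas DataFrame for further inspection, using a method that we return to below (a small sketch):
[ ]:
# Convert results to a pandas DataFrame; remove_prefix drops the prefixes shown above (e.g., "answer.")
df = results.to_pandas(remove_prefix=True)
df[["model", "content"]]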
Accessing results
EDSL comes with a variety of built-in methods for working with results. See details on methods. Here we extract components of the results that we’ll use to conduct our by-model review of the content. We create a Scenario object for each piece of content that we will add to a new question prompting a model to score it (a generalizable data labeling task). We also track the model that drafted the content:
[8]:
from edsl import ScenarioList
scenarios = results.select("model", "content").to_scenario_list().rename({"model":"drafting_model"})
[9]:
scenarios
[9]:
{
"scenarios": [
{
"drafting_model": "gpt-4o",
"content": [
"Define research objectives",
"Select the appropriate model",
"Prepare and preprocess data",
"Design experiments",
"Evaluate model performance",
"Analyze results",
"Document findings",
"Ensure ethical considerations"
]
},
{
"drafting_model": "gemini-pro",
"content": [
"Define research objectives",
"Gather and prepare data",
"Select and train model",
"Evaluate model performance",
"Interpret and communicate results"
]
},
{
"drafting_model": "claude-3-opus-20240229",
"content": [
"Define research questions and hypotheses",
"Select appropriate language model(s)",
"Collect and preprocess data",
"Fine-tune and evaluate models",
"Analyze results and draw conclusions",
"Document methodology and findings"
]
}
]
}
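Note that scenarios can also be constructed directly, which is useful when the content to be reviewed comes from a source other than prior results. A minimal sketch with placeholder values:
[ ]:
from edsl import Scenario, ScenarioList

# Construct a scenario list by hand (placeholder values for illustration)
manual_scenarios = ScenarioList([
    Scenario({"drafting_model": "example-model", "content": ["Step one", "Step two"]})
])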
Next we construct the scoring question to take parameters for the content and drafting_model:
[10]:
from edsl import QuestionLinearScale
q_score = QuestionLinearScale(
question_name = "score",
question_text = """Consider the following response to the question
'What are recommended steps for conducting research with large language models?'
Response: {{ content }}
Score this response in terms of accuracy and completeness.
(Drafting model: {{ drafting_model }})""",
question_options = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
option_labels = {0: "Terrible", 10: "Amazing"},
)
survey = Survey([q_score])
Finally, we add the scenarios and models to the survey and run it, generating a dataset of results that we can begin analyzing:
[11]:
results = survey.by(scenarios).by(models).run()
We can select components to inspect in a table:
[12]:
(
results.sort_by("drafting_model")
.sort_by("model")
.select("model", "drafting_model", "content", "score")
.print(
pretty_labels={
"model": "Critiquing model",
"drafting_model": "Drafing model",
"content": "Content",
"score": "Score",
},
format="rich",
)
)
model.model | scenario.drafting_model | scenario.content | answer.score
---|---|---|---
claude-3-opus-20240229 | claude-3-opus-20240229 | ['Define research questions and hypotheses', 'Select appropriate language model(s)', 'Collect and preprocess data', 'Fine-tune and evaluate models', 'Analyze results and draw conclusions', 'Document methodology and findings'] | 8
claude-3-opus-20240229 | gemini-pro | ['Define research objectives', 'Gather and prepare data', 'Select and train model', 'Evaluate model performance', 'Interpret and communicate results'] | 8
claude-3-opus-20240229 | gpt-4o | ['Define research objectives', 'Select the appropriate model', 'Prepare and preprocess data', 'Design experiments', 'Evaluate model performance', 'Analyze results', 'Document findings', 'Ensure ethical considerations'] | 8
gemini-pro | claude-3-opus-20240229 | ['Define research questions and hypotheses', 'Select appropriate language model(s)', 'Collect and preprocess data', 'Fine-tune and evaluate models', 'Analyze results and draw conclusions', 'Document methodology and findings'] | 9
gemini-pro | gemini-pro | ['Define research objectives', 'Gather and prepare data', 'Select and train model', 'Evaluate model performance', 'Interpret and communicate results'] | 8
gemini-pro | gpt-4o | ['Define research objectives', 'Select the appropriate model', 'Prepare and preprocess data', 'Design experiments', 'Evaluate model performance', 'Analyze results', 'Document findings', 'Ensure ethical considerations'] | 8
gpt-4o | claude-3-opus-20240229 | ['Define research questions and hypotheses', 'Select appropriate language model(s)', 'Collect and preprocess data', 'Fine-tune and evaluate models', 'Analyze results and draw conclusions', 'Document methodology and findings'] | 9
gpt-4o | gemini-pro | ['Define research objectives', 'Gather and prepare data', 'Select and train model', 'Evaluate model performance', 'Interpret and communicate results'] | 8
gpt-4o | gpt-4o | ['Define research objectives', 'Select the appropriate model', 'Prepare and preprocess data', 'Design experiments', 'Evaluate model performance', 'Analyze results', 'Document findings', 'Ensure ethical considerations'] | 8
Analyzing results as datasets
EDSL allows us to immediately begin analyzing model responses as datasets. Here we compare each model’s score of its own content versus its scores for other models’ content:
[13]:
import pandas as pd
import numpy as np
def compare(df):
    df_copy = df.copy()

    # Extract the models' self scores
    self_scores = df[df["model"] == df["drafting_model"]][["model", "score"]]
    self_scores = self_scores.rename(columns={"score": "self_score"}).drop_duplicates()

    # Merge the self scores
    df_copy = df_copy.merge(self_scores, on="model", how="left")

    # Compare the scores and self scores
    conditions = [
        df_copy["model"] == df_copy["drafting_model"],  # Self scoring
        df_copy["score"] < df_copy["self_score"],  # Score lower than self score
        df_copy["score"] > df_copy["self_score"],  # Score higher than self score
    ]
    choices = ["Self score", "Lower", "Higher"]
    df_copy["comparison"] = np.select(conditions, choices, default="Equal")

    return df_copy
[14]:
df = results.to_pandas(remove_prefix=True)
compare_df = compare(df)
compare_df[["model", "drafting_model", "score", "comparison"]].sort_values(
by=["model", "drafting_model"]
)
[14]:
 | model | drafting_model | score | comparison
---|---|---|---|---
8 | claude-3-opus-20240229 | claude-3-opus-20240229 | 8 | Self score
5 | claude-3-opus-20240229 | gemini-pro | 8 | Equal
2 | claude-3-opus-20240229 | gpt-4o | 8 | Equal
7 | gemini-pro | claude-3-opus-20240229 | 9 | Higher
4 | gemini-pro | gemini-pro | 8 | Self score
1 | gemini-pro | gpt-4o | 8 | Equal
6 | gpt-4o | claude-3-opus-20240229 | 9 | Higher
3 | gpt-4o | gemini-pro | 8 | Equal
0 | gpt-4o | gpt-4o | 8 | Self score
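Because the scores are already in a pandas DataFrame, we can also summarize them in other ways. For example, a quick pivot of the scores into a critiquing-model by drafting-model matrix (a small sketch using the same df):
[ ]:
# Rows: critiquing model, columns: drafting model, values: score
score_matrix = df.pivot_table(index="model", columns="drafting_model", values="score")
score_matrix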
[15]:
import pandas as pd
import numpy as np
def summarize(df):
    # Merge the self scores
    df_self_scores = (
        df[df["model"] == df["drafting_model"]]
        .set_index("model")["score"]
        .rename("self_score")
    )
    df = df.merge(df_self_scores, on="model", how="left")

    # Define the comparison logic
    conditions = [
        df["score"] > df["self_score"],
        df["score"] < df["self_score"],
        df["score"] == df["self_score"],
    ]
    choices = ["better_models", "worse_models", "equal"]
    df["category"] = np.select(conditions, choices)

    # Create a df to summarize better, worse, and equal models for each model
    summary_data = {"model": [], "better_models": [], "worse_models": [], "equal": []}

    for model in df["model"].unique():
        model_data = df[df["model"] == model]
        summary_data["model"].append(model)
        summary_data["better_models"].append(
            model_data[model_data["category"] == "better_models"]["drafting_model"].tolist()
        )
        summary_data["worse_models"].append(
            model_data[model_data["category"] == "worse_models"]["drafting_model"].tolist()
        )
        summary_data["equal"].append(
            model_data[model_data["category"] == "equal"]["drafting_model"].tolist()
        )

    # Convert the dictionary to a df
    summary_df = pd.DataFrame(summary_data)

    return summary_df
[16]:
df = results.to_pandas(remove_prefix=True)
summary_df = summarize(df)
summary_df
[16]:
 | model | better_models | worse_models | equal
---|---|---|---|---
0 | gpt-4o | [claude-3-opus-20240229] | [] | [gpt-4o, gemini-pro]
1 | gemini-pro | [claude-3-opus-20240229] | [] | [gpt-4o, gemini-pro]
2 | claude-3-opus-20240229 | [] | [] | [gpt-4o, gemini-pro, claude-3-opus-20240229]
Further analysis
This code is readily editable to compare results for other models and questions. It can also be expanded to compare responses among AI agents with different traits and personas that we prompt the models to adopt when answering the questions. Please see our docs for details on designing AI agents and using them to simulate responses for audiences of interest.
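For example, a minimal sketch of adding agents with different personas to the survey (the traits shown are placeholders):
[ ]:
from edsl import Agent, AgentList

# Construct agents with illustrative personas (placeholder traits)
agents = AgentList([
    Agent(traits={"persona": "A machine learning researcher"}),
    Agent(traits={"persona": "A social scientist new to language models"}),
])

# Run the survey for each combination of agent, scenario and model
# results = survey.by(scenarios).by(agents).by(models).run()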
Posting to the Coop
The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.
Here we demonstrate how to post this notebook:
[17]:
from edsl import Notebook
[18]:
n = Notebook(path = "comparing_model_responses.ipynb")
[19]:
n.push(description = "Example code for comparing model responses", visibility = "public")
[19]:
{'description': 'Example code for comparing model responses',
'object_type': 'notebook',
'url': 'https://www.expectedparrot.com/content/520e5573-30f1-4c7a-a0ae-f2098ae90abf',
'uuid': '520e5573-30f1-4c7a-a0ae-f2098ae90abf',
'version': '0.1.33.dev1',
'visibility': 'public'}
To update an object at the Coop:
[20]:
n = Notebook(path = "comparing_model_responses.ipynb") # resave
[21]:
n.patch(uuid = "520e5573-30f1-4c7a-a0ae-f2098ae90abf", value = n)
[21]:
{'status': 'success'}