Comparing model responses
This notebook provides sample EDSL code for comparing content created by different language models and examining how models rate their own content versus content created by other models.
In a series of steps we select some models, prompt them to generate content, prompt each model to evaluate every piece of content that was generated, and then analyze the results as datasets.
EDSL is an open-source library for simulating surveys, experiments and other research with AI agents and large language models. Before running the code below, please ensure that you have installed the EDSL library and either activated remote inference from your Coop account or stored API keys for the language models that you want to use with EDSL. Please also see our documentation page for tips and tutorials on getting started using EDSL.
Selecting language models
EDSL works with many popular language models. (Please send us a request for a model you like that’s missing!) We can check a current list of available models:
[1]:
from edsl import Model
# Model.available() # uncomment this code and run it to see a current list of available models
We select models to use by creating Model objects that we will add to our survey when we run it. If we do not specify a model, the default model is used. To check the current default model:
[2]:
# Model()
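A Model can also be created with optional parameters. Here is a minimal sketch (the temperature value is an arbitrary illustration; available parameters vary by model service):
[ ]:
from edsl import Model

# Create a model with a specific temperature setting (illustrative value)
m = Model("gpt-4o", temperature=0.5)
# m  # uncomment to inspect the model and its parameters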
Here we select several models and store them as a list in order to use them all together with our survey:
[3]:
from edsl import ModelList
models = ModelList(Model(m) for m in ["gpt-4o", "gemini-pro", "claude-3-opus-20240229"])
Generating content
EDSL comes with a variety of standard survey question types that we can select to use based on the desired format of the response (multiple choice, free text, etc.). See examples of all question types.
Here we use QuestionList in order to prompt a model to provide its response in the form of a list:
[4]:
from edsl import QuestionList
q_content = QuestionList(
question_name = "content",
question_text = "What are recommended steps for conducting research with large language models?",
)
We generate a response by passing the question to a Survey object, adding the models, and then calling the run method. This will generate a Results object with a Result for each survey response:
[5]:
from edsl import Survey
# Pass a list of one or more questions to be administered together in the survey
survey = Survey([q_content])
# Run the survey with the models
results = survey.by(models).run()
We can inspect components of the results individually:
[6]:
results.select("model", "content").print(format="rich")
model.model | answer.content
---|---
gpt-4o | ['Define research objectives', 'Select the appropriate model', 'Prepare and preprocess data', 'Design experiments', 'Evaluate model performance', 'Analyze results', 'Document findings', 'Ensure ethical considerations']
gemini-pro | ['Define research objectives', 'Gather and prepare data', 'Select and train model', 'Evaluate model performance', 'Interpret and communicate results']
claude-3-opus-20240229 | ['Define research questions and hypotheses', 'Select appropriate language model(s)', 'Collect and preprocess data', 'Fine-tune and evaluate models', 'Analyze results and draw conclusions', 'Document methodology and findings']
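We can also filter results with a logical expression on any of the columns. A small sketch (assuming we only want to inspect the response from a single model):
[ ]:
# Filter to the response generated by one model (sketch)
results.filter("model.model == 'gpt-4o'").select("model", "content").print(format="rich")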
To see a list of all components of results we can call the columns method:
[7]:
results.columns
[7]:
['agent.agent_instruction',
'agent.agent_name',
'answer.content',
'comment.content_comment',
'generated_tokens.content_generated_tokens',
'iteration.iteration',
'model.frequency_penalty',
'model.logprobs',
'model.maxOutputTokens',
'model.max_tokens',
'model.model',
'model.presence_penalty',
'model.stopSequences',
'model.temperature',
'model.topK',
'model.topP',
'model.top_logprobs',
'model.top_p',
'prompt.content_system_prompt',
'prompt.content_user_prompt',
'question_options.content_question_options',
'question_text.content_question_text',
'question_type.content_question_type',
'raw_model_response.content_cost',
'raw_model_response.content_one_usd_buys',
'raw_model_response.content_raw_model_response']
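The results can also be converted to a pandas DataFrame for further inspection, using a method that we return to below (a small sketch):
[ ]:
# Convert results to a pandas DataFrame; remove_prefix drops the prefixes shown above (e.g., "answer.")
df = results.to_pandas(remove_prefix=True)
df[["model", "content"]]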
Accessing results
EDSL comes with a variety of built-in methods for working with results. See details on methods. Here we extract components of the results that we’ll use to conduct our by-model review of the content. We create a Scenario object for each piece of content that we will add to a new question prompting a model to score it (a generalizable data labeling task). We also track the model that drafted the content:
[8]:
from edsl import ScenarioList
scenarios = results.select("model", "content").to_scenario_list().rename({"model":"drafting_model"})
[9]:
scenarios
[9]:
{
"scenarios": [
{
"drafting_model": "gpt-4o",
"content": [
"Define research objectives",
"Select the appropriate model",
"Prepare and preprocess data",
"Design experiments",
"Evaluate model performance",
"Analyze results",
"Document findings",
"Ensure ethical considerations"
]
},
{
"drafting_model": "gemini-pro",
"content": [
"Define research objectives",
"Gather and prepare data",
"Select and train model",
"Evaluate model performance",
"Interpret and communicate results"
]
},
{
"drafting_model": "claude-3-opus-20240229",
"content": [
"Define research questions and hypotheses",
"Select appropriate language model(s)",
"Collect and preprocess data",
"Fine-tune and evaluate models",
"Analyze results and draw conclusions",
"Document methodology and findings"
]
}
]
}
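Note that scenarios can also be constructed directly, which is useful when the content to be reviewed comes from a source other than prior results. A minimal sketch with placeholder values:
[ ]:
from edsl import Scenario, ScenarioList

# Construct a scenario list by hand (placeholder values for illustration)
manual_scenarios = ScenarioList([
    Scenario({"drafting_model": "example-model", "content": ["Step one", "Step two"]})
])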
Next we construct the scoring question to take parameters for the content and drafting_model:
[10]:
from edsl import QuestionLinearScale
q_score = QuestionLinearScale(
question_name = "score",
question_text = """Consider the following response to the question
'What are recommended steps for conducting research with large language models?'
Response: {{ content }}
Score this response in terms of accuracy and completeness.
(Drafting model: {{ drafting_model }})""",
question_options = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
option_labels = {0: "Terrible", 10: "Amazing"},
)
survey = Survey([q_score])
Finally, we add the scenarios and models to the survey and run it, generating a dataset of results that we can begin analyzing:
[11]:
results = survey.by(scenarios).by(models).run()
We can select components to inspect in a table:
[12]:
(
results.sort_by("drafting_model")
.sort_by("model")
.select("model", "drafting_model", "content", "score")
.print(
pretty_labels={
"model": "Critiquing model",
"drafting_model": "Drafing model",
"content": "Content",
"score": "Score",
},
format="rich",
)
)
model.model | scenario.drafting_model | scenario.content | answer.score
---|---|---|---
claude-3-opus-20240229 | claude-3-opus-20240229 | ['Define research questions and hypotheses', 'Select appropriate language model(s)', 'Collect and preprocess data', 'Fine-tune and evaluate models', 'Analyze results and draw conclusions', 'Document methodology and findings'] | 8
claude-3-opus-20240229 | gemini-pro | ['Define research objectives', 'Gather and prepare data', 'Select and train model', 'Evaluate model performance', 'Interpret and communicate results'] | 8
claude-3-opus-20240229 | gpt-4o | ['Define research objectives', 'Select the appropriate model', 'Prepare and preprocess data', 'Design experiments', 'Evaluate model performance', 'Analyze results', 'Document findings', 'Ensure ethical considerations'] | 8
gemini-pro | claude-3-opus-20240229 | ['Define research questions and hypotheses', 'Select appropriate language model(s)', 'Collect and preprocess data', 'Fine-tune and evaluate models', 'Analyze results and draw conclusions', 'Document methodology and findings'] | 9
gemini-pro | gemini-pro | ['Define research objectives', 'Gather and prepare data', 'Select and train model', 'Evaluate model performance', 'Interpret and communicate results'] | 8
gemini-pro | gpt-4o | ['Define research objectives', 'Select the appropriate model', 'Prepare and preprocess data', 'Design experiments', 'Evaluate model performance', 'Analyze results', 'Document findings', 'Ensure ethical considerations'] | 8
gpt-4o | claude-3-opus-20240229 | ['Define research questions and hypotheses', 'Select appropriate language model(s)', 'Collect and preprocess data', 'Fine-tune and evaluate models', 'Analyze results and draw conclusions', 'Document methodology and findings'] | 9
gpt-4o | gemini-pro | ['Define research objectives', 'Gather and prepare data', 'Select and train model', 'Evaluate model performance', 'Interpret and communicate results'] | 8
gpt-4o | gpt-4o | ['Define research objectives', 'Select the appropriate model', 'Prepare and preprocess data', 'Design experiments', 'Evaluate model performance', 'Analyze results', 'Document findings', 'Ensure ethical considerations'] | 8
Analyzing results as datasets
EDSL allows us to immediately begin analyzing model responses as datasets. Here we compare each model’s score of its own content versus its scores for other models’ content:
[13]:
import pandas as pd
import numpy as np
def compare(df):
    df_copy = df.copy()

    # Extract the models' self scores
    self_scores = df[df["model"] == df["drafting_model"]][["model", "score"]]
    self_scores = self_scores.rename(columns={"score": "self_score"}).drop_duplicates()

    # Merge the self scores
    df_copy = df_copy.merge(self_scores, on="model", how="left")

    # Compare the scores and self scores
    conditions = [
        df_copy["model"] == df_copy["drafting_model"],  # Self scoring
        df_copy["score"] < df_copy["self_score"],  # Score lower than self score
        df_copy["score"] > df_copy["self_score"],  # Score higher than self score
    ]
    choices = ["Self score", "Lower", "Higher"]
    df_copy["comparison"] = np.select(conditions, choices, default="Equal")

    return df_copy
[14]:
df = results.to_pandas(remove_prefix=True)
compare_df = compare(df)
compare_df[["model", "drafting_model", "score", "comparison"]].sort_values(
by=["model", "drafting_model"]
)
[14]:
 | model | drafting_model | score | comparison
---|---|---|---|---
8 | claude-3-opus-20240229 | claude-3-opus-20240229 | 8 | Self score
5 | claude-3-opus-20240229 | gemini-pro | 8 | Equal
2 | claude-3-opus-20240229 | gpt-4o | 8 | Equal
7 | gemini-pro | claude-3-opus-20240229 | 9 | Higher
4 | gemini-pro | gemini-pro | 8 | Self score
1 | gemini-pro | gpt-4o | 8 | Equal
6 | gpt-4o | claude-3-opus-20240229 | 9 | Higher
3 | gpt-4o | gemini-pro | 8 | Equal
0 | gpt-4o | gpt-4o | 8 | Self score
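Because the scores are already in a pandas DataFrame, we can also summarize them in other ways. For example, a quick pivot of the scores into a critiquing-model by drafting-model matrix (a small sketch using the same df):
[ ]:
# Rows: critiquing model, columns: drafting model, values: score
score_matrix = df.pivot_table(index="model", columns="drafting_model", values="score")
score_matrix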
[15]:
import pandas as pd
import numpy as np
def summarize(df):
    # Merge the self scores
    df_self_scores = (
        df[df["model"] == df["drafting_model"]]
        .set_index("model")["score"]
        .rename("self_score")
    )
    df = df.merge(df_self_scores, on="model", how="left")

    # Define the comparison logic
    conditions = [
        df["score"] > df["self_score"],
        df["score"] < df["self_score"],
        df["score"] == df["self_score"],
    ]
    choices = ["better_models", "worse_models", "equal"]
    df["category"] = np.select(conditions, choices)

    # Create a df to summarize better, worse, and equal models for each model
    summary_data = {"model": [], "better_models": [], "worse_models": [], "equal": []}

    for model in df["model"].unique():
        model_data = df[df["model"] == model]
        summary_data["model"].append(model)
        summary_data["better_models"].append(
            model_data[model_data["category"] == "better_models"]["drafting_model"].tolist()
        )
        summary_data["worse_models"].append(
            model_data[model_data["category"] == "worse_models"]["drafting_model"].tolist()
        )
        summary_data["equal"].append(
            model_data[model_data["category"] == "equal"]["drafting_model"].tolist()
        )

    # Convert the dictionary to a df
    summary_df = pd.DataFrame(summary_data)

    return summary_df
[16]:
df = results.to_pandas(remove_prefix=True)
summary_df = summarize(df)
summary_df
[16]:
 | model | better_models | worse_models | equal
---|---|---|---|---
0 | gpt-4o | [claude-3-opus-20240229] | [] | [gpt-4o, gemini-pro]
1 | gemini-pro | [claude-3-opus-20240229] | [] | [gpt-4o, gemini-pro]
2 | claude-3-opus-20240229 | [] | [] | [gpt-4o, gemini-pro, claude-3-opus-20240229]
Further analysis
This code is readily editable to compare results for other models and questions. It can also be expanded to compare responses among AI agents with different traits and personas that we prompt the models to adopt when answering the questions. Please see our docs for details on designing AI agents and using them to simulate responses for audiences of interest.
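For example, a minimal sketch of adding agents with different personas to the survey (the traits shown are placeholders):
[ ]:
from edsl import Agent, AgentList

# Construct agents with illustrative personas (placeholder traits)
agents = AgentList([
    Agent(traits={"persona": "A machine learning researcher"}),
    Agent(traits={"persona": "A social scientist new to language models"}),
])

# Run the survey for each combination of agent, scenario and model
# results = survey.by(scenarios).by(agents).by(models).run()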
Posting to the Coop
The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.
Here we demonstrate how to post this notebook:
[17]:
from edsl import Notebook
[18]:
n = Notebook(path = "comparing_model_responses.ipynb")
[19]:
n.push(description = "Example code for comparing model responses", visibility = "public")
[19]:
{'description': 'Example code for comparing model responses',
'object_type': 'notebook',
'url': 'https://www.expectedparrot.com/content/520e5573-30f1-4c7a-a0ae-f2098ae90abf',
'uuid': '520e5573-30f1-4c7a-a0ae-f2098ae90abf',
'version': '0.1.33.dev1',
'visibility': 'public'}
To update an object at the Coop:
[20]:
n = Notebook(path = "comparing_model_responses.ipynb") # resave
[21]:
n.patch(uuid = "520e5573-30f1-4c7a-a0ae-f2098ae90abf", value = n)
[21]:
{'status': 'success'}