Comparing model responses

This notebook provides sample EDSL code for comparing content created by different language models and examining how models rate their own content versus content created by other models.

In a series of steps, we select several models, prompt them to generate some content, prompt each model to evaluate every piece of content that was generated, and then analyze the results as datasets.

Before running this notebook, please see details on installing EDSL and getting started with the library.

[1]:
# ! pip install edsl

Selecting language models

EDSL works with many popular models. (Please send us a request if a model that you'd like to use is missing!) We can see a current list of the available models:

[2]:
from edsl import Model

Model.available()
[2]:
[['01-ai/Yi-34B-Chat', 'deep_infra', 0],
 ['Austism/chronos-hermes-13b-v2', 'deep_infra', 1],
 ['Gryphe/MythoMax-L2-13b', 'deep_infra', 2],
 ['Gryphe/MythoMax-L2-13b-turbo', 'deep_infra', 3],
 ['HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1', 'deep_infra', 4],
 ['Phind/Phind-CodeLlama-34B-v2', 'deep_infra', 5],
 ['Qwen/Qwen2-72B-Instruct', 'deep_infra', 6],
 ['Qwen/Qwen2-7B-Instruct', 'deep_infra', 7],
 ['Sao10K/L3-70B-Euryale-v2.1', 'deep_infra', 8],
 ['bigcode/starcoder2-15b', 'deep_infra', 9],
 ['bigcode/starcoder2-15b-instruct-v0.1', 'deep_infra', 10],
 ['claude-3-5-sonnet-20240620', 'anthropic', 11],
 ['claude-3-haiku-20240307', 'anthropic', 12],
 ['claude-3-opus-20240229', 'anthropic', 13],
 ['claude-3-sonnet-20240229', 'anthropic', 14],
 ['codellama/CodeLlama-34b-Instruct-hf', 'deep_infra', 15],
 ['codellama/CodeLlama-70b-Instruct-hf', 'deep_infra', 16],
 ['cognitivecomputations/dolphin-2.6-mixtral-8x7b', 'deep_infra', 17],
 ['cognitivecomputations/dolphin-2.9.1-llama-3-70b', 'deep_infra', 18],
 ['databricks/dbrx-instruct', 'deep_infra', 19],
 ['deepinfra/airoboros-70b', 'deep_infra', 20],
 ['gemini-pro', 'google', 21],
 ['google/codegemma-7b-it', 'deep_infra', 22],
 ['google/gemma-1.1-7b-it', 'deep_infra', 23],
 ['google/gemma-2-27b-it', 'deep_infra', 24],
 ['google/gemma-2-9b-it', 'deep_infra', 25],
 ['gpt-3.5-turbo', 'openai', 26],
 ['gpt-3.5-turbo-0125', 'openai', 27],
 ['gpt-3.5-turbo-0301', 'openai', 28],
 ['gpt-3.5-turbo-0613', 'openai', 29],
 ['gpt-3.5-turbo-1106', 'openai', 30],
 ['gpt-3.5-turbo-16k', 'openai', 31],
 ['gpt-3.5-turbo-16k-0613', 'openai', 32],
 ['gpt-3.5-turbo-instruct', 'openai', 33],
 ['gpt-3.5-turbo-instruct-0914', 'openai', 34],
 ['gpt-4', 'openai', 35],
 ['gpt-4-0125-preview', 'openai', 36],
 ['gpt-4-0613', 'openai', 37],
 ['gpt-4-1106-preview', 'openai', 38],
 ['gpt-4-1106-vision-preview', 'openai', 39],
 ['gpt-4-turbo', 'openai', 40],
 ['gpt-4-turbo-2024-04-09', 'openai', 41],
 ['gpt-4-turbo-preview', 'openai', 42],
 ['gpt-4-vision-preview', 'openai', 43],
 ['gpt-4o', 'openai', 44],
 ['gpt-4o-2024-05-13', 'openai', 45],
 ['gpt-4o-mini', 'openai', 46],
 ['gpt-4o-mini-2024-07-18', 'openai', 47],
 ['lizpreciatior/lzlv_70b_fp16_hf', 'deep_infra', 48],
 ['llava-hf/llava-1.5-7b-hf', 'deep_infra', 49],
 ['meta-llama/Llama-2-13b-chat-hf', 'deep_infra', 50],
 ['meta-llama/Llama-2-70b-chat-hf', 'deep_infra', 51],
 ['meta-llama/Llama-2-7b-chat-hf', 'deep_infra', 52],
 ['meta-llama/Meta-Llama-3-70B-Instruct', 'deep_infra', 53],
 ['meta-llama/Meta-Llama-3-8B-Instruct', 'deep_infra', 54],
 ['meta-llama/Meta-Llama-3.1-405B-Instruct', 'deep_infra', 55],
 ['meta-llama/Meta-Llama-3.1-70B-Instruct', 'deep_infra', 56],
 ['meta-llama/Meta-Llama-3.1-8B-Instruct', 'deep_infra', 57],
 ['microsoft/Phi-3-medium-4k-instruct', 'deep_infra', 58],
 ['microsoft/WizardLM-2-7B', 'deep_infra', 59],
 ['microsoft/WizardLM-2-8x22B', 'deep_infra', 60],
 ['mistralai/Mistral-7B-Instruct-v0.1', 'deep_infra', 61],
 ['mistralai/Mistral-7B-Instruct-v0.2', 'deep_infra', 62],
 ['mistralai/Mistral-7B-Instruct-v0.3', 'deep_infra', 63],
 ['mistralai/Mixtral-8x22B-Instruct-v0.1', 'deep_infra', 64],
 ['mistralai/Mixtral-8x22B-v0.1', 'deep_infra', 65],
 ['mistralai/Mixtral-8x7B-Instruct-v0.1', 'deep_infra', 66],
 ['nvidia/Nemotron-4-340B-Instruct', 'deep_infra', 67],
 ['openchat/openchat-3.6-8b', 'deep_infra', 68],
 ['openchat/openchat_3.5', 'deep_infra', 69]]
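
For example, we can filter this list by inference service with plain Python (a quick sketch, relying on the [model_name, service, index] triples shown above):

[ ]:
# Keep only the OpenAI-hosted models from the list above
openai_models = [name for name, service, _ in Model.available() if service == "openai"]
openai_models[:5]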

We select models to use by creating Model objects that we will add to our survey when we run it. (If we do not specify a model, GPT-4 preview is used by default.) Here we select four models and store them in a ModelList in order to use them all together with our survey:

[3]:
from edsl import ModelList

models = ModelList(
    Model(m)
    for m in (
        "gpt-3.5-turbo",
        "gpt-4-1106-preview",
        "gemini-pro",
        "claude-3-opus-20240229",
    )
)
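
As a quick sanity check, we can list the names of the selected models (a minimal sketch; the .model attribute holds the model name, as seen in the results columns later on):

[ ]:
# Confirm the names of the selected models
[m.model for m in models]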

Generating content

EDSL comes with a variety of standard survey question types (multiple choice, free text, etc.) that we can select based on the desired format of the response (e.g., a selection from a list of options, unstructured text, etc.). See examples of all question types. Here we use QuestionList to prompt the model to provide its response in the form of a list (a free-text variant is sketched below for comparison):

[4]:
from edsl.questions import QuestionList

q_content = QuestionList(
    question_name="content",
    question_text="What are recommended steps for conducting research with large language models?",
)
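
For comparison, the same prompt can be posed as an unstructured free-text question. This is a sketch only (the hypothetical q_content_text is not used in the steps below):

[ ]:
from edsl.questions import QuestionFreeText

# The same prompt as a free-text question -- returns unstructured text instead of a list
q_content_text = QuestionFreeText(
    question_name="content_text",
    question_text="What are recommended steps for conducting research with large language models?",
)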

We generate responses by passing the question to a Survey object, adding the models, and calling the run method. This produces a Results object with a Result for each survey response:

[5]:
from edsl import Survey

# Pass a list of one or more questions to be administered together in the survey
survey = Survey([q_content])

# Run the survey with the models
results = survey.by(models).run()
Attempt 1 failed with exception:Answer key 'answer' must be of type list;
                (got None) which is of type <class 'NoneType'>. now waiting 1.00 seconds before retrying.Parameters: start=1.0, max=60.0, max_attempts=5.


Attempt 2 failed with exception:Answer key 'answer' must be of type list;
                (got null) which is of type <class 'str'>. now waiting 2.00 seconds before retrying.Parameters: start=1.0, max=60.0, max_attempts=5.


Attempt 1 failed with exception:Answer must have an 'answer' key (got {'new_answer': None, 'new_comment': 'null', 'cache_used': False, 'cache_key': '2ad0b1e3f0b08ad239de86f824772964', 'usage': {'completion_tokens': 13, 'prompt_tokens': 530, 'total_tokens': 543}, 'raw_model_response': {'id': 'chatcmpl-9pmAVRM8TOSs64i96CXwMLV9cq0BT', 'choices': [{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'message': {'content': '{"new_answer": null, "new_comment": "null"}', 'role': 'assistant', 'function_call': None, 'tool_calls': None}}], 'created': 1722126915, 'model': 'gpt-4-1106-preview', 'object': 'chat.completion', 'service_tier': None, 'system_fingerprint': None, 'usage': {'completion_tokens': 13, 'prompt_tokens': 530, 'total_tokens': 543}}}). now waiting 1.00 seconds before retrying.Parameters: start=1.0, max=60.0, max_attempts=5.


Exceptions were raised in 1 out of 4 interviews.

"
Also see: https://docs.expectedparrot.com/en/latest/exceptions.html

We can inspect components of the results individually:

[6]:
results.select("model", "content").print(format="rich")
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ model                   answer                                                                                 ┃
┃ .model                  .content                                                                               ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ gpt-3.5-turbo           ['Define research question', 'Collect and prepare data', 'Fine-tune the language       │
│                         model', 'Analyze results', 'Iterate and refine']                                       │
├────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ gemini-pro              ['Define research goals and objectives', 'Select an appropriate LLM and dataset',      │
│                         'Design and implement research methodology', 'Collect and analyze data', 'Interpret    │
│                         results and draw conclusions']                                                         │
├────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ gpt-4-1106-preview      ['Define research objectives', 'Choose a suitable large language model', 'Determine    │
│                         the scope of research', 'Develop a set of hypotheses or research questions', 'Design   │
│                         the research methodology', 'Collect data', 'Preprocess and clean the data', 'Fine-tune │
│                         the language model if necessary', 'Run experiments or simulations', 'Analyze the       │
│                         results', 'Interpret the findings', 'Consider the ethical implications', 'Document the │
│                         research process', 'Write the research paper or report', 'Peer review and validate     │
│                         findings', 'Publish or share the research outcomes']                                   │
├────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ claude-3-opus-20240229  ['Define research questions and hypotheses', 'Select appropriate language model(s)',   │
│                         'Collect high-quality training data', 'Fine-tune model on domain-specific data',       │
│                         'Develop evaluation metrics', 'Test model performance', 'Analyze results', 'Document   │
│                         methodology', 'Share findings']                                                        │
└────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────┘

To see a list of all components of the results we can inspect the columns attribute:

[7]:
results.columns
[7]:
['agent.agent_instruction',
 'agent.agent_name',
 'answer.content',
 'comment.content_comment',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.maxOutputTokens',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.stopSequences',
 'model.temperature',
 'model.topK',
 'model.topP',
 'model.top_logprobs',
 'model.top_p',
 'prompt.content_system_prompt',
 'prompt.content_user_prompt',
 'question_options.content_question_options',
 'question_text.content_question_text',
 'question_type.content_question_type',
 'raw_model_response.content_raw_model_response']
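
Individual columns can also be pulled out directly, e.g. (a sketch using the to_list method on selected results):

[ ]:
# Extract just the generated lists, one per model
results.select("answer.content").to_list()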

Accessing results

EDSL comes with a variety of built-in methods for working with results; see details on methods. For example, we can turn the results into a pandas dataframe:

[8]:
df = results.to_pandas(remove_prefix=True)  # We can drop the column prefixes if we want
df
[8]:
agent_instruction agent_name content content_comment content_question_options content_question_text content_question_type content_raw_model_response content_system_prompt content_user_prompt ... maxOutputTokens max_tokens model presence_penalty stopSequences temperature topK topP top_logprobs top_p
0 You are answering questions as if you were a h... Agent_3 ['Define research question', 'Collect and prep... These steps are essential for conducting resea... NaN What are recommended steps for conducting rese... list {'id': 'chatcmpl-9pm9wjtFRxYSmxOccXg6HkGPn7M7D... You are answering questions as if you were a h... What are recommended steps for conducting rese... ... NaN 1000.0 gpt-3.5-turbo 0.0 NaN 0.5 NaN NaN 3.0 1.0
1 You are answering questions as if you were a h... Agent_3 ['Define research goals and objectives', 'Sele... These steps provide a general framework for co... NaN What are recommended steps for conducting rese... list {'candidates': [{'content': {'parts': [{'text'... You are answering questions as if you were a h... What are recommended steps for conducting rese... ... 2048.0 NaN gemini-pro NaN [] 0.5 1.0 1.0 NaN NaN
2 You are answering questions as if you were a h... Agent_3 ['Define research objectives', 'Choose a suita... This is a generalized list of steps for conduc... NaN What are recommended steps for conducting rese... list {'id': 'chatcmpl-9pm9wbbfDpZSilitWf0GX8LebCTOV... You are answering questions as if you were a h... What are recommended steps for conducting rese... ... NaN 1000.0 gpt-4-1106-preview 0.0 NaN 0.5 NaN NaN 3.0 1.0
3 You are answering questions as if you were a h... Agent_3 ['Define research questions and hypotheses', '... Conducting research with large language models... NaN What are recommended steps for conducting rese... list {'id': 'msg_01EWQkoSo1oj5qQV7fCz8qot', 'conten... You are answering questions as if you were a h... What are recommended steps for conducting rese... ... NaN 1000.0 claude-3-opus-20240229 0.0 NaN 0.5 NaN NaN 3.0 1.0

4 rows × 23 columns

Here we extract components of the results that we’ll use to conduct our by-model review of the content:

[9]:
content_dict = dict(zip(df["model"], df["content"]))
content_dict
[9]:
{'gpt-3.5-turbo': "['Define research question', 'Collect and prepare data', 'Fine-tune the language model', 'Analyze results', 'Iterate and refine']",
 'gemini-pro': "['Define research goals and objectives', 'Select an appropriate LLM and dataset', 'Design and implement research methodology', 'Collect and analyze data', 'Interpret results and draw conclusions']",
 'gpt-4-1106-preview': "['Define research objectives', 'Choose a suitable large language model', 'Determine the scope of research', 'Develop a set of hypotheses or research questions', 'Design the research methodology', 'Collect data', 'Preprocess and clean the data', 'Fine-tune the language model if necessary', 'Run experiments or simulations', 'Analyze the results', 'Interpret the findings', 'Consider the ethical implications', 'Document the research process', 'Write the research paper or report', 'Peer review and validate findings', 'Publish or share the research outcomes']",
 'claude-3-opus-20240229': "['Define research questions and hypotheses', 'Select appropriate language model(s)', 'Collect high-quality training data', 'Fine-tune model on domain-specific data', 'Develop evaluation metrics', 'Test model performance', 'Analyze results', 'Document methodology', 'Share findings']"}
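
Note that the dataframe stores each list answer as its string representation. These strings are fine for our scenario texts below, but if actual Python lists are needed they can be recovered with the standard library (a minimal sketch):

[ ]:
import ast

# Parse the stringified list answers back into Python lists
parsed_content = {m: ast.literal_eval(c) for m, c in content_dict.items()}
parsed_content["gpt-3.5-turbo"]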

Conducting a review

Next we create a new question (using another appropriate question type) to have the models evaluate each piece of content that was generated. We do this by parameterizing the question with different “scenarios” of the content to be evaluated:

[10]:
from edsl.questions import QuestionLinearScale

q_score = QuestionLinearScale(
    question_name="score",
    question_text="""Consider the following response to the question
    'What are recommended steps for conducting research with large language models?'
    Response: {{ content }}
    Score this response in terms of accuracy and completeness.
    (Drafting model: {{ drafting_model }})""",
    question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    option_labels={0: "Terrible", 10: "Amazing"},
)

survey = Survey([q_score])

We create a Scenario object for each piece of content, to be added to the question when we run it (a generalizable data-labeling pattern). We also record the model that drafted each piece of content for use in analyzing the results:

[11]:
from edsl import ScenarioList, Scenario

scenarios = ScenarioList(
    Scenario({"drafting_model": m, "content": c}) for m, c in content_dict.items()
)
scenarios
[11]:
{
    "scenarios": [
        {
            "drafting_model": "gpt-3.5-turbo",
            "content": "['Define research question', 'Collect and prepare data', 'Fine-tune the language model', 'Analyze results', 'Iterate and refine']"
        },
        {
            "drafting_model": "gemini-pro",
            "content": "['Define research goals and objectives', 'Select an appropriate LLM and dataset', 'Design and implement research methodology', 'Collect and analyze data', 'Interpret results and draw conclusions']"
        },
        {
            "drafting_model": "gpt-4-1106-preview",
            "content": "['Define research objectives', 'Choose a suitable large language model', 'Determine the scope of research', 'Develop a set of hypotheses or research questions', 'Design the research methodology', 'Collect data', 'Preprocess and clean the data', 'Fine-tune the language model if necessary', 'Run experiments or simulations', 'Analyze the results', 'Interpret the findings', 'Consider the ethical implications', 'Document the research process', 'Write the research paper or report', 'Peer review and validate findings', 'Publish or share the research outcomes']"
        },
        {
            "drafting_model": "claude-3-opus-20240229",
            "content": "['Define research questions and hypotheses', 'Select appropriate language model(s)', 'Collect high-quality training data', 'Fine-tune model on domain-specific data', 'Develop evaluation metrics', 'Test model performance', 'Analyze results', 'Document methodology', 'Share findings']"
        }
    ]
}

Finally, we add the scenarios and models to the survey and run it, generating a dataset of results that we can begin analyzing:

[12]:
results = survey.by(scenarios).by(models).run()

We can select components to inspect in a table:

[13]:
(
    results.sort_by("drafting_model")
    .sort_by("model")
    .select("model", "drafting_model", "content", "score")
    .print(
        pretty_labels={
            "model": "Critiquing model",
            "drafting_model": "Drafing model",
            "content": "Content",
            "score": "Score",
        },
        format="rich",
    )
)
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ model                   scenario                scenario                                              answer ┃
┃ .model                  .drafting_model         .content                                              .score ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ claude-3-opus-20240229  claude-3-opus-20240229  ['Define research questions and hypotheses', 'Select  9      │
│                                                 appropriate language model(s)', 'Collect                     │
│                                                 high-quality training data', 'Fine-tune model on             │
│                                                 domain-specific data', 'Develop evaluation metrics',         │
│                                                 'Test model performance', 'Analyze results',                 │
│                                                 'Document methodology', 'Share findings']                    │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ claude-3-opus-20240229  gemini-pro              ['Define research goals and objectives', 'Select an   8      │
│                                                 appropriate LLM and dataset', 'Design and implement          │
│                                                 research methodology', 'Collect and analyze data',           │
│                                                 'Interpret results and draw conclusions']                    │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ claude-3-opus-20240229  gpt-3.5-turbo           ['Define research question', 'Collect and prepare     8      │
│                                                 data', 'Fine-tune the language model', 'Analyze              │
│                                                 results', 'Iterate and refine']                              │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ claude-3-opus-20240229  gpt-4-1106-preview      ['Define research objectives', 'Choose a suitable     9      │
│                                                 large language model', 'Determine the scope of               │
│                                                 research', 'Develop a set of hypotheses or research          │
│                                                 questions', 'Design the research methodology',               │
│                                                 'Collect data', 'Preprocess and clean the data',             │
│                                                 'Fine-tune the language model if necessary', 'Run            │
│                                                 experiments or simulations', 'Analyze the results',          │
│                                                 'Interpret the findings', 'Consider the ethical              │
│                                                 implications', 'Document the research process',              │
│                                                 'Write the research paper or report', 'Peer review           │
│                                                 and validate findings', 'Publish or share the                │
│                                                 research outcomes']                                          │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gemini-pro              claude-3-opus-20240229  ['Define research questions and hypotheses', 'Select  9      │
│                                                 appropriate language model(s)', 'Collect                     │
│                                                 high-quality training data', 'Fine-tune model on             │
│                                                 domain-specific data', 'Develop evaluation metrics',         │
│                                                 'Test model performance', 'Analyze results',                 │
│                                                 'Document methodology', 'Share findings']                    │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gemini-pro              gemini-pro              ['Define research goals and objectives', 'Select an   2      │
│                                                 appropriate LLM and dataset', 'Design and implement          │
│                                                 research methodology', 'Collect and analyze data',           │
│                                                 'Interpret results and draw conclusions']                    │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gemini-pro              gpt-3.5-turbo           ['Define research question', 'Collect and prepare     4      │
│                                                 data', 'Fine-tune the language model', 'Analyze              │
│                                                 results', 'Iterate and refine']                              │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gemini-pro              gpt-4-1106-preview      ['Define research objectives', 'Choose a suitable     10     │
│                                                 large language model', 'Determine the scope of               │
│                                                 research', 'Develop a set of hypotheses or research          │
│                                                 questions', 'Design the research methodology',               │
│                                                 'Collect data', 'Preprocess and clean the data',             │
│                                                 'Fine-tune the language model if necessary', 'Run            │
│                                                 experiments or simulations', 'Analyze the results',          │
│                                                 'Interpret the findings', 'Consider the ethical              │
│                                                 implications', 'Document the research process',              │
│                                                 'Write the research paper or report', 'Peer review           │
│                                                 and validate findings', 'Publish or share the                │
│                                                 research outcomes']                                          │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gpt-3.5-turbo           claude-3-opus-20240229  ['Define research questions and hypotheses', 'Select  8      │
│                                                 appropriate language model(s)', 'Collect                     │
│                                                 high-quality training data', 'Fine-tune model on             │
│                                                 domain-specific data', 'Develop evaluation metrics',         │
│                                                 'Test model performance', 'Analyze results',                 │
│                                                 'Document methodology', 'Share findings']                    │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gpt-3.5-turbo           gemini-pro              ['Define research goals and objectives', 'Select an   8      │
│                                                 appropriate LLM and dataset', 'Design and implement          │
│                                                 research methodology', 'Collect and analyze data',           │
│                                                 'Interpret results and draw conclusions']                    │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gpt-3.5-turbo           gpt-3.5-turbo           ['Define research question', 'Collect and prepare     8      │
│                                                 data', 'Fine-tune the language model', 'Analyze              │
│                                                 results', 'Iterate and refine']                              │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gpt-3.5-turbo           gpt-4-1106-preview      ['Define research objectives', 'Choose a suitable     9      │
│                                                 large language model', 'Determine the scope of               │
│                                                 research', 'Develop a set of hypotheses or research          │
│                                                 questions', 'Design the research methodology',               │
│                                                 'Collect data', 'Preprocess and clean the data',             │
│                                                 'Fine-tune the language model if necessary', 'Run            │
│                                                 experiments or simulations', 'Analyze the results',          │
│                                                 'Interpret the findings', 'Consider the ethical              │
│                                                 implications', 'Document the research process',              │
│                                                 'Write the research paper or report', 'Peer review           │
│                                                 and validate findings', 'Publish or share the                │
│                                                 research outcomes']                                          │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gpt-4-1106-preview      claude-3-opus-20240229  ['Define research questions and hypotheses', 'Select  8      │
│                                                 appropriate language model(s)', 'Collect                     │
│                                                 high-quality training data', 'Fine-tune model on             │
│                                                 domain-specific data', 'Develop evaluation metrics',         │
│                                                 'Test model performance', 'Analyze results',                 │
│                                                 'Document methodology', 'Share findings']                    │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gpt-4-1106-preview      gemini-pro              ['Define research goals and objectives', 'Select an   8      │
│                                                 appropriate LLM and dataset', 'Design and implement          │
│                                                 research methodology', 'Collect and analyze data',           │
│                                                 'Interpret results and draw conclusions']                    │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gpt-4-1106-preview      gpt-3.5-turbo           ['Define research question', 'Collect and prepare     7      │
│                                                 data', 'Fine-tune the language model', 'Analyze              │
│                                                 results', 'Iterate and refine']                              │
├────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┼────────┤
│ gpt-4-1106-preview      gpt-4-1106-preview      ['Define research objectives', 'Choose a suitable     9      │
│                                                 large language model', 'Determine the scope of               │
│                                                 research', 'Develop a set of hypotheses or research          │
│                                                 questions', 'Design the research methodology',               │
│                                                 'Collect data', 'Preprocess and clean the data',             │
│                                                 'Fine-tune the language model if necessary', 'Run            │
│                                                 experiments or simulations', 'Analyze the results',          │
│                                                 'Interpret the findings', 'Consider the ethical              │
│                                                 implications', 'Document the research process',              │
│                                                 'Write the research paper or report', 'Peer review           │
│                                                 and validate findings', 'Publish or share the                │
│                                                 research outcomes']                                          │
└────────────────────────┴────────────────────────┴──────────────────────────────────────────────────────┴────────┘

Analyzing results as datasets

EDSL allows us to immediately begin analyzing model responses as datasets. Here we compare each model’s score of its own content versus its scores for other models’ content:

[14]:
import numpy as np


def compare(df):
    df_copy = df.copy()

    # Extract the models' self scores
    self_scores = df[df["model"] == df["drafting_model"]][["model", "score"]]
    self_scores = self_scores.rename(columns={"score": "self_score"}).drop_duplicates()

    # Merge the self scores
    df_copy = df_copy.merge(self_scores, on="model", how="left")

    # Compare the scores and self scores
    conditions = [
        df_copy["model"] == df_copy["drafting_model"],  # Self scoring
        df_copy["score"] < df_copy["self_score"],  # Score lower than self score
        df_copy["score"] > df_copy["self_score"],  # Score higher than self score
    ]
    choices = ["Self score", "Lower", "Higher"]

    df_copy["comparison"] = np.select(conditions, choices, default="Equal")

    return df_copy
[15]:
df = results.to_pandas(remove_prefix=True)
compare_df = compare(df)

compare_df[["model", "drafting_model", "score", "comparison"]].sort_values(
    by=["model", "drafting_model"]
)
[15]:
model drafting_model score comparison
15 claude-3-opus-20240229 claude-3-opus-20240229 9 Self score
14 claude-3-opus-20240229 gemini-pro 8 Lower
13 claude-3-opus-20240229 gpt-3.5-turbo 8 Lower
12 claude-3-opus-20240229 gpt-4-1106-preview 9 Equal
7 gemini-pro claude-3-opus-20240229 9 Higher
4 gemini-pro gemini-pro 2 Self score
6 gemini-pro gpt-3.5-turbo 4 Higher
5 gemini-pro gpt-4-1106-preview 10 Higher
1 gpt-3.5-turbo claude-3-opus-20240229 8 Equal
0 gpt-3.5-turbo gemini-pro 8 Equal
3 gpt-3.5-turbo gpt-3.5-turbo 8 Self score
2 gpt-3.5-turbo gpt-4-1106-preview 9 Higher
10 gpt-4-1106-preview claude-3-opus-20240229 8 Lower
11 gpt-4-1106-preview gemini-pro 8 Lower
9 gpt-4-1106-preview gpt-3.5-turbo 7 Lower
8 gpt-4-1106-preview gpt-4-1106-preview 9 Self score
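
A quick pandas aggregation also shows the average score that each drafting model received across all of the critiquing models (a sketch using the compare_df built above; it assumes the score column is numeric):

[ ]:
# Mean score received by each drafting model, across all critiquing models
compare_df.groupby("drafting_model")["score"].mean().sort_values(ascending=False)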
[16]:
import pandas as pd
import numpy as np


def summarize(df):
    # Merge the self scores
    df_self_scores = (
        df[df["model"] == df["drafting_model"]]
        .set_index("model")["score"]
        .rename("self_score")
    )
    df = df.merge(df_self_scores, on="model", how="left")

    # Define the comparison logic
    conditions = [
        df["score"] > df["self_score"],
        df["score"] < df["self_score"],
        df["score"] == df["self_score"],
    ]
    choices = ["better_models", "worse_models", "equal"]
    df["category"] = np.select(conditions, choices)

    # Create a df to summarize better, worse, and equal models for each model
    summary_data = {"model": [], "better_models": [], "worse_models": [], "equal": []}

    for model in df["model"].unique():
        model_data = df[df["model"] == model]
        summary_data["model"].append(model)
        summary_data["better_models"].append(
            model_data[model_data["category"] == "better_models"][
                "drafting_model"
            ].tolist()
        )
        summary_data["worse_models"].append(
            model_data[model_data["category"] == "worse_models"][
                "drafting_model"
            ].tolist()
        )
        summary_data["equal"].append(
            model_data[model_data["category"] == "equal"]["drafting_model"].tolist()
        )

    # Convert the dictionary to a df
    summary_df = pd.DataFrame(summary_data)
    return summary_df
[17]:
df = results.to_pandas(remove_prefix=True)
summary_df = summarize(df)

summary_df
[17]:
model better_models worse_models equal
0 gpt-3.5-turbo [gpt-4-1106-preview] [] [gemini-pro, claude-3-opus-20240229, gpt-3.5-t...
1 gemini-pro [gpt-4-1106-preview, gpt-3.5-turbo, claude-3-o... [] [gemini-pro]
2 gpt-4-1106-preview [] [gpt-3.5-turbo, claude-3-opus-20240229, gemini... [gpt-4-1106-preview]
3 claude-3-opus-20240229 [] [gpt-3.5-turbo, gemini-pro] [gpt-4-1106-preview, claude-3-opus-20240229]

Further analysis

This code is readily adaptable for comparing results for other models and questions. It can also be extended to compare responses among AI agents with different traits and personas that we prompt the models to adopt in answering the questions (a minimal sketch follows). Please see our docs for details on designing AI agents and using them to simulate responses for audiences of interest.
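
For example, here is a sketch of adding agents to the review (the personas are hypothetical; see the EDSL docs on agents for details):

[ ]:
from edsl import Agent, AgentList

# Hypothetical personas for illustration
agents = AgentList(
    Agent(traits={"persona": p})
    for p in ("research scientist", "journalist covering AI")
)

# Agents are added to a survey the same way as scenarios and models:
# results = survey.by(scenarios).by(agents).by(models).run()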