Cognitive testing & LLM biases

This notebook provides example code for using EDSL to investigate biases of large language models.

EDSL is an open-source library for simulating surveys, experiments and other research with AI agents and large language models. Before running the code below, please ensure that you have installed the EDSL library and either activated remote inference from your Coop account or stored API keys for the language models that you want to use with EDSL. Please also see our documentation page for tips and tutorials on getting started using EDSL.
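If you are running locally with your own keys, one common pattern is to expose them as environment variables before importing EDSL. The variable names below follow the usual provider conventions and are assumptions here; check the EDSL documentation for the exact names it expects:

```python
import os

# Placeholder values for illustration only -- substitute your real keys.
# These variable names follow common provider conventions; confirm the
# exact names against the EDSL documentation.
os.environ.setdefault("OPENAI_API_KEY", "your-openai-key")
os.environ.setdefault("GOOGLE_API_KEY", "your-google-key")
os.environ.setdefault("ANTHROPIC_API_KEY", "your-anthropic-key")
```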

Selecting language models

To check a list of models currently available to use with EDSL:

[1]:
from edsl import ModelList, Model

# Model.available # uncomment and run this code

We select models to use by creating Model objects that can be added to a survey when it is run. If we do not specify a model, the default model is used with the survey.

To check the current default model:

[2]:
# Model() # uncomment and run this code

Here we select several models to compare their responses for the survey that we create in the steps below:

[3]:
models = ModelList(
    Model(m) for m in ["gemini-pro", "gpt-4o", "claude-3-5-sonnet-20240620"]
)

Generating content

EDSL comes with a variety of standard survey question types, such as multiple choice, free text, etc. These can be selected based on the desired format of the response. See details about all types here. We can use QuestionFreeText to prompt the models to generate some content for our experiment:

[4]:
from edsl import QuestionFreeText

q = QuestionFreeText(
    question_name="haiku",
    question_text="Draft a haiku about the weather in New England. Return only the haiku."
)

We generate responses by adding the models with the by method and then calling the run method. This produces a Results object containing a Result for each response to the question:

[5]:
results = q.by(models).run()

To see a list of all components of results:

[6]:
# results.columns # uncomment and run this code

We can inspect components of the results individually:

[7]:
results.select("model", "haiku").print(format="rich")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ model                       answer                             ┃
┃ .model                      .haiku                             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ gpt-4o                      Maple leaves flutter,              │
│                             Mist dances on cool breeze,        │
├────────────────────────────┼────────────────────────────────────┤
│ gemini-pro                  Snow falls soft and white,         │
│                             Spring brings rain, summer's heat, │
├────────────────────────────┼────────────────────────────────────┤
│ claude-3-5-sonnet-20240620  Fickle winds whisper               │
│                             Maple leaves dance, snow then sun  │
└────────────────────────────┴────────────────────────────────────┘

Conducting a review

Next we create a question that prompts a model to score a haiku; each generated haiku will be passed to this question as an input:

[8]:
from edsl import QuestionLinearScale

q_score = QuestionLinearScale(
    question_name="score",
    question_text="Score the following haiku on a scale from 0 to 10: {{ haiku }}",
    question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    option_labels={0: "Very poor", 10: "Excellent"},
)

Parameterizing questions

We use Scenario objects to pass each generated haiku to the new question. EDSL provides many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), as well as from existing Results objects:

[9]:
scenarios = (
    results.to_scenario_list()
    .select("model", "haiku")
    .rename({"model":"drafting_model"}) # renaming the 'model' field to distinguish the evaluating model
)
scenarios
[9]:
{
    "scenarios": [
        {
            "drafting_model": "gpt-4o",
            "haiku": "Maple leaves flutter,\nMist dances on cool breeze,"
        },
        {
            "drafting_model": "gemini-pro",
            "haiku": "Snow falls soft and white,\nSpring brings rain, summer's heat,"
        },
        {
            "drafting_model": "claude-3-5-sonnet-20240620",
            "haiku": "Fickle winds whisper\nMaple leaves dance, snow then sun"
        }
    ]
}
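When the survey runs, each scenario's values are substituted into the Jinja-style {{ placeholders }} in the question text. As a rough illustration of the mechanics (a sketch of the substitution, not EDSL's actual implementation, which uses full Jinja templating):

```python
# Illustrative sketch: fill a question's {{ placeholders }} from a scenario.
question_text = "Score the following haiku on a scale from 0 to 10: {{ haiku }}"

scenario = {
    "drafting_model": "gpt-4o",
    "haiku": "Maple leaves flutter,\nMist dances on cool breeze,",
}

rendered = question_text
for key, value in scenario.items():
    # Simple string replacement stands in for Jinja rendering here.
    rendered = rendered.replace("{{ " + key + " }}", value)

print(rendered)
```

Running the question with three scenarios and three models yields nine responses, one per scenario-model pair.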

Finally, we conduct the evaluation by having each model score each haiku that was generated (without information about whether the model itself was the source):

[10]:
results = q_score.by(scenarios).by(models).run()
[11]:
results.columns
[11]:
['agent.agent_instruction',
 'agent.agent_name',
 'answer.score',
 'comment.score_comment',
 'generated_tokens.score_generated_tokens',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.maxOutputTokens',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.stopSequences',
 'model.temperature',
 'model.topK',
 'model.topP',
 'model.top_logprobs',
 'model.top_p',
 'prompt.score_system_prompt',
 'prompt.score_user_prompt',
 'question_options.score_question_options',
 'question_text.score_question_text',
 'question_type.score_question_type',
 'raw_model_response.score_cost',
 'raw_model_response.score_one_usd_buys',
 'raw_model_response.score_raw_model_response',
 'scenario.drafting_model',
 'scenario.haiku']
[12]:
(
    results.sort_by("drafting_model", "model")
    .select("drafting_model", "model", "score", "haiku")
    .print(
        pretty_labels = {
            "scenario.drafting_model": "Drafting model",
            "model.model": "Scoring model",
            "answer.score": "Score",
            "scenario.haiku": "Haiku"
        },
        format="rich"
    )
)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Drafting model              Scoring model               Score  Haiku                              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ claude-3-5-sonnet-20240620  claude-3-5-sonnet-20240620  7      Fickle winds whisper               │
│                                                                Maple leaves dance, snow then sun  │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ claude-3-5-sonnet-20240620  gemini-pro                  8      Fickle winds whisper               │
│                                                                Maple leaves dance, snow then sun  │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ claude-3-5-sonnet-20240620  gpt-4o                      7      Fickle winds whisper               │
│                                                                Maple leaves dance, snow then sun  │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gemini-pro                  claude-3-5-sonnet-20240620  6      Snow falls soft and white,         │
│                                                                Spring brings rain, summer's heat, │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gemini-pro                  gemini-pro                  5      Snow falls soft and white,         │
│                                                                Spring brings rain, summer's heat, │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gemini-pro                  gpt-4o                      4      Snow falls soft and white,         │
│                                                                Spring brings rain, summer's heat, │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gpt-4o                      claude-3-5-sonnet-20240620  6      Maple leaves flutter,              │
│                                                                Mist dances on cool breeze,        │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gpt-4o                      gemini-pro                  5      Maple leaves flutter,              │
│                                                                Mist dances on cool breeze,        │
├────────────────────────────┼────────────────────────────┼───────┼────────────────────────────────────┤
│ gpt-4o                      gpt-4o                      9      Maple leaves flutter,              │
│                                                                Mist dances on cool breeze,        │
└────────────────────────────┴────────────────────────────┴───────┴────────────────────────────────────┘
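One simple way to probe for self-preference bias is to compare each model's average score for its own haiku against its average score for the other models' haikus. Using the scores in the table above (a tiny sample, so merely suggestive rather than conclusive):

```python
# Scores from the table above, keyed by (drafting_model, scoring_model).
scores = {
    ("claude-3-5-sonnet-20240620", "claude-3-5-sonnet-20240620"): 7,
    ("claude-3-5-sonnet-20240620", "gemini-pro"): 8,
    ("claude-3-5-sonnet-20240620", "gpt-4o"): 7,
    ("gemini-pro", "claude-3-5-sonnet-20240620"): 6,
    ("gemini-pro", "gemini-pro"): 5,
    ("gemini-pro", "gpt-4o"): 4,
    ("gpt-4o", "claude-3-5-sonnet-20240620"): 6,
    ("gpt-4o", "gemini-pro"): 5,
    ("gpt-4o", "gpt-4o"): 9,
}

# Split scores by whether the scoring model is rating its own haiku.
self_scores = [s for (drafter, scorer), s in scores.items() if drafter == scorer]
cross_scores = [s for (drafter, scorer), s in scores.items() if drafter != scorer]

avg_self = sum(self_scores) / len(self_scores)
avg_cross = sum(cross_scores) / len(cross_scores)

print(f"Average self-score:  {avg_self:.1f}")   # 7.0
print(f"Average cross-score: {avg_cross:.1f}")  # 6.0
```

Here the models rate their own haikus 7.0 on average versus 6.0 for the others', consistent with a mild self-preference, though a real study would need many more drafts, repeated trials, and randomized presentation order.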

Posting to the Coop

The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.

Here we post this notebook:

[13]:
from edsl import Notebook
[14]:
n = Notebook(path = "explore_llm_biases.ipynb")
[15]:
n.push(description = "Example code for comparing model responses and biases", visibility = "public")
[15]:
{'description': 'Example code for comparing model responses and biases',
 'object_type': 'notebook',
 'url': 'https://www.expectedparrot.com/content/07ec8176-c07e-4f83-acd5-791e3d9324d2',
 'uuid': '07ec8176-c07e-4f83-acd5-791e3d9324d2',
 'version': '0.1.33.dev1',
 'visibility': 'public'}

To update an object:

[16]:
n = Notebook(path = "explore_llm_biases.ipynb") # resave it
[17]:
n.patch(uuid = "07ec8176-c07e-4f83-acd5-791e3d9324d2", value = n)
[17]:
{'status': 'success'}