Cognitive testing & LLM biases

This notebook provides example code for using EDSL to investigate biases of large language models.

EDSL is an open-source library for simulating surveys, experiments and other research with AI agents and large language models. Before running the code below, please ensure that you have installed the EDSL library and either activated remote inference from your Coop account or stored API keys for the language models that you want to use with EDSL. Please also see our documentation page for tips and tutorials on getting started using EDSL.

Selecting language models

To check a list of models currently available to use with EDSL:

[1]:
from edsl import ModelList, Model

Model.available()
/Users/a16174/edsl/edsl/inference_services/AvailableModelFetcher.py:139: UserWarning: No models found for service ollama
  warnings.warn(f"No models found for service {service_name}")
[1]:
  Model Name Service Name
0 gemini-1.0-pro google
1 gemini-1.5-flash google
2 gemini-1.5-pro google
3 gemini-pro google
4 meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo together
5 mistralai/Mixtral-8x22B-Instruct-v0.1 together
6 meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo together
7 meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo together
8 Gryphe/MythoMax-L2-13b-Lite together
9 Salesforce/Llama-Rank-V1 together
10 meta-llama/Meta-Llama-Guard-3-8B together
11 meta-llama/Meta-Llama-3-70B-Instruct-Turbo together
12 meta-llama/Meta-Llama-3-70B-Instruct-Lite together
13 meta-llama/Meta-Llama-3-8B-Instruct-Lite together
14 meta-llama/Meta-Llama-3-8B-Instruct-Turbo together
15 meta-llama/Llama-3-70b-chat-hf together
16 meta-llama/Llama-3-8b-chat-hf together
17 Qwen/Qwen2-72B-Instruct together
18 google/gemma-2-27b-it together
19 google/gemma-2-9b-it together
20 mistralai/Mistral-7B-Instruct-v0.3 together
21 Qwen/Qwen1.5-110B-Chat together
22 meta-llama/LlamaGuard-2-8b together
23 microsoft/WizardLM-2-8x22B together
24 togethercomputer/StripedHyena-Nous-7B together
25 databricks/dbrx-instruct together
26 deepseek-ai/deepseek-llm-67b-chat together
27 google/gemma-2b-it together
28 mistralai/Mistral-7B-Instruct-v0.2 together
29 mistralai/Mixtral-8x7B-Instruct-v0.1 together
30 mistralai/Mixtral-8x7B-v0.1 together
31 Qwen/Qwen1.5-72B-Chat together
32 NousResearch/Nous-Hermes-2-Yi-34B together
33 Meta-Llama/Llama-Guard-7b together
34 NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO together
35 mistralai/Mistral-7B-Instruct-v0.1 together
36 mistralai/Mistral-7B-v0.1 together
37 meta-llama/Llama-2-13b-chat-hf together
38 meta-llama/Llama-2-7b-chat-hf together
39 meta-llama/Llama-2-70b-hf together
40 codellama/CodeLlama-34b-Instruct-hf together
41 upstage/SOLAR-10.7B-Instruct-v1.0 together
42 togethercomputer/m2-bert-80M-32k-retrieval together
43 togethercomputer/m2-bert-80M-8k-retrieval together
44 togethercomputer/m2-bert-80M-2k-retrieval together
45 WhereIsAI/UAE-Large-V1 together
46 BAAI/bge-large-en-v1.5 together
47 BAAI/bge-base-en-v1.5 together
48 Gryphe/MythoMax-L2-13b together
49 cursor/Llama-3-8b-hf together
50 amazon.titan-text-express-v1 bedrock
51 amazon.titan-text-lite-v1 bedrock
52 anthropic.claude-3-5-sonnet-20240620-v1:0 bedrock
53 anthropic.claude-3-haiku-20240307-v1:0 bedrock
54 anthropic.claude-3-opus-20240229-v1:0 bedrock
55 anthropic.claude-3-sonnet-20240229-v1:0 bedrock
56 anthropic.claude-instant-v1 bedrock
57 anthropic.claude-v2 bedrock
58 anthropic.claude-v2:1 bedrock
59 cohere.command-light-text-v14 bedrock
60 cohere.command-r-plus-v1:0 bedrock
61 cohere.command-r-v1:0 bedrock
62 cohere.command-text-v14 bedrock
63 meta.llama3-1-405b-instruct-v1:0 bedrock
64 meta.llama3-1-70b-instruct-v1:0 bedrock
65 meta.llama3-1-8b-instruct-v1:0 bedrock
66 meta.llama3-70b-instruct-v1:0 bedrock
67 meta.llama3-8b-instruct-v1:0 bedrock
68 mistral.mistral-7b-instruct-v0:2 bedrock
69 mistral.mistral-large-2402-v1:0 bedrock
70 mistral.mixtral-8x7b-instruct-v0:1 bedrock
71 gemma-7b-it groq
72 gemma2-9b-it groq
73 llama-3.1-70b-versatile groq
74 llama-3.1-8b-instant groq
75 llama-guard-3-8b groq
76 llama3-70b-8192 groq
77 llama3-8b-8192 groq
78 llama3-groq-70b-8192-tool-use-preview groq
79 llama3-groq-8b-8192-tool-use-preview groq
80 mixtral-8x7b-32768 groq
81 test test
82 Austism/chronos-hermes-13b-v2 deep_infra
83 Gryphe/MythoMax-L2-13b deep_infra
84 Qwen/Qwen2-72B-Instruct deep_infra
85 Qwen/Qwen2-7B-Instruct deep_infra
86 Qwen/Qwen2.5-72B-Instruct deep_infra
87 Sao10K/L3-70B-Euryale-v2.1 deep_infra
88 Sao10K/L3.1-70B-Euryale-v2.2 deep_infra
89 google/gemma-2-27b-it deep_infra
90 google/gemma-2-9b-it deep_infra
91 lizpreciatior/lzlv_70b_fp16_hf deep_infra
92 meta-llama/Meta-Llama-3-70B-Instruct deep_infra
93 meta-llama/Meta-Llama-3-8B-Instruct deep_infra
94 meta-llama/Meta-Llama-3.1-405B-Instruct deep_infra
95 meta-llama/Meta-Llama-3.1-70B-Instruct deep_infra
96 meta-llama/Meta-Llama-3.1-8B-Instruct deep_infra
97 mistralai/Mistral-7B-Instruct-v0.3 deep_infra
98 microsoft/Phi-3-medium-4k-instruct deep_infra
99 microsoft/WizardLM-2-7B deep_infra
100 microsoft/WizardLM-2-8x22B deep_infra
101 mistralai/Mistral-Nemo-Instruct-2407 deep_infra
102 mistralai/Mixtral-8x7B-Instruct-v0.1 deep_infra
103 openbmb/MiniCPM-Llama3-V-2_5 deep_infra
104 openchat/openchat_3.5 deep_infra
105 azure:gpt-4o azure
106 azure:gpt-4o-mini azure
107 claude-3-5-sonnet-20240620 anthropic
108 claude-3-opus-20240229 anthropic
109 claude-3-sonnet-20240229 anthropic
110 claude-3-haiku-20240307 anthropic
111 gpt-4o-realtime-preview openai
112 gpt-4o-realtime-preview-2024-10-01 openai
113 o1-mini-2024-09-12 openai
114 gpt-4-1106-preview openai
115 gpt-3.5-turbo-16k openai
116 gpt-4-0125-preview openai
117 gpt-4-turbo-preview openai
118 omni-moderation-latest openai
119 gpt-4o-2024-05-13 openai
120 omni-moderation-2024-09-26 openai
121 chatgpt-4o-latest openai
122 gpt-4 openai
123 gpt-4-0613 openai
124 gpt-4o openai
125 gpt-4o-2024-08-06 openai
126 o1-mini openai
127 gpt-3.5-turbo openai
128 gpt-3.5-turbo-0125 openai
129 o1-preview openai
130 o1-preview-2024-09-12 openai
131 gpt-4-turbo openai
132 gpt-4-turbo-2024-04-09 openai
133 gpt-3.5-turbo-1106 openai
134 gpt-4o-mini-2024-07-18 openai
135 gpt-4o-audio-preview openai
136 gpt-4o-audio-preview-2024-10-01 openai
137 gpt-4o-mini openai
138 gpt-4o-realtime-preview-2024-12-17 openai
139 gpt-4o-mini-realtime-preview openai
140 gpt-4o-mini-realtime-preview-2024-12-17 openai
141 gpt-4o-2024-11-20 openai
142 gpt-4o-audio-preview-2024-12-17 openai
143 gpt-4o-mini-audio-preview openai
144 gpt-4o-mini-audio-preview-2024-12-17 openai
145 curie:ft-emeritus-2022-12-01-14-49-45 openai
146 curie:ft-emeritus-2022-12-01-16-40-12 openai
147 curie:ft-emeritus-2022-11-30-12-58-24 openai
148 davinci:ft-emeritus-2022-11-30-14-57-33 openai
149 curie:ft-emeritus-2022-12-01-01-51-20 openai
150 curie:ft-emeritus-2022-12-01-01-04-36 openai
151 curie:ft-emeritus-2022-12-01-15-42-25 openai
152 curie:ft-emeritus-2022-12-01-15-29-32 openai
153 curie:ft-emeritus-2022-12-01-15-52-24 openai
154 curie:ft-emeritus-2022-12-01-14-28-00 openai
155 curie:ft-emeritus-2022-12-01-14-16-46 openai
156 llama-3.1-sonar-huge-128k-online perplexity
157 llama-3.1-sonar-large-128k-online perplexity
158 llama-3.1-sonar-small-128k-online perplexity
159 codestral-2405 mistral
160 mistral-embed mistral
161 mistral-large-2407 mistral
162 mistral-medium-latest mistral
163 mistral-small-2409 mistral
164 mistral-small-latest mistral
165 open-mistral-7b mistral
166 open-mistral-nemo-2407 mistral
167 open-mixtral-8x22b mistral
168 open-mixtral-8x7b mistral
169 pixtral-12b-2409 mistral

We select models to use by creating Model objects that can be added to a survey when it is run. If we do not specify a model, the default model is used with the survey.

To check the current default model:

[2]:
Model()
[2]:

LanguageModel

  key value
0 model gpt-4o
1 parameters:temperature 0.500000
2 parameters:max_tokens 1000
3 parameters:top_p 1
4 parameters:frequency_penalty 0
5 parameters:presence_penalty 0
6 parameters:logprobs False
7 parameters:top_logprobs 3

Here we select several models to compare their responses for the survey that we create in the steps below:

[3]:
models = ModelList(
    Model(m) for m in ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]
)

Generating content

EDSL comes with a variety of standard survey question types, such as multiple choice, free text, etc. These can be selected based on the desired format of the response. See details about all types here. We can use QuestionFreeText to prompt the models to generate some content for our experiment:

[4]:
from edsl import QuestionFreeText

q = QuestionFreeText(
    question_name="haiku",
    question_text="Draft a haiku about the weather in New England. Return only the haiku."
)

We generate responses to the question by adding the models with the by method and then calling the run method. This generates a Results object containing a Result for each model's response to the question:

[5]:
results = q.by(models).run()
Job Status (2024-12-28 11:09:32)
Job UUID 344e9a4a-8ea1-4d00-83b3-1feb9e00ca6c
Progress Bar URL https://www.expectedparrot.com/home/remote-job-progress/344e9a4a-8ea1-4d00-83b3-1feb9e00ca6c
Error Report URL None
Results UUID 020aff9e-d90e-4959-979a-b36c2cbaba8a
Results URL None
Current Status: Job completed and Results stored on Coop: https://www.expectedparrot.com/content/020aff9e-d90e-4959-979a-b36c2cbaba8a

To see a list of all components of results:

[6]:
results.columns
[6]:
  0
0 agent.agent_instruction
1 agent.agent_name
2 answer.haiku
3 comment.haiku_comment
4 generated_tokens.haiku_generated_tokens
5 iteration.iteration
6 model.frequency_penalty
7 model.logprobs
8 model.maxOutputTokens
9 model.max_tokens
10 model.model
11 model.presence_penalty
12 model.stopSequences
13 model.temperature
14 model.topK
15 model.topP
16 model.top_logprobs
17 model.top_p
18 prompt.haiku_system_prompt
19 prompt.haiku_user_prompt
20 question_options.haiku_question_options
21 question_text.haiku_question_text
22 question_type.haiku_question_type
23 raw_model_response.haiku_cost
24 raw_model_response.haiku_one_usd_buys
25 raw_model_response.haiku_raw_model_response

We can inspect components of the results individually:

[7]:
results.select("model", "haiku")
[7]:
  model.model answer.haiku
0 gemini-1.5-flash Sun, then snow, then rain, Wind howls a New England tune,
1 gpt-4o Crisp leaves whispering, Misty mornings, fleeting sun—
2 claude-3-5-sonnet-20240620 Fickle seasons change Snow melts, flowers bloom, leaves fall

Conducting a review

Next we create a question that has each model evaluate a response, which we pass as an input to the new question:

[8]:
from edsl import QuestionLinearScale

q_score = QuestionLinearScale(
    question_name="score",
    question_text="Score the following haiku on a scale from 0 to 10: {{ haiku }}",
    question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    option_labels={0: "Very poor", 10: "Excellent"},
)
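The {{ haiku }} placeholder in the question text is filled in from a scenario when the question is run. As a minimal sketch of what happens (plain string replacement standing in for EDSL's Jinja-style templating, shown here for illustration only):

```python
# Plain-Python sketch of scenario substitution, NOT EDSL internals.
# EDSL renders {{ haiku }} with Jinja-style templating; a simple
# string replacement illustrates the idea.
question_text = "Score the following haiku on a scale from 0 to 10: {{ haiku }}"
scenario = {"haiku": "Crisp leaves whispering, Misty mornings, fleeting sun—"}

rendered = question_text
for key, value in scenario.items():
    rendered = rendered.replace("{{ " + key + " }}", value)

print(rendered)
```

In the actual survey run, EDSL performs this substitution once per scenario, so each haiku produces its own rendered prompt.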

Parameterizing questions

We use Scenario objects to pass each generated haiku to the new question as a parameter. EDSL comes with many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), as well as Results objects:

[9]:
scenarios = (
    results.to_scenario_list()
    .select("model", "haiku")
    .rename({"model":"drafting_model"}) # renaming the 'model' field to distinguish the evaluating model
)
scenarios
[9]:

ScenarioList scenarios: 3; keys: ['drafting_model', 'haiku'];

  haiku drafting_model
0 Sun, then snow, then rain, Wind howls a New England tune, gemini-1.5-flash
1 Crisp leaves whispering, Misty mornings, fleeting sun— gpt-4o
2 Fickle seasons change Snow melts, flowers bloom, leaves fall claude-3-5-sonnet-20240620
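Running the scoring question with these three scenarios and three models produces one response per (haiku, scoring model) combination. A quick plain-Python sketch of the combinations (for illustration; EDSL constructs these interviews automatically):

```python
from itertools import product

# The three models used in this notebook, appearing in both roles.
drafting_models = ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]
scoring_models = drafting_models  # the same models score each haiku

# Each (scenario, model) combination becomes one response: 3 x 3 = 9.
pairs = list(product(drafting_models, scoring_models))
print(len(pairs))
```

The diagonal of this grid, where a model scores its own haiku, is what lets us look for self-preference bias in the results below.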

Finally, we conduct the evaluation by having each model score each haiku that was generated (without being told which model drafted it):

[10]:
results = q_score.by(scenarios).by(models).run()
Job Status (2024-12-28 11:09:48)
Job UUID 6e35a7f6-78ca-4bfc-9a58-e01f910a2956
Progress Bar URL https://www.expectedparrot.com/home/remote-job-progress/6e35a7f6-78ca-4bfc-9a58-e01f910a2956
Error Report URL None
Results UUID d8d41e36-0caa-4a3f-9fee-45268776a7aa
Results URL None
Current Status: Job completed and Results stored on Coop: https://www.expectedparrot.com/content/d8d41e36-0caa-4a3f-9fee-45268776a7aa
[11]:
results.columns
[11]:
  0
0 agent.agent_instruction
1 agent.agent_name
2 answer.score
3 comment.score_comment
4 generated_tokens.score_generated_tokens
5 iteration.iteration
6 model.frequency_penalty
7 model.logprobs
8 model.maxOutputTokens
9 model.max_tokens
10 model.model
11 model.presence_penalty
12 model.stopSequences
13 model.temperature
14 model.topK
15 model.topP
16 model.top_logprobs
17 model.top_p
18 prompt.score_system_prompt
19 prompt.score_user_prompt
20 question_options.score_question_options
21 question_text.score_question_text
22 question_type.score_question_type
23 raw_model_response.score_cost
24 raw_model_response.score_one_usd_buys
25 raw_model_response.score_raw_model_response
26 scenario.drafting_model
27 scenario.haiku
[12]:
(
    results.sort_by("drafting_model", "model")
    .select("drafting_model", "model", "score", "haiku")
    .print(
        pretty_labels = {
            "scenario.drafting_model": "Drafting model",
            "model.model": "Scoring model",
            "answer.score": "Score",
            "scenario.haiku": "Haiku"
        }
    )
)
[12]:
  Drafting model Scoring model Score Haiku
0 claude-3-5-sonnet-20240620 claude-3-5-sonnet-20240620 8 Fickle seasons change Snow melts, flowers bloom, leaves fall
1 claude-3-5-sonnet-20240620 gemini-1.5-flash 7 Fickle seasons change Snow melts, flowers bloom, leaves fall
2 claude-3-5-sonnet-20240620 gpt-4o 6 Fickle seasons change Snow melts, flowers bloom, leaves fall
3 gemini-1.5-flash claude-3-5-sonnet-20240620 7 Sun, then snow, then rain, Wind howls a New England tune,
4 gemini-1.5-flash gemini-1.5-flash 7 Sun, then snow, then rain, Wind howls a New England tune,
5 gemini-1.5-flash gpt-4o 6 Sun, then snow, then rain, Wind howls a New England tune,
6 gpt-4o claude-3-5-sonnet-20240620 8 Crisp leaves whispering, Misty mornings, fleeting sun—
7 gpt-4o gemini-1.5-flash 7 Crisp leaves whispering, Misty mornings, fleeting sun—
8 gpt-4o gpt-4o 8 Crisp leaves whispering, Misty mornings, fleeting sun—
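As a quick check for self-preference bias, we can tabulate these scores by hand. This is a plain-Python sketch, not an EDSL method; the score values are copied from the table above:

```python
from collections import defaultdict

# Scores copied from the table above, keyed by (drafting_model, scoring_model).
scores = {
    ("claude-3-5-sonnet-20240620", "claude-3-5-sonnet-20240620"): 8,
    ("claude-3-5-sonnet-20240620", "gemini-1.5-flash"): 7,
    ("claude-3-5-sonnet-20240620", "gpt-4o"): 6,
    ("gemini-1.5-flash", "claude-3-5-sonnet-20240620"): 7,
    ("gemini-1.5-flash", "gemini-1.5-flash"): 7,
    ("gemini-1.5-flash", "gpt-4o"): 6,
    ("gpt-4o", "claude-3-5-sonnet-20240620"): 8,
    ("gpt-4o", "gemini-1.5-flash"): 7,
    ("gpt-4o", "gpt-4o"): 8,
}

# Mean score each haiku received across all scoring models.
by_drafter = defaultdict(list)
for (drafter, scorer), score in scores.items():
    by_drafter[drafter].append(score)
mean_scores = {d: sum(s) / len(s) for d, s in by_drafter.items()}

# Compare each model's self-score to the mean score it received from the
# other models -- a simple check for self-preference bias.
for drafter in mean_scores:
    self_score = scores[(drafter, drafter)]
    others = [s for (d, sc), s in scores.items() if d == drafter and sc != drafter]
    print(drafter, self_score, sum(others) / len(others))
```

In this small sample, each model's self-score is at least as high as the mean score it received from the other two models, though nine observations are far too few to draw conclusions; rerunning with more questions and iterations would be needed.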

Posting to the Coop

The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.

Here we post this notebook:

[13]:
from edsl import Notebook
[14]:
n = Notebook(path = "explore_llm_biases.ipynb")
[15]:
info = n.push(description = "Example code for comparing model responses and biases", visibility = "public")
info
[15]:
{'description': 'Example code for comparing model responses and biases',
 'object_type': 'notebook',
 'url': 'https://www.expectedparrot.com/content/9e010472-2c69-4728-98b6-10f23819ed08',
 'uuid': '9e010472-2c69-4728-98b6-10f23819ed08',
 'version': '0.1.39.dev2',
 'visibility': 'public'}

To update an object:

[16]:
n = Notebook(path = "explore_llm_biases.ipynb") # resave it
[17]:
n.patch(uuid = info["uuid"], value = n)
[17]:
{'status': 'success'}