Cognitive testing & LLM biases
This notebook provides example code for using EDSL to investigate biases of large language models.
EDSL is an open-source library for simulating surveys, experiments and other research with AI agents and large language models. Before running the code below, please ensure that you have installed the EDSL library and either activated remote inference from your Coop account or stored API keys for the language models that you want to use with EDSL. Please also see our documentation page for tips and tutorials on getting started using EDSL.
Selecting language models
To check a list of models currently available to use with EDSL:
[1]:
from edsl import ModelList, Model
Model.available()
[1]:
| Model Name | Service Name
---|---|---
0 | gemini-1.0-pro | google
1 | gemini-1.5-flash | google
2 | gemini-1.5-pro | google
3 | gemini-pro | google
4 | meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo | together |
5 | mistralai/Mixtral-8x22B-Instruct-v0.1 | together |
6 | meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo | together |
7 | meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | together |
8 | Gryphe/MythoMax-L2-13b-Lite | together |
9 | Salesforce/Llama-Rank-V1 | together |
10 | meta-llama/Meta-Llama-Guard-3-8B | together |
11 | meta-llama/Meta-Llama-3-70B-Instruct-Turbo | together |
12 | meta-llama/Meta-Llama-3-70B-Instruct-Lite | together |
13 | meta-llama/Meta-Llama-3-8B-Instruct-Lite | together |
14 | meta-llama/Meta-Llama-3-8B-Instruct-Turbo | together |
15 | meta-llama/Llama-3-70b-chat-hf | together |
16 | meta-llama/Llama-3-8b-chat-hf | together |
17 | Qwen/Qwen2-72B-Instruct | together |
18 | google/gemma-2-27b-it | together |
19 | google/gemma-2-9b-it | together |
20 | mistralai/Mistral-7B-Instruct-v0.3 | together |
21 | Qwen/Qwen1.5-110B-Chat | together |
22 | meta-llama/LlamaGuard-2-8b | together |
23 | microsoft/WizardLM-2-8x22B | together |
24 | togethercomputer/StripedHyena-Nous-7B | together |
25 | databricks/dbrx-instruct | together |
26 | deepseek-ai/deepseek-llm-67b-chat | together |
27 | google/gemma-2b-it | together |
28 | mistralai/Mistral-7B-Instruct-v0.2 | together |
29 | mistralai/Mixtral-8x7B-Instruct-v0.1 | together |
30 | mistralai/Mixtral-8x7B-v0.1 | together |
31 | Qwen/Qwen1.5-72B-Chat | together |
32 | NousResearch/Nous-Hermes-2-Yi-34B | together |
33 | Meta-Llama/Llama-Guard-7b | together |
34 | NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO | together |
35 | mistralai/Mistral-7B-Instruct-v0.1 | together |
36 | mistralai/Mistral-7B-v0.1 | together |
37 | meta-llama/Llama-2-13b-chat-hf | together |
38 | meta-llama/Llama-2-7b-chat-hf | together |
39 | meta-llama/Llama-2-70b-hf | together |
40 | codellama/CodeLlama-34b-Instruct-hf | together |
41 | upstage/SOLAR-10.7B-Instruct-v1.0 | together |
42 | togethercomputer/m2-bert-80M-32k-retrieval | together |
43 | togethercomputer/m2-bert-80M-8k-retrieval | together |
44 | togethercomputer/m2-bert-80M-2k-retrieval | together |
45 | WhereIsAI/UAE-Large-V1 | together |
46 | BAAI/bge-large-en-v1.5 | together |
47 | BAAI/bge-base-en-v1.5 | together |
48 | Gryphe/MythoMax-L2-13b | together |
49 | cursor/Llama-3-8b-hf | together |
50 | amazon.titan-text-express-v1 | bedrock |
51 | amazon.titan-text-lite-v1 | bedrock |
52 | anthropic.claude-3-5-sonnet-20240620-v1:0 | bedrock |
53 | anthropic.claude-3-haiku-20240307-v1:0 | bedrock |
54 | anthropic.claude-3-opus-20240229-v1:0 | bedrock |
55 | anthropic.claude-3-sonnet-20240229-v1:0 | bedrock |
56 | anthropic.claude-instant-v1 | bedrock |
57 | anthropic.claude-v2 | bedrock |
58 | anthropic.claude-v2:1 | bedrock |
59 | cohere.command-light-text-v14 | bedrock |
60 | cohere.command-r-plus-v1:0 | bedrock |
61 | cohere.command-r-v1:0 | bedrock |
62 | cohere.command-text-v14 | bedrock |
63 | meta.llama3-1-405b-instruct-v1:0 | bedrock |
64 | meta.llama3-1-70b-instruct-v1:0 | bedrock |
65 | meta.llama3-1-8b-instruct-v1:0 | bedrock |
66 | meta.llama3-70b-instruct-v1:0 | bedrock |
67 | meta.llama3-8b-instruct-v1:0 | bedrock |
68 | mistral.mistral-7b-instruct-v0:2 | bedrock |
69 | mistral.mistral-large-2402-v1:0 | bedrock |
70 | mistral.mixtral-8x7b-instruct-v0:1 | bedrock |
71 | gemma-7b-it | groq |
72 | gemma2-9b-it | groq |
73 | llama-3.1-70b-versatile | groq |
74 | llama-3.1-8b-instant | groq |
75 | llama-guard-3-8b | groq |
76 | llama3-70b-8192 | groq |
77 | llama3-8b-8192 | groq |
78 | llama3-groq-70b-8192-tool-use-preview | groq |
79 | llama3-groq-8b-8192-tool-use-preview | groq |
80 | mixtral-8x7b-32768 | groq |
81 | test | test |
82 | Austism/chronos-hermes-13b-v2 | deep_infra |
83 | Gryphe/MythoMax-L2-13b | deep_infra |
84 | Qwen/Qwen2-72B-Instruct | deep_infra |
85 | Qwen/Qwen2-7B-Instruct | deep_infra |
86 | Qwen/Qwen2.5-72B-Instruct | deep_infra |
87 | Sao10K/L3-70B-Euryale-v2.1 | deep_infra |
88 | Sao10K/L3.1-70B-Euryale-v2.2 | deep_infra |
89 | google/gemma-2-27b-it | deep_infra |
90 | google/gemma-2-9b-it | deep_infra |
91 | lizpreciatior/lzlv_70b_fp16_hf | deep_infra |
92 | meta-llama/Meta-Llama-3-70B-Instruct | deep_infra |
93 | meta-llama/Meta-Llama-3-8B-Instruct | deep_infra |
94 | meta-llama/Meta-Llama-3.1-405B-Instruct | deep_infra |
95 | meta-llama/Meta-Llama-3.1-70B-Instruct | deep_infra |
96 | meta-llama/Meta-Llama-3.1-8B-Instruct | deep_infra |
97 | mistralai/Mistral-7B-Instruct-v0.3 | deep_infra |
98 | microsoft/Phi-3-medium-4k-instruct | deep_infra |
99 | microsoft/WizardLM-2-7B | deep_infra |
100 | microsoft/WizardLM-2-8x22B | deep_infra |
101 | mistralai/Mistral-Nemo-Instruct-2407 | deep_infra |
102 | mistralai/Mixtral-8x7B-Instruct-v0.1 | deep_infra |
103 | openbmb/MiniCPM-Llama3-V-2_5 | deep_infra |
104 | openchat/openchat_3.5 | deep_infra |
105 | azure:gpt-4o | azure |
106 | azure:gpt-4o-mini | azure |
107 | claude-3-5-sonnet-20240620 | anthropic |
108 | claude-3-opus-20240229 | anthropic |
109 | claude-3-sonnet-20240229 | anthropic |
110 | claude-3-haiku-20240307 | anthropic |
111 | gpt-4o-realtime-preview | openai |
112 | gpt-4o-realtime-preview-2024-10-01 | openai |
113 | o1-mini-2024-09-12 | openai |
114 | gpt-4-1106-preview | openai |
115 | gpt-3.5-turbo-16k | openai |
116 | gpt-4-0125-preview | openai |
117 | gpt-4-turbo-preview | openai |
118 | omni-moderation-latest | openai |
119 | gpt-4o-2024-05-13 | openai |
120 | omni-moderation-2024-09-26 | openai |
121 | chatgpt-4o-latest | openai |
122 | gpt-4 | openai |
123 | gpt-4-0613 | openai |
124 | gpt-4o | openai |
125 | gpt-4o-2024-08-06 | openai |
126 | o1-mini | openai |
127 | gpt-3.5-turbo | openai |
128 | gpt-3.5-turbo-0125 | openai |
129 | o1-preview | openai |
130 | o1-preview-2024-09-12 | openai |
131 | gpt-4-turbo | openai |
132 | gpt-4-turbo-2024-04-09 | openai |
133 | gpt-3.5-turbo-1106 | openai |
134 | gpt-4o-mini-2024-07-18 | openai |
135 | gpt-4o-audio-preview | openai |
136 | gpt-4o-audio-preview-2024-10-01 | openai |
137 | gpt-4o-mini | openai |
138 | gpt-4o-realtime-preview-2024-12-17 | openai |
139 | gpt-4o-mini-realtime-preview | openai |
140 | gpt-4o-mini-realtime-preview-2024-12-17 | openai |
141 | gpt-4o-2024-11-20 | openai |
142 | gpt-4o-audio-preview-2024-12-17 | openai |
143 | gpt-4o-mini-audio-preview | openai |
144 | gpt-4o-mini-audio-preview-2024-12-17 | openai |
145 | curie:ft-emeritus-2022-12-01-14-49-45 | openai |
146 | curie:ft-emeritus-2022-12-01-16-40-12 | openai |
147 | curie:ft-emeritus-2022-11-30-12-58-24 | openai |
148 | davinci:ft-emeritus-2022-11-30-14-57-33 | openai |
149 | curie:ft-emeritus-2022-12-01-01-51-20 | openai |
150 | curie:ft-emeritus-2022-12-01-01-04-36 | openai |
151 | curie:ft-emeritus-2022-12-01-15-42-25 | openai |
152 | curie:ft-emeritus-2022-12-01-15-29-32 | openai |
153 | curie:ft-emeritus-2022-12-01-15-52-24 | openai |
154 | curie:ft-emeritus-2022-12-01-14-28-00 | openai |
155 | curie:ft-emeritus-2022-12-01-14-16-46 | openai |
156 | llama-3.1-sonar-huge-128k-online | perplexity |
157 | llama-3.1-sonar-large-128k-online | perplexity |
158 | llama-3.1-sonar-small-128k-online | perplexity |
159 | codestral-2405 | mistral |
160 | mistral-embed | mistral |
161 | mistral-large-2407 | mistral |
162 | mistral-medium-latest | mistral |
163 | mistral-small-2409 | mistral |
164 | mistral-small-latest | mistral |
165 | open-mistral-7b | mistral |
166 | open-mistral-nemo-2407 | mistral |
167 | open-mixtral-8x22b | mistral |
168 | open-mixtral-8x7b | mistral |
169 | pixtral-12b-2409 | mistral |
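In recent versions of EDSL you can also filter this list by inference service (a minimal sketch, assuming the service keyword shown in the EDSL documentation):

[ ]:
from edsl import Model

# List only the models offered by a single service, e.g. OpenAI
Model.available(service="openai")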
We select models to use by creating Model objects that can be added to a survey when it is run. If we do not specify a model, the default model is used with the survey.
To check the current default model:
[2]:
Model()
[2]:
| key | value
---|---|---
0 | model | gpt-4o |
1 | parameters:temperature | 0.500000 |
2 | parameters:max_tokens | 1000 |
3 | parameters:top_p | 1 |
4 | parameters:frequency_penalty | 0 |
5 | parameters:presence_penalty | 0 |
6 | parameters:logprobs | False |
7 | parameters:top_logprobs | 3 |
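Any of these parameters can be overridden when a model is created (a sketch; parameter names match the table above):

[ ]:
from edsl import Model

# Create a model with non-default sampling parameters
model = Model("gpt-4o", temperature=1.0, max_tokens=2000)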
Here we select several models to compare their responses for the survey that we create in the steps below:
[3]:
models = ModelList(
Model(m) for m in ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]
)
Generating content
EDSL comes with a variety of standard survey question types, such as multiple choice, free text, linear scale, etc., which can be selected based on the desired format of the response. See the documentation for details about all question types. We can use QuestionFreeText to prompt the models to generate some content for our experiment:
[4]:
from edsl import QuestionFreeText
q = QuestionFreeText(
question_name="haiku",
question_text="Draft a haiku about the weather in New England. Return only the haiku."
)
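Other question types follow the same pattern; for example, a multiple choice question constrains the response to a fixed set of options (shown for illustration only; it is not used below):

[ ]:
from edsl import QuestionMultipleChoice

q_season = QuestionMultipleChoice(
    question_name="season",
    question_text="Which season does your haiku evoke most strongly?",
    question_options=["Winter", "Spring", "Summer", "Fall"],
)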
We generate responses to the haiku question by adding the models to use with the by method and then calling the run method. This generates a Results object with a Result for each response to the question:
[5]:
results = q.by(models).run()
Job UUID | 344e9a4a-8ea1-4d00-83b3-1feb9e00ca6c |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/344e9a4a-8ea1-4d00-83b3-1feb9e00ca6c |
Error Report URL | None |
Results UUID | 020aff9e-d90e-4959-979a-b36c2cbaba8a |
Results URL | None |
To see a list of all components of results:
[6]:
results.columns
[6]:
| 0
---|---
0 | agent.agent_instruction |
1 | agent.agent_name |
2 | answer.haiku |
3 | comment.haiku_comment |
4 | generated_tokens.haiku_generated_tokens |
5 | iteration.iteration |
6 | model.frequency_penalty |
7 | model.logprobs |
8 | model.maxOutputTokens |
9 | model.max_tokens |
10 | model.model |
11 | model.presence_penalty |
12 | model.stopSequences |
13 | model.temperature |
14 | model.topK |
15 | model.topP |
16 | model.top_logprobs |
17 | model.top_p |
18 | prompt.haiku_system_prompt |
19 | prompt.haiku_user_prompt |
20 | question_options.haiku_question_options |
21 | question_text.haiku_question_text |
22 | question_type.haiku_question_type |
23 | raw_model_response.haiku_cost |
24 | raw_model_response.haiku_one_usd_buys |
25 | raw_model_response.haiku_raw_model_response |
We can inspect components of the results individually:
[7]:
results.select("model", "haiku")
[7]:
| model.model | answer.haiku
---|---|---
0 | gemini-1.5-flash | Sun, then snow, then rain, Wind howls a New England tune, |
1 | gpt-4o | Crisp leaves whispering, Misty mornings, fleeting sun— |
2 | claude-3-5-sonnet-20240620 | Fickle seasons change Snow melts, flowers bloom, leaves fall |
Conducting a review
Next we create a question prompting a model to evaluate a response, where each haiku generated above is used as an input to the new question:
[8]:
from edsl import QuestionLinearScale
q_score = QuestionLinearScale(
question_name="score",
question_text="Score the following haiku on a scale from 0 to 10: {{ haiku }}",
question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
option_labels={0: "Very poor", 10: "Excellent"},
)
Parameterizing questions
We use Scenario objects to add each response to the new question. EDSL comes with many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), as well as from Results objects:
[9]:
scenarios = (
results.to_scenario_list()
.select("model", "haiku")
.rename({"model":"drafting_model"}) # renaming the 'model' field to distinguish the evaluating model
)
scenarios
[9]:
ScenarioList scenarios: 3; keys: ['drafting_model', 'haiku'];
| haiku | drafting_model
---|---|---
0 | Sun, then snow, then rain, Wind howls a New England tune, | gemini-1.5-flash |
1 | Crisp leaves whispering, Misty mornings, fleeting sun— | gpt-4o |
2 | Fickle seasons change Snow melts, flowers bloom, leaves fall | claude-3-5-sonnet-20240620 |
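Scenarios do not have to come from results; for example, a plain list of texts can be turned into scenarios directly (a sketch with made-up haikus):

[ ]:
from edsl import ScenarioList

# Build scenarios for a field named 'haiku' from a list of strings
other_scenarios = ScenarioList.from_list("haiku", [
    "Snow on the maples",
    "Fog rolls off the bay at dawn",
])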
Finally, we conduct the evaluation by having each model score each haiku that was generated (the scoring model is not told which model drafted each haiku):
[10]:
results = q_score.by(scenarios).by(models).run()
Job UUID | 6e35a7f6-78ca-4bfc-9a58-e01f910a2956 |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/6e35a7f6-78ca-4bfc-9a58-e01f910a2956 |
Error Report URL | None |
Results UUID | d8d41e36-0caa-4a3f-9fee-45268776a7aa |
Results URL | None |
[11]:
results.columns
[11]:
| 0
---|---
0 | agent.agent_instruction |
1 | agent.agent_name |
2 | answer.score |
3 | comment.score_comment |
4 | generated_tokens.score_generated_tokens |
5 | iteration.iteration |
6 | model.frequency_penalty |
7 | model.logprobs |
8 | model.maxOutputTokens |
9 | model.max_tokens |
10 | model.model |
11 | model.presence_penalty |
12 | model.stopSequences |
13 | model.temperature |
14 | model.topK |
15 | model.topP |
16 | model.top_logprobs |
17 | model.top_p |
18 | prompt.score_system_prompt |
19 | prompt.score_user_prompt |
20 | question_options.score_question_options |
21 | question_text.score_question_text |
22 | question_type.score_question_type |
23 | raw_model_response.score_cost |
24 | raw_model_response.score_one_usd_buys |
25 | raw_model_response.score_raw_model_response |
26 | scenario.drafting_model |
27 | scenario.haiku |
[12]:
(
results.sort_by("drafting_model", "model")
.select("drafting_model", "model", "score", "haiku")
.print(
pretty_labels = {
"scenario.drafting_model": "Drafting model",
"model.model": "Scoring model",
"answer.score": "Score",
"scenario.haiku": "Haiku"
}
)
)
[12]:
| Drafting model | Scoring model | Score | Haiku
---|---|---|---|---
0 | claude-3-5-sonnet-20240620 | claude-3-5-sonnet-20240620 | 8 | Fickle seasons change Snow melts, flowers bloom, leaves fall |
1 | claude-3-5-sonnet-20240620 | gemini-1.5-flash | 7 | Fickle seasons change Snow melts, flowers bloom, leaves fall |
2 | claude-3-5-sonnet-20240620 | gpt-4o | 6 | Fickle seasons change Snow melts, flowers bloom, leaves fall |
3 | gemini-1.5-flash | claude-3-5-sonnet-20240620 | 7 | Sun, then snow, then rain, Wind howls a New England tune, |
4 | gemini-1.5-flash | gemini-1.5-flash | 7 | Sun, then snow, then rain, Wind howls a New England tune, |
5 | gemini-1.5-flash | gpt-4o | 6 | Sun, then snow, then rain, Wind howls a New England tune, |
6 | gpt-4o | claude-3-5-sonnet-20240620 | 8 | Crisp leaves whispering, Misty mornings, fleeting sun— |
7 | gpt-4o | gemini-1.5-flash | 7 | Crisp leaves whispering, Misty mornings, fleeting sun— |
8 | gpt-4o | gpt-4o | 8 | Crisp leaves whispering, Misty mornings, fleeting sun— |
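A quick way to check for self-preference bias is to compare the scores each model gave to its own haiku with the scores it gave to the others (a sketch using pandas; column names assume the prefixed labels shown in results.columns above):

[ ]:
df = results.to_pandas()

# Flag rows where the scoring model is also the drafting model
df["is_self"] = df["model.model"] == df["scenario.drafting_model"]

# Compare mean scores for own vs. others' haikus, per scoring model
print(df.groupby(["model.model", "is_self"])["answer.score"].mean())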
Posting to the Coop
The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.
Here we post this notebook:
[13]:
from edsl import Notebook
[14]:
n = Notebook(path = "explore_llm_biases.ipynb")
[15]:
info = n.push(description = "Example code for comparing model responses and biases", visibility = "public")
info
[15]:
{'description': 'Example code for comparing model responses and biases',
'object_type': 'notebook',
'url': 'https://www.expectedparrot.com/content/9e010472-2c69-4728-98b6-10f23819ed08',
'uuid': '9e010472-2c69-4728-98b6-10f23819ed08',
'version': '0.1.39.dev2',
'visibility': 'public'}
To update an object:
[16]:
n = Notebook(path = "explore_llm_biases.ipynb") # resave it
[17]:
n.patch(uuid = info["uuid"], value = n)
[17]:
{'status': 'success'}
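Anyone with the UUID (and appropriate visibility permissions) can then retrieve the notebook (a sketch, assuming the pull method available on EDSL objects):

[ ]:
from edsl import Notebook

n = Notebook.pull(info["uuid"])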