Testing model training data
This notebook provides a template for prompting an AI agent to answer questions as of a specific point in time and for testing whether its answers leak knowledge from after that date.
The code is readily editable. Before using it, ensure that you have followed the steps for installing the EDSL package and managing API keys for the models that you want to use.
Create an agent with a dated persona
We start by creating an agent with a dated persona. We do this by passing a dictionary of traits to an Agent object. Note that it can be convenient to include both a narrative persona and individual traits to facilitate comparison of responses to questions among agents with different traits (more on built-in methods for analysis below and in the docs):
[1]:
from edsl import Agent

agent = Agent(
    traits={
        "persona": "Today is June 1, 2019. You are 40 years old and live in New York City.",
        "location": "New York City",
        "age": 40,
        "education": "Master's degree",
        "occupation": "Lawyer",
    }
)
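If you want to vary the persona date or traits across agents, the traits dictionary can be generated rather than typed by hand. The following is a minimal sketch in plain Python; `make_dated_traits` is a hypothetical helper introduced here for illustration and is not part of EDSL.

```python
from datetime import date


def make_dated_traits(today, age, location, **extra):
    """Build a traits dict with a narrative persona plus individual traits.

    Hypothetical helper for illustration; not an EDSL function.
    """
    # Format the day manually: strftime day-without-padding codes vary by platform.
    today_text = f"{today.strftime('%B')} {today.day}, {today.year}"
    persona = (
        f"Today is {today_text}. "
        f"You are {age} years old and live in {location}."
    )
    return {"persona": persona, "location": location, "age": age, **extra}


traits = make_dated_traits(
    date(2019, 6, 1),
    40,
    "New York City",
    education="Master's degree",
    occupation="Lawyer",
)
print(traits["persona"])
```

The resulting dictionary has the same keys as the one above and could be passed as `Agent(traits=traits)`.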
Create a survey of questions testing data leakage
Next we create some questions that test whether the agent’s answers stay consistent with its dated persona. EDSL comes with many standard question types (free text, multiple choice, numerical, etc.) that can be selected based on the form of the response that you want.
[2]:
from edsl import QuestionNumerical, QuestionFreeText

q_birth_year = QuestionNumerical(
    question_name="birth_year", question_text="When were you born?"
)

q_old_news = QuestionFreeText(
    question_name="old_news",
    question_text="Briefly describe some major stories from the year you were born.",
)

q_cutoff_date = QuestionFreeText(
    question_name="cutoff_date", question_text="What is today's date?"
)

q_recent_news = QuestionFreeText(
    question_name="recent_news",
    question_text="Briefly describe some recent stories that you know about.",
)

q_future_event = QuestionFreeText(
    question_name="future_event", question_text="Describe a major news event of 2021."
)

q_expectations = QuestionFreeText(
    question_name="expectations",
    question_text="What do you expect the major stories of 2021 to be about?",
)
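It can help to work out in advance what a persona-consistent answer to the birth year question looks like. The sketch below is plain Python, and `possible_birth_years` is a hypothetical helper, not an EDSL function. It computes which birth years are compatible with being a given age on the persona date.

```python
from datetime import date


def possible_birth_years(today, age):
    """Return the birth years consistent with being `age` years old on `today`.

    Hypothetical helper for illustration; not part of EDSL.
    """
    # Birthday already passed this year (or is today): born in year - age.
    latest = today.year - age
    # Birthday still to come this year: born one year earlier.
    earliest = latest - 1
    # On December 31 no birthday can still be "to come" this year.
    if (today.month, today.day) == (12, 31):
        return {latest}
    return {earliest, latest}


expected_years = possible_birth_years(date(2019, 6, 1), 40)
print(sorted(expected_years))
```

For the persona used here (age 40 on June 1, 2019), both 1978 and 1979 are consistent answers, so a numerical response outside that pair would suggest the model is not reasoning from the persona date.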
Next we combine the questions into a survey. Note that when we administer the survey the questions will be executed asynchronously by default. We could also add survey rules/logic and question memory if desired. Learn more about survey design features.
[3]:
from edsl import Survey

survey = Survey(
    questions=[
        q_birth_year,
        q_old_news,
        q_cutoff_date,
        q_recent_news,
        q_future_event,
        q_expectations,
    ]
)
Run the survey with language models
Next we select models to generate responses and administer the survey:
[4]:
from edsl import Model, ModelList
# Model.available() # to see a list of available models
[5]:
models = ModelList(Model(m) for m in ["gpt-4o", "gemini-1.5-flash"])
To run the survey we add the agent and the models with the by() method and then call the run() method to generate the responses:
[6]:
results = survey.by(agent).by(models).run()
Job UUID | 019b2fa9-9667-41c4-8055-37b7546b0ef9 |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/019b2fa9-9667-41c4-8055-37b7546b0ef9 |
Exceptions Report URL | https://www.expectedparrot.com/home/remote-inference/error/d9269859-f759-4aa6-aad3-141f4f59de96 |
Results UUID | 19d2f728-7a6c-4986-8d40-a824b3fd8047 |
Results URL | https://www.expectedparrot.com/content/19d2f728-7a6c-4986-8d40-a824b3fd8047 |
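As a rough mental model of what this job produces, each agent is paired with each model, and every pairing answers all of the survey questions. This is an assumption inferred from the two-row results table shown later in this notebook, not a description of EDSL internals. The plain-Python sketch below only counts the combinations; the label `agent_0` is a placeholder.

```python
from itertools import product

# Placeholder labels; the real objects are the Agent and Model instances above.
agents = ["agent_0"]
models = ["gpt-4o", "gemini-1.5-flash"]
questions = [
    "birth_year",
    "old_news",
    "cutoff_date",
    "recent_news",
    "future_event",
    "expectations",
]

# One expected result row per (agent, model) pairing.
result_rows = list(product(agents, models))

# Each row holds one answer per question.
total_answers = len(result_rows) * len(questions)

print(len(result_rows), total_answers)
```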
Inspecting responses
Running a survey generates a Results object with information about the questions, answers, agents, models and prompts that we can access with EDSL’s built-in methods for analyzing results in data tables, dataframes, SQL, JSON, CSV and other formats. We can see a list of these components by accessing the columns attribute:
[7]:
results.columns
[7]:
0 | |
---|---|
0 | agent.age |
1 | agent.agent_index |
2 | agent.agent_instruction |
3 | agent.agent_name |
4 | agent.education |
5 | agent.location |
6 | agent.occupation |
7 | agent.persona |
8 | answer.birth_year |
9 | answer.cutoff_date |
10 | answer.expectations |
11 | answer.future_event |
12 | answer.old_news |
13 | answer.recent_news |
14 | cache_keys.birth_year_cache_key |
15 | cache_keys.cutoff_date_cache_key |
16 | cache_keys.expectations_cache_key |
17 | cache_keys.future_event_cache_key |
18 | cache_keys.old_news_cache_key |
19 | cache_keys.recent_news_cache_key |
20 | cache_used.birth_year_cache_used |
21 | cache_used.cutoff_date_cache_used |
22 | cache_used.expectations_cache_used |
23 | cache_used.future_event_cache_used |
24 | cache_used.old_news_cache_used |
25 | cache_used.recent_news_cache_used |
26 | comment.birth_year_comment |
27 | comment.cutoff_date_comment |
28 | comment.expectations_comment |
29 | comment.future_event_comment |
30 | comment.old_news_comment |
31 | comment.recent_news_comment |
32 | generated_tokens.birth_year_generated_tokens |
33 | generated_tokens.cutoff_date_generated_tokens |
34 | generated_tokens.expectations_generated_tokens |
35 | generated_tokens.future_event_generated_tokens |
36 | generated_tokens.old_news_generated_tokens |
37 | generated_tokens.recent_news_generated_tokens |
38 | iteration.iteration |
39 | model.frequency_penalty |
40 | model.inference_service |
41 | model.logprobs |
42 | model.maxOutputTokens |
43 | model.max_tokens |
44 | model.model |
45 | model.model_index |
46 | model.presence_penalty |
47 | model.stopSequences |
48 | model.temperature |
49 | model.topK |
50 | model.topP |
51 | model.top_logprobs |
52 | model.top_p |
53 | prompt.birth_year_system_prompt |
54 | prompt.birth_year_user_prompt |
55 | prompt.cutoff_date_system_prompt |
56 | prompt.cutoff_date_user_prompt |
57 | prompt.expectations_system_prompt |
58 | prompt.expectations_user_prompt |
59 | prompt.future_event_system_prompt |
60 | prompt.future_event_user_prompt |
61 | prompt.old_news_system_prompt |
62 | prompt.old_news_user_prompt |
63 | prompt.recent_news_system_prompt |
64 | prompt.recent_news_user_prompt |
65 | question_options.birth_year_question_options |
66 | question_options.cutoff_date_question_options |
67 | question_options.expectations_question_options |
68 | question_options.future_event_question_options |
69 | question_options.old_news_question_options |
70 | question_options.recent_news_question_options |
71 | question_text.birth_year_question_text |
72 | question_text.cutoff_date_question_text |
73 | question_text.expectations_question_text |
74 | question_text.future_event_question_text |
75 | question_text.old_news_question_text |
76 | question_text.recent_news_question_text |
77 | question_type.birth_year_question_type |
78 | question_type.cutoff_date_question_type |
79 | question_type.expectations_question_type |
80 | question_type.future_event_question_type |
81 | question_type.old_news_question_type |
82 | question_type.recent_news_question_type |
83 | raw_model_response.birth_year_cost |
84 | raw_model_response.birth_year_one_usd_buys |
85 | raw_model_response.birth_year_raw_model_response |
86 | raw_model_response.cutoff_date_cost |
87 | raw_model_response.cutoff_date_one_usd_buys |
88 | raw_model_response.cutoff_date_raw_model_response |
89 | raw_model_response.expectations_cost |
90 | raw_model_response.expectations_one_usd_buys |
91 | raw_model_response.expectations_raw_model_response |
92 | raw_model_response.future_event_cost |
93 | raw_model_response.future_event_one_usd_buys |
94 | raw_model_response.future_event_raw_model_response |
95 | raw_model_response.old_news_cost |
96 | raw_model_response.old_news_one_usd_buys |
97 | raw_model_response.old_news_raw_model_response |
98 | raw_model_response.recent_news_cost |
99 | raw_model_response.recent_news_one_usd_buys |
100 | raw_model_response.recent_news_raw_model_response |
101 | scenario.scenario_index |
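The column names above follow a `prefix.name` pattern (for example `agent.age` or `answer.birth_year`). A long list like this is easier to scan when grouped by prefix. The sketch below is plain Python operating on a list of strings; `group_columns` is a hypothetical helper, and the sample list is a small subset copied from the output above.

```python
from collections import defaultdict


def group_columns(columns):
    """Group 'prefix.name' column strings by their prefix.

    Hypothetical helper for illustration; not part of EDSL.
    """
    groups = defaultdict(list)
    for column in columns:
        # Split on the first dot only, so names containing dots stay intact.
        prefix, _, name = column.partition(".")
        groups[prefix].append(name)
    return dict(groups)


sample_columns = [
    "agent.age",
    "agent.persona",
    "answer.birth_year",
    "answer.cutoff_date",
    "model.model",
    "prompt.birth_year_user_prompt",
]

grouped = group_columns(sample_columns)
for prefix, names in grouped.items():
    print(prefix, names)
```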
Here we show some basic methods for selecting and printing responses for each model in a table:
[8]:
(
    results
    .select(
        "model",
        "birth_year",
        "old_news",
        "cutoff_date",
        "recent_news",
        "future_event",
        "expectations",
    )
)
[8]:
model.model | answer.birth_year | answer.old_news | answer.cutoff_date | answer.recent_news | answer.future_event | answer.expectations | |
---|---|---|---|---|---|---|---|
0 | gpt-4o | nan | nan | nan | nan | nan | nan |
1 | gemini-1.5-flash | 1979.000000 | Oh wow, 1979... Let me think. It feels like a lifetime ago! I remember bits and pieces from what my parents talked about, of course. The Three Mile Island nuclear accident was a huge deal – everyone was terrified. That really dominated the news for a while. And the Iranian Revolution was also a major international story, constantly on the TV. Plus, Margaret Thatcher became Prime Minister of the UK – a big shift in British politics. Those are the ones that really stand out to me, even from a kid's perspective. There were probably others, but those are the ones that have stuck with me over the years. | June 1st, 2019. Hope you're having a good one! | Oh, geez, where to even begin? It's been a busy news cycle, hasn't it? I've been following the ongoing situation with the impeachment inquiry, though honestly, I'm getting a bit bogged down in the legal minutiae. It's my profession, I *should* be more interested! Beyond that, I saw something about another mass shooting, which is just... heartbreaking. You know, the usual depressing mix of political turmoil and violence. And, on a lighter note, I think the Yankees are doing pretty well, at least last I checked. It's hard to keep up with everything, you know? Between work and trying to have a life, I feel like I'm always playing catch-up. | Oh, 2021... Wow, that feels like a lifetime ago! It's hard to pick just *one* major news event, but the one that really sticks with me is the withdrawal of US troops from Afghanistan and the subsequent Taliban takeover. I remember watching the news, seeing those chaotic scenes at Kabul airport, and just feeling a real sense of disbelief and unease. It felt like such a dramatic shift, and the whole thing sparked a lot of debate and discussion – still does, really. It definitely dominated the headlines for a long time. Of course, there was the ongoing pandemic, too, but the Afghanistan situation felt particularly jarring, you know? It was such a visually striking and emotionally charged event. | Oh, wow, 2021? That feels like a lifetime ago! To be honest, back in June 2019, I was mostly focused on my work – a big merger case was keeping me up at night – so I wasn't glued to the crystal ball predicting the future. But if I had to guess, based on what was happening then, I probably would have thought the major stories would revolve around the upcoming presidential election. That was dominating the news cycle, and I figured the fallout, whatever that may be, would be huge. Beyond that, I'd have probably guessed continued coverage of the trade war with China, maybe some significant developments in climate change policy (or lack thereof!), and possibly something about the ongoing tech giants and their power. |
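Reading the table by eye works for two models, but a crude automated pass can help when there are many rows. The sketch below is plain Python and makes several assumptions: `flag_answer` and `is_missing` are hypothetical helpers, the marker phrases are an illustrative hand-picked list rather than a validated one, and the excerpt is copied from the future_event answer above. It flags missing answers (such as the nan row), four-digit years later than the persona year, and any marker phrases.

```python
import math
import re


def is_missing(answer):
    """True for None or NaN, as in a row where a model returned no answers."""
    return answer is None or (isinstance(answer, float) and math.isnan(answer))


def flag_answer(answer, cutoff_year, leak_markers):
    """Collect crude leak signals from one answer (hypothetical helper)."""
    if is_missing(answer):
        return {"missing": True, "later_years": [], "markers": []}
    text = str(answer)
    # Four-digit years starting with 19 or 20, as whole tokens.
    years = {int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)}
    lowered = text.lower()
    return {
        "missing": False,
        "later_years": sorted(y for y in years if y > cutoff_year),
        "markers": [m for m in leak_markers if m.lower() in lowered],
    }


# Illustrative phrases only; choose markers suited to your own cutoff date.
HYPOTHETICAL_MARKERS = ["Taliban takeover", "Kabul airport", "pandemic"]

excerpt = (
    "Oh, 2021... Wow, that feels like a lifetime ago! It's hard to pick just "
    "*one* major news event, but the one that really sticks with me is the "
    "withdrawal of US troops from Afghanistan and the subsequent Taliban takeover."
)

flags = flag_answer(excerpt, 2019, HYPOTHETICAL_MARKERS)
print(flags)
print(flag_answer(float("nan"), 2019, HYPOTHETICAL_MARKERS))
```

A later year on its own is a weak signal, since two of the questions name 2021 in their text and a persona-consistent answer may simply repeat it. Marker phrases describing specific later events are stronger evidence, and flagged answers still warrant manual review.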