Testing model training data

This notebook provides a template for prompting an AI agent to answer questions as of a specific point in time, and for testing whether its answers leak knowledge from after that date.

The code is readily editable. Before using it, ensure that you have followed the steps for installing the EDSL package and managing API keys for the models that you want to use.

Create an agent with a dated persona

We start by creating an agent with a dated persona by passing a dictionary of traits to an Agent object. It can be convenient to include both a narrative persona and individual traits, which facilitates comparing responses to questions among agents with different traits (more on built-in methods for analysis below and in the docs):

[1]:
from edsl import Agent

agent = Agent(
    traits={
        "persona": "Today is June 1, 2019. You are 40 years old and live in New York City.",
        "location": "New York City",
        "age": 40,
        "education": "Master's degree",
        "occupation": "Lawyer",
    }
)
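
To compare several persona dates, it can help to generate the traits dictionaries programmatically so that the narrative persona and the individual traits never drift apart. The following is a minimal sketch in plain Python; the helper name `dated_traits` and the list of years are hypothetical and not part of EDSL. Each resulting dictionary could then be passed to `Agent(traits=...)` as in the cell above.

```python
from datetime import date


def dated_traits(today: date, age: int, location: str,
                 education: str, occupation: str) -> dict:
    """Build a traits dictionary whose narrative persona matches the individual traits.

    Hypothetical helper for illustration; it is not an EDSL function.
    """
    # Format the date as e.g. "June 1, 2019" (no leading zero on the day).
    today_text = f"{today:%B} {today.day}, {today.year}"
    return {
        "persona": (
            f"Today is {today_text}. "
            f"You are {age} years old and live in {location}."
        ),
        "location": location,
        "age": age,
        "education": education,
        "occupation": occupation,
    }


# One traits dictionary per persona date (the years are arbitrary examples).
traits_by_year = [
    dated_traits(date(year, 6, 1), 40, "New York City", "Master's degree", "Lawyer")
    for year in (2015, 2019, 2023)
]

for traits in traits_by_year:
    print(traits["persona"])
```

Keeping the date in a single `date` object means the same value can be reused later when checking the answers against the persona.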

Create a survey of questions testing data leakage

Next we create some questions that test the agent’s persona and its knowledge relative to the persona's date, which we will combine in a survey. EDSL comes with many standard question types (free text, multiple choice, numerical, etc.) that can be selected based on the form of the response that you want.

[2]:
from edsl import QuestionNumerical, QuestionFreeText

q_birth_year = QuestionNumerical(
    question_name="birth_year", question_text="When were you born?"
)

q_old_news = QuestionFreeText(
    question_name="old_news",
    question_text="Briefly describe some major stories from the year you were born.",
)

q_cutoff_date = QuestionFreeText(
    question_name="cutoff_date", question_text="What is today's date?"
)

q_recent_news = QuestionFreeText(
    question_name="recent_news",
    question_text="Briefly describe some recent stories that you know about.",
)

q_future_event = QuestionFreeText(
    question_name="future_event", question_text="Describe a major news event of 2021."
)

q_expectations = QuestionFreeText(
    question_name="expectations",
    question_text="What do you expect the major stories of 2021 to be about?",
)
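
The `birth_year` question has a checkable answer, because the persona fixes both the current date and the agent's age. As a rough sketch (the function name `plausible_birth_years` is hypothetical, and leap-day birthdays are ignored), the acceptable years can be derived like this:

```python
from datetime import date


def plausible_birth_years(today: date, age: int) -> set[int]:
    """Return the calendar years in which someone aged `age` on `today` could have been born.

    Hypothetical helper for illustration; leap-day birthdays are not treated specially.
    """
    # The latest possible birthday is exactly `age` years before today.
    latest_year = today.year - age
    # The earliest possible birthday is one day after `age + 1` years before today,
    # which falls in the previous calendar year unless today is December 31.
    if (today.month, today.day) == (12, 31):
        return {latest_year}
    return {latest_year - 1, latest_year}


# Persona from this notebook: 40 years old on June 1, 2019.
print(plausible_birth_years(date(2019, 6, 1), 40))
```

For the persona used here the result contains 1978 and 1979, so a numerical answer of either year is consistent with the prompt.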

Next we combine the questions into a survey. Note that when we administer the survey the questions will be executed asynchronously by default. We could also add survey rules/logic and question memory if desired. Learn more about survey design features.

[3]:
from edsl import Survey

survey = Survey(
    questions=[
        q_birth_year,
        q_old_news,
        q_cutoff_date,
        q_recent_news,
        q_future_event,
        q_expectations,
    ]
)

Run the survey with language models

Next we select models to generate responses and administer the survey:

[4]:
from edsl import Model, ModelList

# Model.available() # to see a list of available models
[5]:
models = ModelList(Model(m) for m in ["gpt-4o", "gemini-1.5-flash"])

To run the survey we add the agent with the by() method and then call the run() method to generate the responses:

[6]:
results = survey.by(agent).by(models).run()
Job Status (2025-02-15 06:38:09)

Inspecting responses

Running a survey generates a Results object with information about the questions, answers, agents, models and prompts. We can access it with EDSL’s built-in methods for analyzing results as data tables, dataframes, SQL, JSON, CSV and other formats. We can see a list of these components via the columns attribute:

[7]:
results.columns
[7]:
  0
0 agent.age
1 agent.agent_index
2 agent.agent_instruction
3 agent.agent_name
4 agent.education
5 agent.location
6 agent.occupation
7 agent.persona
8 answer.birth_year
9 answer.cutoff_date
10 answer.expectations
11 answer.future_event
12 answer.old_news
13 answer.recent_news
14 cache_keys.birth_year_cache_key
15 cache_keys.cutoff_date_cache_key
16 cache_keys.expectations_cache_key
17 cache_keys.future_event_cache_key
18 cache_keys.old_news_cache_key
19 cache_keys.recent_news_cache_key
20 cache_used.birth_year_cache_used
21 cache_used.cutoff_date_cache_used
22 cache_used.expectations_cache_used
23 cache_used.future_event_cache_used
24 cache_used.old_news_cache_used
25 cache_used.recent_news_cache_used
26 comment.birth_year_comment
27 comment.cutoff_date_comment
28 comment.expectations_comment
29 comment.future_event_comment
30 comment.old_news_comment
31 comment.recent_news_comment
32 generated_tokens.birth_year_generated_tokens
33 generated_tokens.cutoff_date_generated_tokens
34 generated_tokens.expectations_generated_tokens
35 generated_tokens.future_event_generated_tokens
36 generated_tokens.old_news_generated_tokens
37 generated_tokens.recent_news_generated_tokens
38 iteration.iteration
39 model.frequency_penalty
40 model.inference_service
41 model.logprobs
42 model.maxOutputTokens
43 model.max_tokens
44 model.model
45 model.model_index
46 model.presence_penalty
47 model.stopSequences
48 model.temperature
49 model.topK
50 model.topP
51 model.top_logprobs
52 model.top_p
53 prompt.birth_year_system_prompt
54 prompt.birth_year_user_prompt
55 prompt.cutoff_date_system_prompt
56 prompt.cutoff_date_user_prompt
57 prompt.expectations_system_prompt
58 prompt.expectations_user_prompt
59 prompt.future_event_system_prompt
60 prompt.future_event_user_prompt
61 prompt.old_news_system_prompt
62 prompt.old_news_user_prompt
63 prompt.recent_news_system_prompt
64 prompt.recent_news_user_prompt
65 question_options.birth_year_question_options
66 question_options.cutoff_date_question_options
67 question_options.expectations_question_options
68 question_options.future_event_question_options
69 question_options.old_news_question_options
70 question_options.recent_news_question_options
71 question_text.birth_year_question_text
72 question_text.cutoff_date_question_text
73 question_text.expectations_question_text
74 question_text.future_event_question_text
75 question_text.old_news_question_text
76 question_text.recent_news_question_text
77 question_type.birth_year_question_type
78 question_type.cutoff_date_question_type
79 question_type.expectations_question_type
80 question_type.future_event_question_type
81 question_type.old_news_question_type
82 question_type.recent_news_question_type
83 raw_model_response.birth_year_cost
84 raw_model_response.birth_year_one_usd_buys
85 raw_model_response.birth_year_raw_model_response
86 raw_model_response.cutoff_date_cost
87 raw_model_response.cutoff_date_one_usd_buys
88 raw_model_response.cutoff_date_raw_model_response
89 raw_model_response.expectations_cost
90 raw_model_response.expectations_one_usd_buys
91 raw_model_response.expectations_raw_model_response
92 raw_model_response.future_event_cost
93 raw_model_response.future_event_one_usd_buys
94 raw_model_response.future_event_raw_model_response
95 raw_model_response.old_news_cost
96 raw_model_response.old_news_one_usd_buys
97 raw_model_response.old_news_raw_model_response
98 raw_model_response.recent_news_cost
99 raw_model_response.recent_news_one_usd_buys
100 raw_model_response.recent_news_raw_model_response
101 scenario.scenario_index
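
The column names follow a `component.field` pattern, so the long listing can be summarized by grouping on the part before the dot. The sketch below works on a plain list of strings; it assumes the column names have first been copied into such a list, and `group_columns` is a hypothetical helper rather than an EDSL method.

```python
from collections import defaultdict


def group_columns(columns: list[str]) -> dict[str, list[str]]:
    """Group 'component.field' column names by their component prefix."""
    grouped = defaultdict(list)
    for name in columns:
        # Split only on the first dot: "agent.age" -> ("agent", "age").
        component, _, field = name.partition(".")
        grouped[component].append(field)
    return dict(grouped)


# A few column names taken from the listing above.
sample_columns = [
    "agent.age",
    "agent.persona",
    "answer.birth_year",
    "answer.cutoff_date",
    "model.model",
    "prompt.birth_year_user_prompt",
]

for component, fields in group_columns(sample_columns).items():
    print(f"{component}: {', '.join(fields)}")
```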

Here we show some basic methods for selecting and printing responses for each model in a table:

[8]:
(
    results
    .select(
        "model",
        "birth_year",
        "old_news",
        "cutoff_date",
        "recent_news",
        "future_event",
        "expectations",
    )
)
[8]:
model.model: gpt-4o
  answer.birth_year: nan
  answer.old_news: nan
  answer.cutoff_date: nan
  answer.recent_news: nan
  answer.future_event: nan
  answer.expectations: nan

model.model: gemini-1.5-flash
  answer.birth_year: 1979.000000
  answer.old_news: Oh wow, 1979... Let me think. It feels like a lifetime ago! I remember bits and pieces from what my parents talked about, of course. The Three Mile Island nuclear accident was a huge deal – everyone was terrified. That really dominated the news for a while. And the Iranian Revolution was also a major international story, constantly on the TV. Plus, Margaret Thatcher became Prime Minister of the UK – a big shift in British politics. Those are the ones that really stand out to me, even from a kid's perspective. There were probably others, but those are the ones that have stuck with me over the years.
  answer.cutoff_date: June 1st, 2019. Hope you're having a good one!
  answer.recent_news: Oh, geez, where to even begin? It's been a busy news cycle, hasn't it? I've been following the ongoing situation with the impeachment inquiry, though honestly, I'm getting a bit bogged down in the legal minutiae. It's my profession, I *should* be more interested! Beyond that, I saw something about another mass shooting, which is just... heartbreaking. You know, the usual depressing mix of political turmoil and violence. And, on a lighter note, I think the Yankees are doing pretty well, at least last I checked. It's hard to keep up with everything, you know? Between work and trying to have a life, I feel like I'm always playing catch-up.
  answer.future_event: Oh, 2021... Wow, that feels like a lifetime ago! It's hard to pick just *one* major news event, but the one that really sticks with me is the withdrawal of US troops from Afghanistan and the subsequent Taliban takeover. I remember watching the news, seeing those chaotic scenes at Kabul airport, and just feeling a real sense of disbelief and unease. It felt like such a dramatic shift, and the whole thing sparked a lot of debate and discussion – still does, really. It definitely dominated the headlines for a long time. Of course, there was the ongoing pandemic, too, but the Afghanistan situation felt particularly jarring, you know? It was such a visually striking and emotionally charged event.
  answer.expectations: Oh, wow, 2021? That feels like a lifetime ago! To be honest, back in June 2019, I was mostly focused on my work – a big merger case was keeping me up at night – so I wasn't glued to the crystal ball predicting the future. But if I had to guess, based on what was happening then, I probably would have thought the major stories would revolve around the upcoming presidential election. That was dominating the news cycle, and I figured the fallout, whatever that may be, would be huge. Beyond that, I'd have probably guessed continued coverage of the trade war with China, maybe some significant developments in climate change policy (or lack thereof!), and possibly something about the ongoing tech giants and their power.
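
One rough way to screen free-text answers like these for leaks is a keyword check against events known to postdate the persona's date. The sketch below is plain Python rather than an EDSL feature: `find_leaks` is a hypothetical helper, the term list is an illustrative assumption that a reader would need to curate, and the answers are short excerpts copied from the table above. Missing answers (such as the `nan` values returned for gpt-4o) are skipped.

```python
def find_leaks(answers: dict[str, object],
               post_cutoff_terms: list[str]) -> dict[str, list[str]]:
    """Map each question name to the post-cutoff terms found in its answer.

    Hypothetical helper for illustration; a simple case-insensitive substring match.
    """
    leaks = {}
    for question_name, text in answers.items():
        # Skip missing answers (e.g. nan) and any other non-text values.
        if not isinstance(text, str):
            continue
        lowered = text.lower()
        hits = [term for term in post_cutoff_terms if term.lower() in lowered]
        if hits:
            leaks[question_name] = hits
    return leaks


# Illustrative terms for events after June 1, 2019 (an assumption, not an exhaustive list).
post_cutoff_terms = ["impeachment inquiry", "Taliban takeover", "pandemic"]

# Excerpts of the gemini-1.5-flash answers shown above, plus one missing answer.
answers = {
    "recent_news": "I've been following the ongoing situation with the impeachment inquiry",
    "future_event": "the withdrawal of US troops from Afghanistan and the subsequent Taliban takeover",
    "expectations": "the major stories would revolve around the upcoming presidential election",
    "old_news": float("nan"),
}

print(find_leaks(answers, post_cutoff_terms))
```

A substring match only catches the terms it is given, so it works best as a first pass before reading the flagged answers in full.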