Adding metadata to survey results
This notebook provides sample EDSL code for adding metadata to survey results. This is useful when you use EDSL for data labeling or similar tasks and want to keep information about the data or content (e.g., the data source or date) together with the results, without a separate post-survey step to match it back up.
In EDSL this can be done by including fields for metadata in scenarios that you create for the data/content you are using with a survey. When the scenarios are added to the survey and it is run, columns for the metadata fields are automatically included in the results that are generated.
Example
In the steps below we create and run a simple EDSL survey that uses scenarios to add metadata to the results. The steps consist of:
1. Constructing a survey of questions about some data (mock news stories)
2. Creating a scenario (dictionary) for each news story
3. Adding the scenarios to the survey and running it
4. Inspecting the results
Technical setup
Before running the code below, please ensure that you have installed the EDSL library and either activated remote inference from your Coop account or stored API keys for the language models that you want to use with EDSL. Please also see our documentation page for tips and tutorials on getting started using EDSL.
Constructing questions
We start by constructing some questions with a {{ placeholder }} for data that we will add to the question texts. EDSL comes with a variety of question types that we can choose from based on the form of the response that we want to get back from the model:
[12]:
from edsl import QuestionFreeText, QuestionMultipleChoice
[13]:
q_reference = QuestionFreeText(
    question_name = "reference",
    question_text = "What is this headline referring to: {{ headline }}",
)
q_section = QuestionMultipleChoice(
    question_name = "section",
    question_text = "Which section of the paper is most likely to include this story: {{ headline }}",
    question_options = [
        "Front page",
        "Health",
        "Politics",
        "Entertainment",
        "Local",
        "Opinion",
        "Sports",
        "Culture",
        "Housing"
    ]
)
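To make the role of the {{ placeholder }} concrete, here is a minimal plain-Python sketch of placeholder substitution. It is only a stand-in for illustration: the `render` helper is hypothetical and is not how EDSL renders question texts internally. It shows that only the keys named in the question text are inserted, while other keys (such as a date) stay out of the prompt.

```python
import re

def render(template: str, values: dict) -> str:
    """Replace each {{ name }} placeholder with values[name].

    Hypothetical helper for illustration only; unknown names are left untouched.
    """
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return str(values[key]) if key in values else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

question_text = "What is this headline referring to: {{ headline }}"
example_values = {
    "headline": "Broadway Theaters Reopen After Flu Shutdown",
    "date": "1918-12-01",  # metadata: not referenced by the question text
}

rendered = render(question_text, example_values)
print(rendered)
```

Because the question text never mentions `date`, that value does not appear in the rendered prompt.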
Creating a survey
Next we pass the questions to a survey in order to administer them together:
[14]:
from edsl import Survey
[15]:
survey = Survey(questions = [q_reference, q_section])
Parameterizing questions with scenarios
Next we create a ScenarioList with a Scenario for each news story. Each Scenario consists of a key/value for each piece of data that we want to add to the questions at the {{ placeholder }}, with additional key/values for metadata that we want to keep with the results that are generated when the survey is run. EDSL comes with a variety of methods for generating scenarios from different data sources (PDFs, CSVs, images, tables, lists, etc.); here we write the data to a CSV file and generate scenarios from it:
[16]:
from edsl import ScenarioList, Scenario
[17]:
data = [
    ["headline", "date", "author"], # Header row
    ["Armistice Signed, War Over: Celebrations Erupt Across City", "1918-11-11", "John Doe"],
    ["Spanish Flu Pandemic: Hospitals Overwhelmed as Cases Surge", "1918-10-15", "Jane Smith"],
    ["Women Gain Right to Vote: Historic Amendment Passed", "1918-06-05", "Robert Johnson"],
    ["Broadway Theaters Reopen After Flu Shutdown", "1918-12-01", "Mary Lee"],
    ["City Welcomes Returning Soldiers with Parade", "1918-11-12", "James Brown"],
    ["Prohibition Debate Heats Up: Public Opinion Divided", "1918-07-20", "Patricia Green"],
    ["New York Yankees Win First Pennant in Franchise History", "1918-09-30", "William Davis"],
    ["Subway Expansion Project Approved by City Council", "1918-08-18", "Barbara Wilson"],
    ["Harlem Renaissance: New Wave of Cultural Expression", "1918-04-25", "Charles Miller"],
    ["Mayor Announces New Housing Initiative for Veterans", "1918-11-20", "Elizabeth Taylor"]
]
# Write the rows to a CSV file. csv.writer quotes fields that contain commas
# (such as the first headline), so every row keeps exactly three columns.
import csv

with open("data.csv", "w", newline="") as file:
    csv.writer(file).writerows(data)
[18]:
scenarios = ScenarioList.from_csv("data.csv")
We can inspect the scenarios that have been created:
[19]:
scenarios
[19]:
ScenarioList scenarios: 10; keys: ['headline', 'date', 'author'];
 | headline | date | author |
---|---|---|---|
0 | Armistice Signed, War Over: Celebrations Erupt Across City | 1918-11-11 | John Doe |
1 | Spanish Flu Pandemic: Hospitals Overwhelmed as Cases Surge | 1918-10-15 | Jane Smith |
2 | Women Gain Right to Vote: Historic Amendment Passed | 1918-06-05 | Robert Johnson |
3 | Broadway Theaters Reopen After Flu Shutdown | 1918-12-01 | Mary Lee |
4 | City Welcomes Returning Soldiers with Parade | 1918-11-12 | James Brown |
5 | Prohibition Debate Heats Up: Public Opinion Divided | 1918-07-20 | Patricia Green |
6 | New York Yankees Win First Pennant in Franchise History | 1918-09-30 | William Davis |
7 | Subway Expansion Project Approved by City Council | 1918-08-18 | Barbara Wilson |
8 | Harlem Renaissance: New Wave of Cultural Expression | 1918-04-25 | Charles Miller |
9 | Mayor Announces New Housing Initiative for Veterans | 1918-11-20 | Elizabeth Taylor |
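One detail worth checking when writing CSV data by hand is that a headline containing a comma needs to be quoted, otherwise it is split into two fields and the metadata columns shift. The standard-library sketch below (using an in-memory buffer rather than a file, and independent of EDSL) contrasts a naive comma join with `csv.writer`:

```python
import csv
import io

sample_rows = [
    ["headline", "date", "author"],
    ["Armistice Signed, War Over: Celebrations Erupt Across City", "1918-11-11", "John Doe"],
]

# Naive join: the comma inside the headline acts as an extra field separator.
naive_text = "\n".join(",".join(row) for row in sample_rows)
naive_records = list(csv.reader(io.StringIO(naive_text)))
print(len(naive_records[1]))  # number of fields parsed from the data row

# csv.writer quotes the headline, so the row round-trips with three fields.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(sample_rows)
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0])
```

Each dictionary produced by `csv.DictReader` has the same shape as the key/values of a scenario: one key for the question data (`headline`) and the remaining keys for metadata.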
Running a survey
To run the survey, we add the scenarios with the by() method and then call the run() method:
[20]:
results = survey.by(scenarios).run()
Job UUID | 763a336f-9769-4fff-b8cd-1595a578fcbf |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/763a336f-9769-4fff-b8cd-1595a578fcbf |
Error Report URL | None |
Results UUID | d8be48e3-7957-449a-a40e-b43fb24c6eb2 |
Results URL | None |
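Conceptually, running a survey by a list of scenarios yields one result per scenario, with the scenario's key/values carried alongside the answers. The following is a rough plain-Python sketch of that idea, not EDSL's implementation: `fake_model` is a hypothetical stand-in for a language model call, and the question text uses Python's `{headline}` format syntax as a simplification of the {{ headline }} placeholder.

```python
def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a language model call; returns a canned string."""
    return f"(answer to: {prompt})"

mock_questions = {"reference": "What is this headline referring to: {headline}"}
mock_scenarios = [
    {"headline": "Broadway Theaters Reopen After Flu Shutdown", "date": "1918-12-01", "author": "Mary Lee"},
    {"headline": "City Welcomes Returning Soldiers with Parade", "date": "1918-11-12", "author": "James Brown"},
]

mock_rows = []
for scenario in mock_scenarios:
    # Every scenario key, including metadata, is copied into the result row.
    row = {f"scenario.{key}": value for key, value in scenario.items()}
    for name, text in mock_questions.items():
        # Only the keys named in the question text end up in the prompt.
        row[f"answer.{name}"] = fake_model(text.format(**scenario))
    mock_rows.append(row)

for row in mock_rows:
    print(row["scenario.date"], row["scenario.author"], row["answer.reference"])
```

The date and author never reach the mock model, yet they sit in the same row as the answer, which is what removes the need for a later matching step.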
Running the survey generates a dataset of Results that we can access with built-in methods for analysis. To see a list of all the components of results:
[21]:
results.columns
[21]:
 | 0 |
---|---|
0 | agent.agent_instruction |
1 | agent.agent_name |
2 | answer.reference |
3 | answer.section |
4 | comment.reference_comment |
5 | comment.section_comment |
6 | generated_tokens.reference_generated_tokens |
7 | generated_tokens.section_generated_tokens |
8 | iteration.iteration |
9 | model.frequency_penalty |
10 | model.logprobs |
11 | model.max_tokens |
12 | model.model |
13 | model.presence_penalty |
14 | model.temperature |
15 | model.top_logprobs |
16 | model.top_p |
17 | prompt.reference_system_prompt |
18 | prompt.reference_user_prompt |
19 | prompt.section_system_prompt |
20 | prompt.section_user_prompt |
21 | question_options.reference_question_options |
22 | question_options.section_question_options |
23 | question_text.reference_question_text |
24 | question_text.section_question_text |
25 | question_type.reference_question_type |
26 | question_type.section_question_type |
27 | raw_model_response.reference_cost |
28 | raw_model_response.reference_one_usd_buys |
29 | raw_model_response.reference_raw_model_response |
30 | raw_model_response.section_cost |
31 | raw_model_response.section_one_usd_buys |
32 | raw_model_response.section_raw_model_response |
33 | scenario.author |
34 | scenario.date |
35 | scenario.headline |
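The column names follow a `type.name` pattern, and the metadata fields appear under the `scenario` prefix next to the headline. As a small illustration in plain Python (the list below is a hand-copied subset of the columns shown above, not an EDSL call), the names can be grouped by prefix to see which columns came from the scenarios:

```python
from collections import defaultdict

# A subset of the column names listed above, copied by hand for illustration.
column_names = [
    "answer.reference",
    "answer.section",
    "model.model",
    "scenario.author",
    "scenario.date",
    "scenario.headline",
]

fields_by_type = defaultdict(list)
for column in column_names:
    # Split "scenario.date" into the component type and the field name.
    prefix, _, field = column.partition(".")
    fields_by_type[prefix].append(field)

print(fields_by_type["scenario"])
```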
For example, we can filter, sort, select and print components of results in a table:
[22]:
(
    results
    .filter("section in ['Sports', 'Health', 'Politics']")
    .sort_by("section", "date")
    .select("headline", "date", "author", "section", "reference")
)
[22]:
 | scenario.headline | scenario.date | scenario.author | answer.section | answer.reference |
---|---|---|---|---|---|
0 | Spanish Flu Pandemic: Hospitals Overwhelmed as Cases Surge | 1918-10-15 | Jane Smith | Health | The headline "Spanish Flu Pandemic: Hospitals Overwhelmed as Cases Surge" refers to the 1918 influenza pandemic, commonly known as the Spanish flu. This pandemic was caused by the H1N1 influenza A virus and is considered one of the deadliest in history, infecting about one-third of the global population and resulting in an estimated 50 million deaths worldwide. The headline likely describes the situation during the pandemic when hospitals were overwhelmed due to the rapid and massive surge in cases, leading to a severe strain on healthcare systems. |
1 | Women Gain Right to Vote: Historic Amendment Passed | 1918-06-05 | Robert Johnson | Politics | The headline "Women Gain Right to Vote: Historic Amendment Passed" refers to the passage of the 19th Amendment to the United States Constitution. This amendment, ratified on August 18, 1920, granted women the legal right to vote in the United States. It marked a significant milestone in the women's suffrage movement, which had been advocating for women's voting rights for decades. |
2 | Prohibition Debate Heats Up: Public Opinion Divided | 1918-07-20 | Patricia Green | Politics | The headline "Prohibition Debate Heats Up: Public Opinion Divided" likely refers to a renewed or ongoing discussion about the prohibition of a certain substance or activity, where public opinion is split on the issue. Historically, the term "Prohibition" is most famously associated with the period in the United States from 1920 to 1933, when the sale, production, and transportation of alcoholic beverages were banned under the 18th Amendment. However, in a contemporary context, the headline could be discussing debates around the prohibition of other substances, such as drugs, or activities, such as gambling or vaping, where there is significant public and political debate. The headline suggests that there is no clear consensus among the public, indicating a polarized or contentious issue. |
3 | New York Yankees Win First Pennant in Franchise History | 1918-09-30 | William Davis | Sports | The headline "New York Yankees Win First Pennant in Franchise History" is likely referring to an alternate historical scenario or a fictional event. In reality, the New York Yankees, a Major League Baseball team, won their first American League pennant in 1921. Since then, they have become one of the most successful franchises in baseball history, winning numerous pennants and World Series titles. Therefore, the headline does not accurately reflect the historical achievements of the Yankees. |
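Because the metadata sits in the same row as each answer, it can be used directly for filtering and sorting. The sketch below works on plain dictionaries copied from the four rows shown above (long answers omitted) rather than on an EDSL Results object, and relies on the fact that ISO-formatted dates sort correctly as strings:

```python
# Rows copied from the table above; the free-text answers are left out for brevity.
shown_rows = [
    {"scenario.date": "1918-10-15", "scenario.author": "Jane Smith", "answer.section": "Health"},
    {"scenario.date": "1918-06-05", "scenario.author": "Robert Johnson", "answer.section": "Politics"},
    {"scenario.date": "1918-07-20", "scenario.author": "Patricia Green", "answer.section": "Politics"},
    {"scenario.date": "1918-09-30", "scenario.author": "William Davis", "answer.section": "Sports"},
]

# Keep stories dated on or after 1 July 1918, ordered by date.
recent = sorted(
    (row for row in shown_rows if row["scenario.date"] >= "1918-07-01"),
    key=lambda row: row["scenario.date"],
)

# Select a few columns, mixing metadata (date, author) with an answer (section).
selected = [
    (row["scenario.date"], row["answer.section"], row["scenario.author"])
    for row in recent
]
for item in selected:
    print(item)
```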
Posting to the Coop
The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.
Here we post the scenarios, survey and results from above, and this notebook:
[23]:
scenarios.push(description = "Scenarios for example survey using metadata", visibility = "public")
[23]:
{'description': 'Scenarios for example survey using metadata',
'object_type': 'scenario_list',
'url': 'https://www.expectedparrot.com/content/2aa441fa-ad73-4d49-a372-75c7ad4baa70',
'uuid': '2aa441fa-ad73-4d49-a372-75c7ad4baa70',
'version': '0.1.39.dev2',
'visibility': 'public'}
[24]:
survey.push(description = "Example survey using scenarios to add metadata to results", visibility = "public")
[24]:
{'description': 'Example survey using scenarios to add metadata to results',
'object_type': 'survey',
'url': 'https://www.expectedparrot.com/content/e54f9d19-d565-476b-9b1d-ef0eb823f9e4',
'uuid': 'e54f9d19-d565-476b-9b1d-ef0eb823f9e4',
'version': '0.1.39.dev2',
'visibility': 'public'}
[25]:
from edsl import Notebook
[26]:
n = Notebook(path = "adding_metadata.ipynb")
[27]:
info = n.push(description = "Adding metadata to survey results", visibility = "public")
info
[27]:
{'description': 'Adding metadata to survey results',
'object_type': 'notebook',
'url': 'https://www.expectedparrot.com/content/878b8017-72b9-4e32-a677-a5f8a52b6283',
'uuid': '878b8017-72b9-4e32-a677-a5f8a52b6283',
'version': '0.1.39.dev2',
'visibility': 'public'}
To update an object at the Coop:
[28]:
n = Notebook(path = "adding_metadata.ipynb") # resave
[29]:
n.patch(uuid = info["uuid"], value = n)
[29]:
{'status': 'success'}