Using PDFs in a survey

This notebook provides sample EDSL code demonstrating the ScenarioList.from_pdf() method, which imports a PDF and automatically creates a Scenario object for each page to use as a parameter of survey questions. This can be helpful when using EDSL to efficiently extract qualitative information from a large text.

EDSL is an open-source library for simulating surveys and experiments with AI agents and large language models. Please see our documentation page for tips and tutorials on getting started.

How it works

EDSL comes with a variety of question types that we can select from based on the desired form of the response (multiple choice, free text, etc.). We can also parameterize questions with textual content in order to ask questions about it. We do this by creating a {{ placeholder }} in a question text, e.g., What are the key themes of this text: {{ text }}, and then creating Scenario objects for the content to be inserted in the placeholder when we run the survey. This allows us to administer multiple versions of a question with different inputs all at once. A common use case for this is performing data labeling tasks designed as questions about one or more pieces of textual data that can be inserted into the survey question texts. Learn more about using scenarios.
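The substitution step described above can be pictured with a small standalone sketch. It does not use EDSL and is not how the library is implemented: the render_question helper and the sample passages are hypothetical, and the sketch only handles simple {{ name }} placeholders.

```python
import re


def render_question(question_text: str, scenario: dict) -> str:
    """Replace each {{ key }} placeholder with the matching scenario value."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return str(scenario[key])  # raises KeyError if the scenario lacks the key

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, question_text)


question_text = "What are the key themes of this text: {{ text }}"

# Hypothetical scenarios: one dictionary per piece of content
scenarios = [
    {"text": "First sample passage."},
    {"text": "Second sample passage."},
]

# One rendered prompt per scenario, i.e. one version of the question per input
prompts = [render_question(question_text, s) for s in scenarios]
for prompt in prompts:
    print(prompt)
```

Each scenario yields its own version of the question, which is the same idea as running one survey question across many texts.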


For purposes of demonstration we use a PDF copy of the recent paper Automated Social Science: Language Models as Scientist and Subjects and conduct a survey consisting of several questions about the contents of its first page:


Posting a PDF to Coop using the FileStore module:

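from edsl import FileStore

ass_pdf = FileStore("automated_social_scientist.pdf")

# Post the file to Coop; push() returns details about the stored object
info = ass_pdf.push(
    description = "Automated Social Scientist paper",
    alias = "automated-social-scientist",
    visibility = "public"
)
info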

Info about the object we can use to retrieve it:

{'description': 'Automated Social Scientist paper',
 'object_type': 'scenario',
 'url': '',
 'uuid': 'eccca1bf-1703-4b35-8fe1-b30390eb7786',
 'version': '0.1.47.dev1',
 'visibility': 'public'}

Now that we have stored it at the Coop we can retrieve it (this step can be run with the UUID for any Coop object that you want to import):

from edsl import FileStore
ass_pdf = FileStore.pull('eccca1bf-1703-4b35-8fe1-b30390eb7786')

Next we create a ScenarioList for the pages:

from edsl import ScenarioList

scenarios = ScenarioList.from_pdf(ass_pdf.to_tempfile())

ScenarioList scenarios: 63; keys: ['page', 'filename', 'text'];

We can select pages to use if we do not want to use all of them – e.g., here we filter just the first page to use with our survey:

automated_social_scientist = scenarios.filter("page == 1")

ScenarioList scenarios: 1; keys: ['page', 'filename', 'text'];

  filename page text
0 tmpcui5my9a.pdf 1 Automated Social Science: Language Models as Scientist and Subjects∗ Benjamin S. Manning† MIT Kehang Zhu† Harvard John J. Horton MIT & NBER April 26, 2024 Abstract We present an approach for automatically generating and testing, in silico, social scientific hypotheses. This automation is made possible by recent ad- vances in large language models (LLM), but the key feature of the approach is the use of structural causal models. Structural causal models provide a lan- guage to state hypotheses, a blueprint for constructing LLM-based agents, an experimental design, and a plan for data analysis. The fitted structural causal model becomes an object available for prediction or the planning of follow-on experiments. We demonstrate the approach with several scenarios: a nego- tiation, a bail hearing, a job interview, and an auction. In each case, causal relationships are both proposed and tested by the system, finding evidence for some and not others. We provide evidence that the insights from these simulations of social interactions are not available to the LLM purely through direct elicitation. When given its proposed structural causal model for each scenario, the LLM is good at predicting the signs of estimated effects, but it cannot reliably predict the magnitudes of those estimates. In the auction experiment, the in silico simulation results closely match the predictions of auction theory, but elicited predictions of the clearing prices from the LLM are inaccurate. However, the LLM’s predictions are dramatically improved if the model can condition on the fitted structural causal model. In short, the LLM knows more than it can (immediately) tell. ∗Thanks to generous support from Drew Houston and his AI for Augmentation and Productivity seed grant. Thanks to Jordan Ellenberg, Benjamin Lira Luttges, David Holtz, Bruce Sacerdote, Paul R¨ottger, Mohammed Alsobay, Ray Duch, Matt Schwartz, David Autor, and Dean Eckles for their helpful feedback. 
Author’s contact information, code, and data are currently or will be available at †Both authors contributed equally to this work. 1 arXiv:2404.11794v2 [econ.GN] 25 Apr 2024
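As a rough mental model of the output above, each page can be thought of as a dictionary with the keys filename, page and text, and the filter as a condition evaluated per dictionary. The sketch below is plain Python, not EDSL: the file name and page texts are made-up placeholders, and the list comprehension only stands in for what filter("page == 1") selects.

```python
# Hypothetical per-page records; the texts are placeholders, not real PDF content
page_texts = ["Title page and abstract", "Introduction", "Methods"]

scenarios = [
    {"filename": "example.pdf", "page": number, "text": text}
    for number, text in enumerate(page_texts, start=1)  # pages numbered from 1
]

# Keep only the records whose page number is 1
first_page = [s for s in scenarios if s["page"] == 1]

print(len(scenarios), "pages in total")
print(first_page)
```

A different condition, such as a range of page numbers, can be expressed the same way.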

Here we create a survey of questions that we will administer with the selected PDF page. Note that from_pdf() stores each page's contents under the key text, so the question texts must reference it with the placeholder {{ scenario.text }} (the key can be renamed as desired):

from edsl import QuestionFreeText, QuestionList, ScenarioList, Survey

# question_name sets the name of each answer column (e.g., answer.summary)
q_summary = QuestionFreeText(
    question_name="summary",
    question_text="Briefly summarize the abstract of this paper: {{ scenario.text }}",
)

q_authors = QuestionList(
    question_name="authors",
    question_text="List the names of all the authors of the following paper: {{ scenario.text }}",
)

q_thanks = QuestionList(
    question_name="thanks",
    question_text="List the names of the people thanked in the following paper: {{ scenario.text }}",
)

survey = Survey([q_summary, q_authors, q_thanks])

Now we can add the scenario to the survey and run it:

results = survey.by(automated_social_scientist).run()

We can see a list of all the components of results that are directly accessible:

results.columns

We can select components of the results to inspect and print:

results.select("summary", "authors", "thanks")
  answer.summary answer.authors answer.thanks
0 The paper introduces a method for automatically generating and testing social science hypotheses using large language models (LLMs) and structural causal models. This approach leverages LLMs to create agents and design experiments, while structural causal models help in formulating hypotheses and analyzing data. The fitted models can be used for predictions or further experiments. The authors demonstrate this method through scenarios like negotiations and auctions, where causal relationships are examined. The study finds that while LLMs can predict the direction of effects, they struggle with estimating magnitudes unless conditioned on the causal model. The research shows that LLMs possess implicit knowledge that becomes evident when structured through causal models. ['Benjamin S. Manning', 'Kehang Zhu', 'John J. Horton'] ['Drew Houston', 'Jordan Ellenberg', 'Benjamin Lira Luttges', 'David Holtz', 'Bruce Sacerdote', 'Paul Röttger', 'Mohammed Alsobay', 'Ray Duch', 'Matt Schwartz', 'David Autor', 'Dean Eckles']
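In the output above, the short names passed to select come back as columns prefixed with answer. One way to picture that lookup is the plain-Python sketch below. It is only an illustration, not EDSL's implementation: select_columns is a hypothetical helper and the row is a hand-built dictionary using values shown earlier in this notebook.

```python
def select_columns(row: dict, *names: str) -> dict:
    """Return entries whose full name, or name after the first dot, was requested."""
    selected = {}
    for name in names:
        for column, value in row.items():
            short_name = column.split(".", 1)[-1]  # "answer.authors" -> "authors"
            if name in (column, short_name):
                selected[column] = value
    return selected


# Hand-built row mimicking a few prefixed result columns
row = {
    "answer.authors": ["Benjamin S. Manning", "Kehang Zhu", "John J. Horton"],
    "scenario.page": 1,
    "scenario.filename": "tmpcui5my9a.pdf",
}

picked = select_columns(row, "authors", "page")
print(picked)
```

Both the short name ("authors") and the full name ("answer.authors") resolve to the same column in this sketch.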

Posting to the Coop

The Coop is a platform for creating, storing and sharing LLM-based research. It is fully integrated with EDSL and accessible from your workspace or Coop account page. Learn more about creating an account and using the Coop.

Here we demonstrate how to post this notebook:

from edsl import Notebook

nb = Notebook(path = "scenario_from_pdf.ipynb")

# Set refresh to True to post a new copy; otherwise update the existing object
if refresh := False:
    nb.push(
        description = "Example code for generating scenarios from PDFs",
        alias = "scenario-from-pdf-notebook",
        visibility = "public"
    )
else:
    nb.patch('b0bc949b-e3c9-40f8-b5e9-87e0ea2c8e3a', value = nb)