Data labeling with LLMs, validating with humans

This notebook provides example EDSL code for conducting a data labeling task with large language models and validating the responses with humans. The example consists of the following steps, which can be carried out entirely in EDSL code or interactively in your Coop account:

Construct questions about a dataset, using a placeholder in each question for the individual piece of data to be labeled (each answer is a “label” for a piece of data)
Combine the questions in a survey to administer them together
Optionally create AI agent personas to answer the questions (e.g., if there is relevant expertise or background for the task)
Select language models to generate the answers (for the agents, or without referencing any AI personas)
Run the survey with the data, agents and models to generate a formatted dataset of results
Select questions and data that you want to validate with humans to create a subset of your survey (or leave it unchanged to run the entire survey with humans)
Send a web-based version of the survey to human respondents
Compare LLM and human answers, and iterate on the data labeling survey as needed!
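
The final comparison step can start as a simple agreement rate between the two sets of labels. The sketch below is plain Python rather than EDSL; the helper names `agreement_rate` and `disagreements` are hypothetical, and the labels are placeholders for illustration, not survey output.

```python
def agreement_rate(llm_labels, human_labels):
    """Share of items where the LLM label equals the human label."""
    shared = llm_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no items were labeled by both sources")
    matches = sum(llm_labels[key] == human_labels[key] for key in shared)
    return matches / len(shared)


def disagreements(llm_labels, human_labels):
    """Items labeled by both sources whose labels differ, sorted by item."""
    shared = sorted(llm_labels.keys() & human_labels.keys())
    return [key for key in shared if llm_labels[key] != human_labels[key]]


# Placeholder labels for illustration only (assumption, not real results).
llm_years = {"item_a": 1990, "item_b": 2001, "item_c": 1975}
human_years = {"item_a": 1990, "item_b": 2000, "item_c": 1975}

print(agreement_rate(llm_years, human_years))
print(disagreements(llm_years, human_years))
```

Items that show up in the disagreement list are natural candidates for rewording the question or for another round of human review.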

Before running the code below, please see the instructions on getting started using Expected Parrot tools for AI research.

Construct questions about a dataset

We start by creating questions about a dataset, where each answer provides a “label” for one piece of data. EDSL comes with many common question types that we can choose from based on the form of the response that we want to get back from a model (multiple choice, linear scale, matrix, etc.).

We use a “scenario” placeholder in each question text for the data that we want to add to it. This lets us efficiently readminister the same question for each piece of data. Scenarios can be created from many types of data, including PNG, PDF and CSV files, documents, lists, tables and videos.
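
As a conceptual illustration of what the placeholder does, the sketch below fills one question text once per show using only the Python standard library. It is not EDSL's implementation; the `fill` helper and the regular expression are assumptions made for this example.

```python
import re

TEMPLATE = "Identify the year this show first aired: {{ scenario.show }}"
SHOWS = ["The Simpsons", "South Park", "I Love Lucy"]

# Matches placeholders of the form {{ scenario.<key> }}.
PLACEHOLDER = re.compile(r"\{\{\s*scenario\.(\w+)\s*\}\}")


def fill(template, scenario):
    """Replace each {{ scenario.key }} with the matching scenario value."""
    return PLACEHOLDER.sub(lambda match: str(scenario[match.group(1)]), template)


# One filled-in question text per piece of data.
prompts = [fill(TEMPLATE, {"show": show}) for show in SHOWS]
for prompt in prompts:
    print(prompt)
```

The question text is written once and the data varies, which is why adding more shows requires no change to the question itself.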

We combine the questions in a survey in order to administer them together. Questions are administered asynchronously by default, or according to any logic or rules that we add (e.g., skip/stop rules).

[1]:

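from edsl import ScenarioList, QuestionList, QuestionNumerical, Survey

# Each question text uses {{ scenario.show }} as the placeholder for the show to be labeled.
q1 = QuestionList(
    question_name = "characters",
    question_text = "Name all of the characters in this show: {{ scenario.show }}"
)

q2 = QuestionNumerical(
    question_name = "years",
    question_text = "Identify the year this show first aired: {{ scenario.show }}"
)

# One scenario per show, each with a single key, "show".
scenarios = ScenarioList.from_source("list", "show", ["The Simpsons", "South Park", "I Love Lucy"])

# loop() creates a version of each question for every scenario.
questions = q1.loop(scenarios) + q2.loop(scenarios)

# Combine the questions in a survey to administer them together.
survey = Survey(questions)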

Generate data “labels” using LLMs

EDSL allows us to specify the models that we want to use to answer the questions, and optionally design AI agent personas for the models to reference in answering the questions. This can be useful if you want to reference specific expertise that is relevant to the labeling task.

We administer the questions by adding the scenarios, agents and models to the survey and calling the run() method. This generates a formatted dataset of Results that we can analyze with built-in methods for working with results.

[2]:

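from edsl import Agent, AgentList, Model, ModelList

# Optional AI agent persona for the models to reference when answering.
agents = AgentList([
    Agent(traits = {"persona":"You watch a lot of TV."})
])

# Language models that will generate the answers.
models = ModelList([
    Model("gemini-1.5-flash", service_name = "google"),
    Model("gpt-4o", service_name = "openai")
])

# Add the scenarios, agents and models to the survey, then run it.
results = survey.by(scenarios).by(agents).by(models).run()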

Job status: Completed (6 completed, 0 failed)

Results UUID: 964a459b...a492 (use Results.pull(uuid) to fetch results)

Job UUID: e9fc9270...bbff (use Jobs.pull(uuid) to fetch job)
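
The count of 6 in the job status is consistent with one unit of work per combination of scenario, agent and model: 3 shows × 1 agent × 2 models. The sketch below reproduces that arithmetic in plain Python; treating each combination as one completed item is an assumption about how the job is counted, not something the output states.

```python
from itertools import product

shows = ["The Simpsons", "South Park", "I Love Lucy"]
personas = ["You watch a lot of TV."]
model_names = ["gemini-1.5-flash", "gpt-4o"]

# Every (show, persona, model) combination, assumed to be one unit of work each.
combinations = list(product(shows, personas, model_names))

print(len(combinations))  # 3 * 1 * 2 = 6
```

Under the same assumption, adding a second persona or a third model multiplies the count, which is worth keeping in mind when estimating the cost of a labeling run.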