Data cleaning

This notebook provides sample EDSL code for automating a data cleaning task. In a series of steps we show how to use EDSL to automatically suggest appropriate sense checks for a given dataset, convert those sense checks into survey questions, and then prompt a model to answer them, producing a new dataset of the values that fail the checks.

EDSL is an open-source library for simulating surveys and experiments with AI agents and large language models. Please see our documentation page for tips and tutorials on getting started.

Example data

We’ll use some observations of ages as our dataset for cleaning. It’s a list of random ages between 22 and 85 with some bad values mixed in:

[1]:
ages = [84, 62, 79, 57, 59, 55, 68, 66, 47, 54, 76, 33, 74, 56, 47, 24, 23, 38, 38, 54, 51, 84, 71, 46, 38, 26, 50, 56, 62, 39, 31, 52, 69, 84, 69, 48, 48, 23, 65, 54, 78, 51, 69, 77, 75, 76, 26, 44, 61, 32, 70, 24, 74, 22, 32, 24, 80, 65, 36, 42, 84, 66, 40, 85, 28, 22, 67, 25, 70, 77, 53, 69, 64, 27, 61, 68, 68, 78, 0.99, 83, 58, 33, 46, 43, 50, 85, 28, 82, 50, 61, 66, 32, 45, 70, 56, 50, 43, 30, 43, 55, 33, 72, 43, 43, -5, 32, 43, 45, 67, 84, 37, 63, 52, 53, 58, 79, 79, 80, 62, 75, 57, 60, 39, 79, 49, 60, 60, 37, 45, 36, 1050, 73, 70, 56, 39, 58, 69, 77, 68, 84, 78, 48, 31, 74, 27, 55, 56, 66, 35, 39, 57, 47, 29, 24, 47, 60, 43, 37, 84, 64, 28, 22, 37, 71, 77, 76, 84, 63, 76, 58, 41, 72, 22, 63, 78, 49, 82, 69, "old", 37, 27, 29, 54, 83, 80, 74, 48, 76, 49, 26, 38, 35, 36, 25, 23, 71, 33, 39, 40, 35, 85, 24, 57, 85, 63, 53, 62, 47, 69, 76, 71, 48, 62, 23, 25, 84, 32, 63, 75, 31, 25, 50, 85, 36, 58, 85, 34, 62, 43, 2, 50, 83, 44, 73, 81, 44, 43, 82, 84, 30, 24, 63, 63, 59, 46, 30, 62, 25, 52, 23]

Constructing questions about the data

We start by creating a question that prompts a model to draft some sense check questions for our data. EDSL comes with a variety of question types that we can choose from based on the desired form of the response (multiple choice, free text, etc.). Here we select QuestionList in order to prompt the model to return a list of sense check question texts. Note that we specifically instruct the model to include a placeholder for a piece of data in each question text so that we can later run the questions against every piece of data at once:

[2]:
import random

data_description = (
    "a list of ages (in years) of adult participants in a social science experiment"
)
sample_data = random.sample(ages, 5)
number_of_questions = 4
[3]:
from edsl.questions import QuestionList

q1 = QuestionList(
    question_name="data_cleaning_questions",
    question_text=f"""Consider a dataset consisting of {data_description}.
    Here is a sample of the data: {sample_data}.
    Draft a set of appropriate sense checks for the data in this dataset,
    formatted as a list of questions to be answered about each piece of data
    individually and without reference to other data, using '<data>' as a
    placeholder for the piece of data in each question text.""",
    max_list_items=number_of_questions,
)

We generate a response by calling the run() method on the question. This produces a dataset of Results that we can begin analyzing:

EDSL comes with built-in methods for analyzing results as datasets, dataframes, JSON and other formats. We can also inspect a list of all the components of the results that are directly accessible, as sketched after the run() call below.

Here we select just the answer (the draft data cleaning question texts) and print it in a table:

[4]:
results = q1.run()
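
The full list of result components that can be selected is available via results.columns (also used later in this notebook); a minimal sketch, with the output omitted since the exact component names depend on the EDSL version:

results.columns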
[5]:
results.select("data_cleaning_questions").print(format="rich")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ answer                                                                                                          ┃
┃ .data_cleaning_questions                                                                                        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ['Is <data> a positive number?', 'Is <data> a whole number?', 'Is <data> within the typical human lifespan      │
│ (less than 122 years)?', 'Is <data> greater than or equal to 18?']                                              │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Parameterizing questions

Next we use the data cleaning question texts as inputs to a new question that prompts the model to draft a list of appropriate answer options for each one. This is done by creating a Scenario for each data cleaning question and adding the scenarios to the new question when we run it (learn more about using scenarios):

[6]:
data_cleaning_questions = results.select("data_cleaning_questions").to_list()[0]
data_cleaning_questions
[6]:
['Is <data> a positive number?',
 'Is <data> a whole number?',
 'Is <data> within the typical human lifespan (less than 122 years)?',
 'Is <data> greater than or equal to 18?']
[7]:
from edsl.questions import QuestionList

q2 = QuestionList(
    question_name="data_cleaning_options",
    question_text="Consider a dataset consisting of "
    + data_description
    + "\nHere is a sample of the data: "
    + str(sample_data)
    + """Draft a list of appropriate answer options for the following sense check
    question about this data: {{ data_cleaning_question }}.
    Include 'Yes', 'No', and any other options that could add helpful context
    to flag a piece of data that is potentially problematic.""",
)
[8]:
from edsl import Scenario

scenarios = [Scenario({"data_cleaning_question": q}) for q in data_cleaning_questions]
scenarios
[8]:
[Scenario({'data_cleaning_question': 'Is <data> a positive number?'}),
 Scenario({'data_cleaning_question': 'Is <data> a whole number?'}),
 Scenario({'data_cleaning_question': 'Is <data> within the typical human lifespan (less than 122 years)?'}),
 Scenario({'data_cleaning_question': 'Is <data> greater than or equal to 18?'})]
[9]:
results = q2.by(scenarios).run()
[10]:
results.select("data_cleaning_question", "data_cleaning_options").print(format="rich")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ scenario                                                answer                                                 ┃
┃ .data_cleaning_question                                 .data_cleaning_options                                 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Is <data> a positive number?                            ['Yes', 'No', 'The value is not a number', 'The value  │
│                                                         is negative', 'The value is zero']                     │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ Is <data> within the typical human lifespan (less than  ['Yes', 'No', 'Data is not a number', 'Data is         │
│ 122 years)?                                             negative', 'Data exceeds maximum verified human        │
│                                                         lifespan']                                             │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ Is <data> greater than or equal to 18?                  ['Yes', 'No', 'Data is missing or non-numeric', 'Data  │
│                                                         is an outlier or unusual value']                       │
├────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┤
│ Is <data> a whole number?                               ['Yes', 'No', 'The data is not a number', 'The data is │
│                                                         a negative number', 'The data includes a decimal or    │
│                                                         fraction']                                             │
└────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────┘

Here we store the data cleaning questions and their answer options in a scenario list, renaming the options field to question_options so it can be reused:

[11]:
questions_data = (
    results.select("data_cleaning_question", "data_cleaning_options")
    .to_scenario_list()
    .rename({"data_cleaning_options": "question_options"})
)
[13]:
questions_data
[13]:
{
    "scenarios": [
        {
            "data_cleaning_question": "Is <data> a positive number?",
            "question_options": [
                "Yes",
                "No",
                "The value is not a number",
                "The value is negative",
                "The value is zero"
            ]
        },
        {
            "data_cleaning_question": "Is <data> within the typical human lifespan (less than 122 years)?",
            "question_options": [
                "Yes",
                "No",
                "Data is not a number",
                "Data is negative",
                "Data exceeds maximum verified human lifespan"
            ]
        },
        {
            "data_cleaning_question": "Is <data> greater than or equal to 18?",
            "question_options": [
                "Yes",
                "No",
                "Data is missing or non-numeric",
                "Data is an outlier or unusual value"
            ]
        },
        {
            "data_cleaning_question": "Is <data> a whole number?",
            "question_options": [
                "Yes",
                "No",
                "The data is not a number",
                "The data is a negative number",
                "The data includes a decimal or fraction"
            ]
        }
    ]
}

We can post the scenario list to the Coop, a platform for storing and sharing EDSL content:

[17]:
questions_data.push()
[17]:
{'description': None,
 'object_type': 'scenario_list',
 'url': 'https://www.expectedparrot.com/content/11a5e588-a116-40e1-98d8-ea33b4ca7c5e',
 'uuid': '11a5e588-a116-40e1-98d8-ea33b4ca7c5e',
 'version': '0.1.33.dev1',
 'visibility': 'unlisted'}

Constructing a data cleaning survey

Next we can make any desired edits to the data cleaning questions (a hypothetical edit is sketched below), and then use them to create a Survey about the data that we will prompt the model to answer (learn more about constructing surveys). We use QuestionMultipleChoice so that the model must choose from the answer options that were drafted above:

[ ]:
results.select("data_cleaning_question", "data_cleaning_options").print(format="rich")
[ ]:
import ast

df = results.to_pandas(remove_prefix=True)
df = df[["data_cleaning_question", "data_cleaning_options"]]
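
The questions and options are now in a regular pandas DataFrame, so any desired edits can be made directly before the survey is constructed. A minimal sketch of a purely hypothetical edit (the new wording is illustrative, not generated by the model):

df.loc[0, "data_cleaning_question"] = "Is <data> a number greater than zero?"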
[ ]:
from edsl import QuestionMultipleChoice

formatted_questions = []

for index, row in df.iterrows():
    q = QuestionMultipleChoice(
        question_name="q" + str(index),
        question_text=row["data_cleaning_question"].replace("<data>", "{{ age }}"),
        # to_pandas() stores the options as a string representation of a list, so parse it back
        question_options=ast.literal_eval(row["data_cleaning_options"]),
    )
    formatted_questions.append(q)

formatted_questions
[ ]:
from edsl import Survey

survey = Survey(formatted_questions)

We create Scenario objects for the individual ages that we will insert in the question texts when we run the survey:

[ ]:
scenarios = [Scenario({"age": age}) for age in ages]
[ ]:
results = survey.by(scenarios).run()

Here we inspect all the components of the new results, print a table of selected components (optionally setting max_rows), and then print a table of the results filtered to the cases flagged for review, based on the responses to the questions (see the filter logic in the final cell below):

[ ]:
results.columns
[ ]:
(
    results.select("age", "q0", "q1", "q2", "q3").print(
        pretty_labels={
            "scenario.age": "Age",
            "answer.q0": data_cleaning_questions[0],
            "answer.q1": data_cleaning_questions[1],
            "answer.q2": data_cleaning_questions[2],
            "answer.q3": data_cleaning_questions[3],
        },
        format="rich",
        max_rows=10,
    )
)
[ ]:
(
    results.filter("q0 != 'Yes' or q1 != 'Yes' or q2 != 'Yes' or q3 != 'Yes'")
    .select("age")
    .print(pretty_labels={"scenario.age": "Bad data"}, format="rich")
)
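
If we want to work with the flagged values outside of EDSL (for example, to drop or correct them in the original list), we can export them with the to_list() method used earlier. A minimal sketch reusing the same filter logic (variable names are illustrative):

# Values that failed at least one sense check
flagged_ages = (
    results.filter("q0 != 'Yes' or q1 != 'Yes' or q2 != 'Yes' or q3 != 'Yes'")
    .select("age")
    .to_list()
)

# Keep only the values that passed every check
cleaned_ages = [a for a in ages if a not in flagged_ages]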

Further exploration

This notebook can be readily edited and expanded for other data cleaning and data labeling purposes, or to add personas for AI agents answering the questions with relevant background and expertise. Learn more about using AI agents for your EDSL surveys.
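
For example, a minimal sketch of adding an agent persona to the survey above (the persona text is illustrative):

from edsl import Agent

# An agent with a persona trait that is passed to the model with each question
agent = Agent(traits={"persona": "You are a data analyst who carefully reviews survey data for quality issues."})

results = survey.by(scenarios).by(agent).run()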

Please see our documentation page for examples of other methods and use cases and let us know if you have any questions!