Cognitive testing & creating new methods

This notebook shows some ways of using EDSL to conduct research, including data labeling, cognitive testing and creating new methods.

Cognitive testing

In this example we use the tools to evaluate some draft survey questions and suggest improvements.

[1]:
from edsl import QuestionFreeText, Agent, ScenarioList, Scenario, Model

Create a relevant persona and assign it to an agent:

[2]:
agent = Agent(traits={"background": "You are an expert in survey methodology and evaluating questionnaires."})

Identify a set of texts for review (these can also be imported):

[3]:
draft_texts = [
    "Do you feel the product is almost always of good quality?",
    "On a scale of 1 to 5, where 1 means strongly agree and 5 means strongly disagree, how satisfied are you with our service?",
    "Do you believe our IT team's collaborative synergy effectively optimizes our digital infrastructure?",
    "What do you think of our recent implementation of Project X57?",
]
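The note above says the texts can also be imported. A minimal sketch of one way to do that, assuming the drafts are stored in a CSV file with one question per row. The `load_texts` helper and the `draft_text` column name are hypothetical, and an in-memory file stands in for a real one:

```python
import csv
import io


def load_texts(file_obj, column="draft_text"):
    """Return the non-empty values of one column from a CSV file object."""
    texts = []
    for row in csv.DictReader(file_obj):
        value = (row.get(column) or "").strip()
        if value:
            texts.append(value)
    return texts


# The StringIO object stands in for a real file handle,
# e.g. open("drafts.csv", newline="") with a hypothetical file name.
sample_csv = io.StringIO(
    "draft_text\n"
    '"Do you feel the product is almost always of good quality?"\n'
    '"What do you think of our recent implementation of Project X57?"\n'
)
imported_texts = load_texts(sample_csv)
```

The resulting list has the same shape as `draft_texts`, so it could be used in its place in the cells that follow.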

Construct a question about the texts. Each text will be inserted into the question individually through the {{ draft_text }} parameter:

[4]:
question = QuestionFreeText(
    question_name="review_questions",
    question_text="""Consider the following survey question: {{ draft_text }}
    Identify the problematic phrases in the excerpt and suggest a revised version of it.""",
)

Create “scenarios” of the question with the texts as parameters:

[5]:
scenarios = ScenarioList(
    Scenario({"draft_text": text}) for text in draft_texts
)
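To make the role of the `{{ draft_text }}` placeholder concrete, here is a standard-library sketch of what parameterizing a question amounts to: one prompt per scenario, with the placeholder replaced by that scenario's value. This is an illustration only, not EDSL's own rendering code, and the `render` helper is hypothetical:

```python
import re

example_texts = [
    "Do you feel the product is almost always of good quality?",
    "What do you think of our recent implementation of Project X57?",
]
template = "Consider the following survey question: {{ draft_text }}"


def render(template, values):
    """Replace each {{ name }} placeholder with the matching value."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(values[match.group(1)]),
        template,
    )


# One rendered prompt per scenario.
prompts = [render(template, {"draft_text": text}) for text in example_texts]
```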

Check available language models:

[6]:
# Model.available()

Select a language model (if no model is specified, GPT 4 preview is used by default):

[7]:
model = Model("gpt-4o")

Administer the survey:

[8]:
results = question.by(scenarios).by(agent).by(model).run()
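A rough way to reason about how many responses to expect from a chained call like this one, assuming each combination of scenario, agent and model is administered once (the labels below are placeholders, not EDSL objects):

```python
from itertools import product

scenario_labels = ["draft 1", "draft 2", "draft 3", "draft 4"]
agent_labels = ["survey methodology expert"]
model_labels = ["gpt-4o"]

# Every (scenario, agent, model) combination corresponds to one response.
combinations = list(product(scenario_labels, agent_labels, model_labels))
expected_responses = len(combinations)
```

With four scenarios, one agent and one model this gives four responses. Adding a second agent or model would double that under the same assumption.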

List the components of the results that are generated:

[9]:
results.columns
[9]:
['agent.agent_instruction',
 'agent.agent_name',
 'agent.background',
 'answer.review_questions',
 'comment.review_questions_comment',
 'generated_tokens.review_questions_generated_tokens',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.temperature',
 'model.top_logprobs',
 'model.top_p',
 'prompt.review_questions_system_prompt',
 'prompt.review_questions_user_prompt',
 'question_options.review_questions_question_options',
 'question_text.review_questions_question_text',
 'question_type.review_questions_question_type',
 'raw_model_response.review_questions_cost',
 'raw_model_response.review_questions_one_usd_buys',
 'raw_model_response.review_questions_raw_model_response',
 'scenario.draft_text']
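The column names follow a `prefix.name` convention, which is what makes wildcard patterns such as `scenario.*` and `answer.*` useful when selecting results. The sketch below mimics that kind of pattern matching with the standard library on a few of the columns. It illustrates the naming convention only and is not EDSL's selection logic; the `match_columns` helper is hypothetical:

```python
from fnmatch import fnmatchcase

example_columns = [
    "agent.background",
    "answer.review_questions",
    "comment.review_questions_comment",
    "model.model",
    "scenario.draft_text",
]


def match_columns(columns, *patterns):
    """Return the columns that match at least one wildcard pattern."""
    return [
        column
        for column in columns
        if any(fnmatchcase(column, pattern) for pattern in patterns)
    ]


selected = match_columns(example_columns, "scenario.*", "answer.*")
```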

Print select components of the results:

[10]:
(
    results.select("scenario.*", "answer.*").print(
        pretty_labels={
            "scenario.draft_text": "Draft text",
            "answer.review_questions": "Evaluation",
        }
    )
)
Draft text | Evaluation
Do you believe our IT team's collaborative synergy effectively optimizes our digital infrastructure? | The survey question contains several problematic phrases, such as "collaborative synergy" and "effectively optimizes," which are jargon-heavy and may confuse respondents. Additionally, the question is somewhat leading and complex. Here's a revised version: "Do you think our IT team works well together to improve our digital infrastructure?"
What do you think of our recent implementation of Project X57? | The survey question "What do you think of our recent implementation of Project X57?" has a few problematic areas: 1. **Ambiguity**: The phrase "What do you think" is very broad and can lead to varied interpretations, making it difficult to analyze responses consistently. 2. **Lack of Specificity**: The question does not specify which aspects of the implementation should be evaluated (e.g., effectiveness, communication, user experience). 3. **Assumption of Awareness**: It assumes that all respondents are aware of Project X57 and its implementation details. A revised version could be: "Please rate your satisfaction with the following aspects of our recent implementation of Project X57: - Communication about the project - Ease of use - Overall effectiveness - Support provided during the implementation process Use a scale from 1 (Very Dissatisfied) to 5 (Very Satisfied)."
On a scale of 1 to 5, where 1 means strongly agree and 5 means strongly disagree, how satisfied are you with our service? | The survey question you provided has a few issues: 1. The scale description is confusing because it inverts the typical order where 1 usually means "strongly disagree" and 5 means "strongly agree." 2. The scale labels "strongly agree" and "strongly disagree" are more appropriate for statements rather than satisfaction levels. 3. The question mixes agreement with satisfaction, which can be confusing for respondents. Here is a revised version of the question: **"On a scale of 1 to 5, where 1 means very dissatisfied and 5 means very satisfied, how satisfied are you with our service?"**
Do you feel the product is almost always of good quality? | The phrase "almost always" in the question is problematic because it introduces ambiguity and can be interpreted differently by different respondents. Additionally, "good quality" is somewhat subjective and can vary based on individual standards. A revised version of the question could be: "How would you rate the quality of the product?" To provide more clarity and gather more precise data, you could also use a Likert scale: "How would you rate the quality of the product? - Very poor - Poor - Fair - Good

Qualitative reviews

In this example we use a set of hypothetical customer service tickets and prompt a model to extract a set of themes that we could use in follow-on questions (e.g., as the options for multiple choice questions).

[11]:
from edsl import QuestionList
[12]:
tickets = [
    "I waited for 20 minutes past the estimated arrival time, and the driver still hasn't arrived. This made me late for my appointment.",
    "The driver was very rude and had an unpleasant attitude during the entire ride. It was an uncomfortable experience.",
    "The driver was speeding and frequently changing lanes without signaling. I felt unsafe throughout the ride.",
    "The car I rode in was dirty and messy. There were crumbs on the seats, and it didn't look like it had been cleaned in a while.",
    "The driver took a longer route, which resulted in a significantly higher fare than expected. I believe they intentionally extended the trip.",
    "I was charged for a ride that I did not take. The ride appears on my account, but I was not in the vehicle at that time.",
    "I left my wallet in the car during my last ride. I've tried contacting the driver, but I haven't received a response.",
]

Create an agent with a relevant persona:

[13]:
a_customer_service = Agent(
    traits={
        "background": "You are an experienced customer service agent for a ridesharing company."
    }
)

Create a question about the texts:

[14]:
q_topics = QuestionList(
    question_name="ticket_topics",
    question_text="Create a list of the topics raised in these customer service tickets: {{ tickets_texts }}.",
)

Add the texts to the question:

[15]:
scenario = Scenario({"tickets_texts": "; ".join(tickets)})

Generate results:

[16]:
topics = q_topics.by(scenario).by(a_customer_service).by(model).run()

Inspect the results:

[17]:
topics.select("ticket_topics").to_list()[0]
[17]:
['Late Arrival',
 'Rude Driver',
 'Unsafe Driving',
 'Dirty Vehicle',
 'Overcharged Fare',
 'Incorrect Charge',
 'Lost Item']
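As a sketch of the follow-on use mentioned at the start of this section, the extracted topics can be turned into the options for a multiple choice question. The snippet below only assembles the arguments as plain Python. The question name, the question wording, and the added "Other" option are assumptions for illustration:

```python
extracted_topics = [
    "Late Arrival",
    "Rude Driver",
    "Unsafe Driving",
    "Dirty Vehicle",
    "Overcharged Fare",
    "Incorrect Charge",
    "Lost Item",
]

# Drop any duplicates while preserving order, then add a catch-all option.
options = list(dict.fromkeys(extracted_topics)) + ["Other"]

follow_up_args = {
    "question_name": "primary_topic",  # hypothetical name
    "question_text": "Which topic best describes this customer service ticket: {{ ticket }}",
    "question_options": options,
}
# With EDSL installed, these arguments could be passed to a multiple choice
# question, e.g. QuestionMultipleChoice(**follow_up_args); check the EDSL
# documentation for the exact parameters before relying on this.
```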

Data labeling

In this example we prompt an LLM to rate the seriousness of tickets about safety issues.

For a more complex data labeling exercise, see this notebook as well: Data Labeling Agents.

[18]:
from edsl import QuestionLinearScale
[19]:
safety_tickets = [
    "During my ride, I noticed that the driver was frequently checking their phone for directions, which made me a bit uncomfortable. It didn't feel like they were fully focused on the road.",
    "The driver had to brake abruptly to avoid a collision with another vehicle. It was a close call, and it left me feeling quite shaken. Please address this issue.",
    "I had a ride with a driver who was clearly speeding and weaving in and out of traffic. Their reckless driving put my safety at risk, and I'm very concerned about it.",
    "My ride was involved in a minor accident, and although no one was seriously injured, it was a scary experience. The driver is handling the situation, but I wanted to report it.",
    "I had a ride with a driver who exhibited aggressive and threatening behavior towards me during the trip. I felt genuinely unsafe and want this matter to be taken seriously.",
]
[20]:
q_rating = QuestionLinearScale(
    question_name="safety_rating",
    question_text="""On a scale from 0-10 rate the seriousness of the issue raised in this customer service ticket
    (0 = Not serious, 10 = Extremely serious): {{ ticket }}""",
    question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
)
[21]:
scenarios = ScenarioList(
    Scenario({"ticket": safety_ticket}) for safety_ticket in safety_tickets
)
[22]:
r_rating = q_rating.by(scenarios).by(a_customer_service).by(model).run()
[23]:
r_rating.select("scenario.*", "answer.*").print()
scenario.ticket | answer.safety_rating
I had a ride with a driver who was clearly speeding and weaving in and out of traffic. Their reckless driving put my safety at risk, and I'm very concerned about it. | 9
During my ride, I noticed that the driver was frequently checking their phone for directions, which made me a bit uncomfortable. It didn't feel like they were fully focused on the road. | 7
The driver had to brake abruptly to avoid a collision with another vehicle. It was a close call, and it left me feeling quite shaken. Please address this issue. | 7
I had a ride with a driver who exhibited aggressive and threatening behavior towards me during the trip. I felt genuinely unsafe and want this matter to be taken seriously. | 10
My ride was involved in a minor accident, and although no one was seriously injured, it was a scary experience. The driver is handling the situation, but I wanted to report it. | 7
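Once the labels exist, they can be handled like any other data. The sketch below ranks the rated tickets and flags the most serious ones for follow-up. The ticket labels are shortened versions of the tickets above, the ratings are copied from this run (a different run or model may rate differently), and the escalation threshold of 9 is a hypothetical choice:

```python
from statistics import mean

rated_tickets = [
    ("Speeding and weaving in traffic", 9),
    ("Checking phone for directions", 7),
    ("Abrupt braking to avoid a collision", 7),
    ("Aggressive and threatening behavior", 10),
    ("Minor accident", 7),
]

ESCALATION_THRESHOLD = 9  # hypothetical cut-off for urgent follow-up

# Most serious first; ties keep their original order.
ranked = sorted(rated_tickets, key=lambda item: item[1], reverse=True)
to_escalate = [label for label, rating in ranked if rating >= ESCALATION_THRESHOLD]
average_rating = mean(rating for _, rating in rated_tickets)
```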

Creating new methods

We can use the question prompts to create new methods, such as a translator:

[24]:
def translate_to_german(text):
    q = QuestionFreeText(
        question_name="deutsch",
        question_text="Please translate '{{ text }}' into German",
    )
    result = q.by(Scenario({"text": text})).run()
    return result.select("deutsch").print()
[25]:
translate_to_german("Hello, friend, have you been traveling?")
answer.deutsch
Sure, the translation of 'Hello, friend, have you been traveling?' into German is:
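The same pattern extends to other target languages. The sketch below separates building the question arguments, which is plain Python, from running the question, which needs EDSL and a configured model. The helper names are hypothetical, and `translate` returns the answer as a string via `select(...).to_list()[0]`, the call used in the qualitative reviews example, rather than printing it:

```python
def build_translation_args(language):
    """Build question arguments for translating into the given language."""
    name = "translate_" + language.lower().replace(" ", "_")
    return {
        "question_name": name,
        # Doubled braces keep the literal {{ text }} placeholder in the f-string.
        "question_text": f"Please translate '{{{{ text }}}}' into {language}",
    }


def translate(text, language):
    """Translate text with EDSL; requires EDSL and model access to run."""
    # Imported here so the argument builder above works without EDSL installed.
    from edsl import QuestionFreeText, Scenario

    args = build_translation_args(language)
    question = QuestionFreeText(**args)
    results = question.by(Scenario({"text": text})).run()
    return results.select(args["question_name"]).to_list()[0]


french_args = build_translation_args("French")
```

A call such as `translate("Hello, friend, have you been traveling?", "French")` would then return the model's answer for further processing.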