Cognitive testing & creating new methods

This notebook shows some ways of using EDSL to conduct research, including data labeling, cognitive testing and creating new methods.

Cognitive testing

In this example we use the tools to evaluate some draft survey questions and suggest improvements.

[1]:
from edsl import QuestionFreeText, Agent, ScenarioList, Scenario, Model

Create a relevant persona and assign it to an agent:

[2]:
a = Agent(traits = {"background": "You are an expert in survey methodology and evaluating questionnaires."})

Identify a set of texts for review (these can also be imported):

[3]:
draft_texts = [
    "Do you feel the product is almost always of good quality?",
    "On a scale of 1 to 5, where 1 means strongly agree and 5 means strongly disagree, how satisfied are you with our service?",
    "Do you believe our IT team's collaborative synergy effectively optimizes our digital infrastructure?",
    "What do you think of our recent implementation of Project X57?",
]
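If the texts live in a file rather than inline, they can be read in with the standard library first. A minimal sketch, assuming a plain-text file with one question per line (the filename `draft_questions.txt` is hypothetical):

```python
# Sketch: loading draft questions from a plain-text file, one question per line.
# The filename "draft_questions.txt" below is a hypothetical example.
from pathlib import Path

def load_draft_texts(path):
    """Return the non-empty, stripped lines of a text file."""
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]

# draft_texts = load_draft_texts("draft_questions.txt")
```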

Construct a question about the texts; each text will be inserted into the question individually as a parameter:

[4]:
q = QuestionFreeText(
    question_name = "cognitive_review",
    question_text = """Identify any cognitive issues in the following survey question
    and then draft an improved version of it: {{ draft_text }}""",
)

Create “scenarios” of the question with the texts as parameters:

[5]:
s = ScenarioList.from_list("draft_text", draft_texts)
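Conceptually, each scenario supplies one value for the `{{ draft_text }}` placeholder, so the single question text yields one rendered prompt per scenario. A rough illustration of that substitution (EDSL does the templating itself; this is not its implementation):

```python
# Illustration only: one question template plus several scenarios
# produces one rendered prompt per scenario.
question_text = (
    "Identify any cognitive issues in the following survey question "
    "and then draft an improved version of it: {{ draft_text }}"
)

scenarios = [{"draft_text": t} for t in ["Text A", "Text B"]]

prompts = [
    question_text.replace("{{ draft_text }}", s["draft_text"]) for s in scenarios
]
```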

Check available language models:

[6]:
# Model.available()

Select a language model (if no model is specified, the default model is used):

[7]:
m = Model("gpt-4o")

Administer the survey:

[8]:
results = q.by(s).by(a).by(m).run()

List the components of the results that are generated:

[9]:
results.columns
[9]:
['agent.agent_instruction',
 'agent.agent_name',
 'agent.background',
 'answer.cognitive_review',
 'comment.cognitive_review_comment',
 'generated_tokens.cognitive_review_generated_tokens',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.temperature',
 'model.top_logprobs',
 'model.top_p',
 'prompt.cognitive_review_system_prompt',
 'prompt.cognitive_review_user_prompt',
 'question_options.cognitive_review_question_options',
 'question_text.cognitive_review_question_text',
 'question_type.cognitive_review_question_type',
 'raw_model_response.cognitive_review_cost',
 'raw_model_response.cognitive_review_one_usd_buys',
 'raw_model_response.cognitive_review_raw_model_response',
 'scenario.draft_text']
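The dotted names group fields by their source (`agent.*`, `answer.*`, `model.*`, `scenario.*`, and so on), which is what makes wildcard selections like `select("answer.*")` possible. Picking a group out by prefix is just a string filter; a small sketch over a sample of the columns above:

```python
# A sample of the dotted column names shown above
columns = [
    "agent.background",
    "answer.cognitive_review",
    "model.model",
    "scenario.draft_text",
]

# Keep only the answer fields
answer_columns = [c for c in columns if c.startswith("answer.")]
```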

Print select components of the results:

[10]:
(
    results.select("scenario.*", "answer.*").print(
        pretty_labels={
            "scenario.draft_text": "Draft text",
            "answer.cognitive_review": "Evaluation",
        }
    )
)
Draft text: Do you feel the product is almost always of good quality?
Evaluation: The original question has a few cognitive issues that could affect the respondent's ability to provide accurate and reliable answers:
1. **Ambiguity in Frequency**: The phrase "almost always" is subjective and can be interpreted differently by different respondents. It lacks specificity, leading to varied interpretations.
2. **Vagueness in Quality**: The term "good quality" is also subjective. What constitutes "good" can vary widely among respondents depending on their standards and expectations.
3. **Leading Question**: The question might lead respondents to focus on positive aspects, potentially biasing their response towards a more favorable assessment.
To improve the question, we should aim for clarity, specificity, and neutrality. Here's a revised version: "How would you rate the quality of the product you received most recently?"

Draft text: On a scale of 1 to 5, where 1 means strongly agree and 5 means strongly disagree, how satisfied are you with our service?
Evaluation: The original survey question has a few cognitive issues:
1. **Scale Confusion**: The scale is labeled in a way that might confuse respondents because it mixes agreement with satisfaction. Typically, satisfaction is measured directly (e.g., very satisfied to very dissatisfied) rather than through agreement.
2. **Reverse Scale**: The scale is reversed from the more common format where a lower number indicates a negative response and a higher number indicates a positive response. This can lead to confusion if respondents are used to the opposite format.
3. **Ambiguity in Wording**: The question asks about satisfaction but uses a scale that implies agreement, which might confuse respondents about what exactly they are rating.
Here is an improved version of the question: "On a scale of 1 to 5, where 1 means very dissatisfied and 5 means very satisfied, how satisfied are you with our service?"

Draft text: Do you believe our IT team's collaborative synergy effectively optimizes our digital infrastructure?
Evaluation: The original survey question is complex and uses jargon that may not be easily understood by all respondents. Terms like "collaborative synergy" and "optimizes our digital infrastructure" can be vague and open to interpretation, leading to cognitive issues such as confusion and misinterpretation. Additionally, the question is double-barreled, meaning it asks about multiple concepts (collaborative synergy and optimization) at once, which can result in unclear responses. Here's an improved version of the question: "How effective do you think our IT team is at working together to improve our digital systems?"

Draft text: What do you think of our recent implementation of Project X57?
Evaluation: The original survey question is quite broad and lacks specificity, which can lead to several cognitive issues for respondents:
1. **Ambiguity**: The question doesn't specify what aspect of "Project X57" the respondent should consider (e.g., effectiveness, user experience, outcomes).
2. **Vagueness**: The term "recent implementation" might be unclear to respondents, as they may not know the specific timeline or changes involved.
3. **Open-ended Nature**: While open-ended questions can provide rich data, they can also be challenging for respondents to answer concisely and can lead to varied interpretations.
An improved version of the question could be: "How satisfied are you with the effectiveness of our recent implementation of Project X57 over the past three months?"

Qualitative reviews

In this example we use a set of hypothetical customer service tickets and prompt a model to extract a set of themes that we could use in follow-on questions (e.g., as answer options for multiple choice questions).

[11]:
from edsl import QuestionList
[12]:
tickets = [
    "I waited for 20 minutes past the estimated arrival time, and the driver still hasn't arrived. This made me late for my appointment.",
    "The driver was very rude and had an unpleasant attitude during the entire ride. It was an uncomfortable experience.",
    "The driver was speeding and frequently changing lanes without signaling. I felt unsafe throughout the ride.",
    "The car I rode in was dirty and messy. There were crumbs on the seats, and it didn't look like it had been cleaned in a while.",
    "The driver took a longer route, which resulted in a significantly higher fare than expected. I believe they intentionally extended the trip.",
    "I was charged for a ride that I did not take. The ride appears on my account, but I was not in the vehicle at that time.",
    "I left my wallet in the car during my last ride. I've tried contacting the driver, but I haven't received a response.",
]

Create an agent with a relevant persona:

[13]:
a_customer_service = Agent(
    traits = {
        "background": "You are an experienced customer service agent for a ridesharing company."
    }
)

Create a question about the texts:

[14]:
q_topics = QuestionList(
    question_name = "ticket_topics",
    question_text = "Create a list of the topics raised in these customer service tickets: {{ tickets_texts }}.",
)
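A `QuestionList` answer comes back as a Python list. If you ever post-process raw model text yourself, a JSON-formatted list answer can be parsed with the standard library — a sketch of that fallback, not EDSL's own parsing (the `raw_answer` string is a hypothetical model output):

```python
import json

# Hypothetical raw model output for a list-valued answer
raw_answer = '["Delayed arrival", "Rude driver", "Lost item"]'

topics = json.loads(raw_answer)
```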

Add the texts to the question:

[15]:
s = Scenario({"tickets_texts": tickets})

Generate results:

[16]:
topics = q_topics.by(s).by(a_customer_service).by(m).run()

Inspect the results:

[17]:
topics.select("ticket_topics").to_list()[0]
[17]:
['Delayed arrival',
 'Rude driver',
 'Unsafe driving',
 'Dirty vehicle',
 'Longer route taken',
 'Incorrect charge',
 'Lost item']
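These extracted topics can feed a follow-on question directly, e.g., as the `question_options` of a multiple choice question. A sketch of preparing such an option list — deduplicating while preserving order and appending a catch-all choice (the "Other" label is our own addition):

```python
# Hypothetical extracted topics, possibly with repeats
topics = ["Delayed arrival", "Rude driver", "Rude driver", "Lost item"]

# Deduplicate while preserving order, then append a catch-all option
seen = set()
options = [t for t in topics if not (t in seen or seen.add(t))]
options.append("Other")
```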

Data labeling

In this example we prompt an LLM to rate the seriousness of tickets about safety issues.

See this notebook as well for a more complex data labeling exercise: Data Labeling Agents.

[18]:
from edsl import QuestionLinearScale
[19]:
safety_tickets = [
    "During my ride, I noticed that the driver was frequently checking their phone for directions, which made me a bit uncomfortable. It didn't feel like they were fully focused on the road.",
    "The driver had to brake abruptly to avoid a collision with another vehicle. It was a close call, and it left me feeling quite shaken. Please address this issue.",
    "I had a ride with a driver who was clearly speeding and weaving in and out of traffic. Their reckless driving put my safety at risk, and I'm very concerned about it.",
    "My ride was involved in a minor accident, and although no one was seriously injured, it was a scary experience. The driver is handling the situation, but I wanted to report it.",
    "I had a ride with a driver who exhibited aggressive and threatening behavior towards me during the trip. I felt genuinely unsafe and want this matter to be taken seriously.",
]
[20]:
q_rating = QuestionLinearScale(
    question_name = "safety_rating",
    question_text = """Rate the seriousness of the issue raised in the following customer service ticket
    on a scale from 1 to 10: {{ ticket }}""",
    question_options = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    option_labels = {1:"Not at all serious", 10:"Very serious"}
)
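Note that only the scale endpoints need entries in `option_labels`; intermediate points stay unlabeled. A sketch of the constraint a linear-scale answer must satisfy (our own illustration, not EDSL's validator):

```python
# Illustration (not EDSL's validator): a linear-scale answer must be
# one of the integer scale points; only the endpoints carry labels.
question_options = list(range(1, 11))
option_labels = {1: "Not at all serious", 10: "Very serious"}

def is_valid_answer(answer):
    return answer in question_options
```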
[21]:
s = ScenarioList.from_list("ticket", safety_tickets)
[22]:
r_rating = q_rating.by(s).by(a_customer_service).by(m).run()
[23]:
r_rating.select("scenario.*", "answer.*").print()
Ticket: I had a ride with a driver who was clearly speeding and weaving in and out of traffic. Their reckless driving put my safety at risk, and I'm very concerned about it.
Safety rating: 10

Ticket: I had a ride with a driver who exhibited aggressive and threatening behavior towards me during the trip. I felt genuinely unsafe and want this matter to be taken seriously.
Safety rating: 10

Ticket: My ride was involved in a minor accident, and although no one was seriously injured, it was a scary experience. The driver is handling the situation, but I wanted to report it.
Safety rating: 7

Ticket: During my ride, I noticed that the driver was frequently checking their phone for directions, which made me a bit uncomfortable. It didn't feel like they were fully focused on the road.
Safety rating: 7

Ticket: The driver had to brake abruptly to avoid a collision with another vehicle. It was a close call, and it left me feeling quite shaken. Please address this issue.
Safety rating: 8
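Ratings like these are easy to post-process for triage, for example by flagging tickets at or above a cutoff for escalation. A sketch over (ticket, rating) pairs; the threshold of 8 is an arbitrary choice for illustration:

```python
# Hypothetical (ticket summary, model rating) pairs
rated_tickets = [
    ("Reckless speeding and weaving", 10),
    ("Abrupt braking near-collision", 8),
    ("Driver checking phone", 7),
]

ESCALATION_THRESHOLD = 8  # arbitrary cutoff for illustration

escalate = [text for text, rating in rated_tickets if rating >= ESCALATION_THRESHOLD]
```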

Creating new methods

We can use the question prompts to create new methods, such as a translator:

[24]:
def translate_to_german(text):
    q = QuestionFreeText(
        question_name="deutsch",
        question_text="Please translate '{{ text }}' into German",
    )
    result = q.by(Scenario({"text": text})).run()
    return result.select("deutsch").print()
[25]:
translate_to_german("Hello, friend, have you been traveling?")
answer.deutsch
The translation of "Hello, friend, have you been traveling?" into German is "Hallo, Freund, bist du gereist?"
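The same pattern generalizes by parameterizing the target language. A sketch of building the `question_text` template for an arbitrary language (plain string construction; `translation_prompt` is our own helper, not part of EDSL):

```python
def translation_prompt(language):
    """Build a question_text template asking for translation into `language`."""
    # Doubled braces in the f-string produce the literal {{ text }} placeholder
    return f"Please translate '{{{{ text }}}}' into {language}"
```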