Cognitive testing & creating new methods
This notebook shows some ways of using EDSL to conduct research, including data labeling, cognitive testing and creating new methods.
Cognitive testing
In this example we use the tools to evaluate some draft survey questions and suggest improvements.
[1]:
from edsl import QuestionFreeText, Agent, ScenarioList, Scenario, Model
Create a relevant persona and assign it to an agent:
[2]:
a = Agent(traits = {"background": "You are an expert in survey methodology and evaluating questionnaires."})
Identify a set of texts for review (these can also be imported):
[3]:
draft_texts = [
"Do you feel the product is almost always of good quality?",
"On a scale of 1 to 5, where 1 means strongly agree and 5 means strongly disagree, how satisfied are you with our service?",
"Do you believe our IT team's collaborative synergy effectively optimizes our digital infrastructure?",
"What do you think of our recent implementation of Project X57?",
]
Construct a question about the texts; each text will be inserted into the question as a parameter:
[4]:
q = QuestionFreeText(
question_name = "cognitive_review",
question_text = """Identify any cognitive issues in the following survey question
and then draft an improved version of it: {{ draft_text }}""",
)
Create “scenarios” of the question with the texts as parameters:
[5]:
s = ScenarioList.from_list("draft_text", draft_texts)
Check the available language models (uncomment the line below to see the list):
[6]:
# Model.available()
Select a language model (if no model is specified, the default model is used):
[7]:
m = Model("gemini-1.5-flash")
Run the question with the scenarios, agent and model:
[8]:
results = q.by(s).by(a).by(m).run()
Job UUID | 80119162-96b9-4f7b-a6d9-a1bfc9d45e57 |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/80119162-96b9-4f7b-a6d9-a1bfc9d45e57 |
Error Report URL | None |
Results UUID | e01f3aaa-a829-493a-932e-dcd87a915a60 |
Results URL | None |
List the components of the results that are generated:
[9]:
results.columns
[9]:
0 | |
---|---|
0 | agent.agent_instruction |
1 | agent.agent_name |
2 | agent.background |
3 | answer.cognitive_review |
4 | comment.cognitive_review_comment |
5 | generated_tokens.cognitive_review_generated_tokens |
6 | iteration.iteration |
7 | model.maxOutputTokens |
8 | model.model |
9 | model.stopSequences |
10 | model.temperature |
11 | model.topK |
12 | model.topP |
13 | prompt.cognitive_review_system_prompt |
14 | prompt.cognitive_review_user_prompt |
15 | question_options.cognitive_review_question_options |
16 | question_text.cognitive_review_question_text |
17 | question_type.cognitive_review_question_type |
18 | raw_model_response.cognitive_review_cost |
19 | raw_model_response.cognitive_review_one_usd_buys |
20 | raw_model_response.cognitive_review_raw_model_response |
21 | scenario.draft_text |
Print selected components of the results:
[10]:
(
results.select("scenario.*", "answer.*").print(
pretty_labels={
"scenario.draft_text": "Draft text",
"answer.cognitive_review": "Evaluation",
}
)
)
[10]:
Draft text | Evaluation | |
---|---|---|
0 | Do you feel the product is almost always of good quality? | Okay, let's look at that question. "Do you feel the product is *almost always* of good quality?" Hmm, there are a couple of cognitive issues here. First, "almost always" is a bit vague. It's a subjective term, and what constitutes "almost always" can vary wildly from person to person. One respondent might interpret it as 95% of the time, while another might think it means 70%. This lack of precision introduces response bias; we're not getting consistent, comparable data. Secondly, the question is leading. It presupposes that the product *is* generally of good quality, nudging respondents towards a positive answer. A more neutral phrasing would be better. Here's how I'd improve it: **Improved Version:** "How would you rate the quality of the product you recently purchased?" (Followed by a rating scale, e.g., 1-5 stars, or a scale from "Very Poor" to "Excellent"). |
1 | On a scale of 1 to 5, where 1 means strongly agree and 5 means strongly disagree, how satisfied are you with our service? | Okay, let's take a look at that question. There's a subtle but important cognitive issue lurking there. The problem lies in the scale's direction. It's using what's called a reverse-scored scale. While not inherently *wrong*, it's cognitively taxing for respondents. People naturally associate higher numbers with more positive things. By using "1 = strongly agree" and "5 = strongly disagree", you're forcing respondents to mentally invert the scale every time they answer. This increases the chance of errors, especially for those who may not be fully engaged or who are rushing through the survey. It introduces an extra processing step that can lead to response bias. Someone might quickly glance at the scale, see a "5", and associate it with high satisfaction because they're used to higher numbers meaning "better". They'll then incorrectly select "5" when they actually mean "1". Here's an improved version: "On a scale of 1 to 5, where 1 means strongly disagree and 5 means strongly agree, how satisfied are you with our service?" |
2 | Do you believe our IT team's collaborative synergy effectively optimizes our digital infrastructure? | Oh boy, that question is a mess! It's got several cognitive issues that would seriously hamper getting useful data. Let's break it down: First, **jargon overload**. "Collaborative synergy" and "optimizes our digital infrastructure" are incredibly complex phrases. Most respondents won't have a clear understanding of what those terms mean in the context of their daily experience with the IT team. They might guess, leading to inconsistent and unreliable answers. It's basically asking respondents to interpret a mini-essay before even answering. Second, **double-barreled question**. It asks about *two* separate things: collaboration and effectiveness. If someone thinks the team collaborates well but the infrastructure isn't optimized, how are they supposed to answer? They're forced to pick one aspect to focus on, obscuring the nuances of their opinion. Third, **leading question**. The phrasing implies a positive assessment is expected. It's not neutral; it subtly pressures respondents towards a "yes" answer. This biases the results and prevents honest, critical feedback. Finally, **lack of specificity**. What exactly constitutes "effective optimization"? What aspects of the digital infrastructure are we talking about? Email? Network speed? Software availability? The vagueness makes it difficult for respondents to form a concrete opinion and answer meaningfully. Here's how I'd improve it, breaking it into multiple, simpler questions: **Improved Questions:** 1. **Regarding the IT team's collaboration, how would you rate their teamwork and communication?** (Scale: 1-5, 1 being very poor, 5 being excellent) This directly addresses collaboration without jargon. 2. **How satisfied are you with the following aspects of our digital infrastructure?** (Multiple questions, one for each aspect, using a 1-5 scale): * Email reliability * Network speed and stability * Access to necessary software and tools * IT support responsiveness * ... (Add other relevant aspects) |
3 | What do you think of our recent implementation of Project X57? | Okay, let's take a look at that question. "What do you think of our recent implementation of Project X57?" Hmm, there are a few cognitive issues lurking here. First, it's incredibly **open-ended**. It gives respondents *no guidance* whatsoever. They could answer with anything from a single word ("Great!") to a rambling, unstructured narrative. This makes analysis a nightmare. We'll get a huge variety of responses, making it hard to identify trends or meaningful patterns. The sheer volume of qualitative data will be difficult to manage and summarize effectively. Second, the question assumes a level of familiarity with "Project X57" that might not exist. Some respondents might not even know what it *is*, leading to inaccurate or irrelevant responses. Even if they've *heard* of it, they might not have enough information to form a meaningful opinion. Third, the word "implementation" is a bit vague. Did they think the *planning* of the implementation was poor? The *rollout*? The *training*? The *communication*? The question doesn't specify what aspect of the implementation they should focus on. This lack of clarity leads to inconsistent answers and makes it hard to compare responses. Here's how I'd improve the question, addressing these cognitive issues: **Improved Version:** "Thinking about our recent implementation of Project X57, please rate your level of agreement with the following statements using the scale below. (1=Strongly Disagree, 5=Strongly Agree)" Then, I'd follow with a series of *specific* statements, each focusing on a different aspect of the implementation: * The project was well-planned. * The project was implemented efficiently. * I received adequate training on the project. * Communication about the project was clear and timely. * The project has improved [mention specific benefit, e.g., workflow efficiency, customer satisfaction]. * I am satisfied with the overall outcome of the project. This revised approach uses a **Likert scale**, which is much easier to analyze quantitatively. It breaks down the broad question into more manageable, specific components, addressing the vagueness and providing a structured response format. It also ensures that everyone is answering the same basic questions, leading to more reliable and comparable data. Finally, by focusing on specific aspects, we get a much clearer picture of what aspects of the implementation were successful and which ones need improvement. |
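The results can also be exported for further analysis. A minimal sketch, assuming the to_pandas helper is available in your EDSL version and pandas is installed:

# Convert the selected columns into a pandas DataFrame for further analysis
df = results.select("scenario.draft_text", "answer.cognitive_review").to_pandas()
df.head()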
Qualitative reviews
In this example we use a set of hypothetical customer service tickets and prompt a model to extract a set of themes that we could use in follow-on questions (e.g., as answer options for multiple choice questions).
[11]:
from edsl import QuestionList
[12]:
tickets = [
"I waited for 20 minutes past the estimated arrival time, and the driver still hasn't arrived. This made me late for my appointment.",
"The driver was very rude and had an unpleasant attitude during the entire ride. It was an uncomfortable experience.",
"The driver was speeding and frequently changing lanes without signaling. I felt unsafe throughout the ride.",
"The car I rode in was dirty and messy. There were crumbs on the seats, and it didn't look like it had been cleaned in a while.",
"The driver took a longer route, which resulted in a significantly higher fare than expected. I believe they intentionally extended the trip.",
"I was charged for a ride that I did not take. The ride appears on my account, but I was not in the vehicle at that time.",
"I left my wallet in the car during my last ride. I've tried contacting the driver, but I haven't received a response.",
]
Create an agent with a relevant persona:
[13]:
a_customer_service = Agent(
traits = {
"background": "You are an experienced customer service agent for a ridesharing company."
}
)
Create a question about the texts:
[14]:
q_topics = QuestionList(
question_name = "ticket_topics",
question_text = "Create a list of the topics raised in these customer service tickets: {{ tickets_texts }}.",
)
Add the texts to the question as a single scenario:
[15]:
s = Scenario({"tickets_texts": tickets})
Generate results:
[16]:
topics = q_topics.by(s).by(a_customer_service).by(m).run()
Job UUID | 2d0421fd-1dbd-4511-8a62-02b084503273 |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/2d0421fd-1dbd-4511-8a62-02b084503273 |
Error Report URL | None |
Results UUID | 59f6a748-6212-4151-a4e9-cc1c6a8d0ec4 |
Results URL | None |
Inspect the results:
[17]:
topics.select("ticket_topics").to_list()[0]
[17]:
['Excessive wait time',
"Driver's rude behavior",
'Unsafe driving',
'Vehicle cleanliness',
'Unnecessarily long route/fare dispute',
'Incorrect/fraudulent charge',
'Lost item in vehicle']
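As suggested above, the extracted topics can be reused as answer options in a follow-on question. A minimal sketch using QuestionMultipleChoice (the question name and text are illustrative):

from edsl import QuestionMultipleChoice

# Hypothetical follow-on question that reuses the extracted topics as answer options
q_topic = QuestionMultipleChoice(
    question_name = "primary_topic",
    question_text = "Which topic best describes this ticket: {{ ticket }}",
    question_options = topics.select("ticket_topics").to_list()[0],
)

Each ticket could then be passed to this question as a scenario, as in the data labeling example below.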
Data labeling
In this example we prompt an LLM to rate the seriousness of tickets about safety issues.
For a more complex data labeling exercise, see also this notebook: Data Labeling Agents.
[18]:
from edsl import QuestionLinearScale
[19]:
safety_tickets = [
"During my ride, I noticed that the driver was frequently checking their phone for directions, which made me a bit uncomfortable. It didn't feel like they were fully focused on the road.",
"The driver had to brake abruptly to avoid a collision with another vehicle. It was a close call, and it left me feeling quite shaken. Please address this issue.",
"I had a ride with a driver who was clearly speeding and weaving in and out of traffic. Their reckless driving put my safety at risk, and I'm very concerned about it.",
"My ride was involved in a minor accident, and although no one was seriously injured, it was a scary experience. The driver is handling the situation, but I wanted to report it.",
"I had a ride with a driver who exhibited aggressive and threatening behavior towards me during the trip. I felt genuinely unsafe and want this matter to be taken seriously.",
]
[20]:
q_rating = QuestionLinearScale(
question_name = "safety_rating",
question_text = """Rate the seriousness of the issue raised in the following customer service ticket
on a scale from 1 to 10: {{ ticket }}""",
question_options = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
option_labels = {1:"Not at all serious", 10:"Very serious"}
)
[21]:
s = ScenarioList.from_list("ticket", safety_tickets)
[22]:
r_rating = q_rating.by(s).by(a_customer_service).by(m).run()
Job UUID | 93f570ef-b68c-40fd-a212-c11cfa5db3f6 |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/93f570ef-b68c-40fd-a212-c11cfa5db3f6 |
Error Report URL | None |
Results UUID | 3cf148e7-83ba-4506-baeb-3d49b0db6808 |
Results URL | None |
[23]:
r_rating.select("scenario.*", "answer.*")
[23]:
scenario.ticket | answer.safety_rating | |
---|---|---|
0 | During my ride, I noticed that the driver was frequently checking their phone for directions, which made me a bit uncomfortable. It didn't feel like they were fully focused on the road. | 7 |
1 | The driver had to brake abruptly to avoid a collision with another vehicle. It was a close call, and it left me feeling quite shaken. Please address this issue. | 8 |
2 | I had a ride with a driver who was clearly speeding and weaving in and out of traffic. Their reckless driving put my safety at risk, and I'm very concerned about it. | 10 |
3 | My ride was involved in a minor accident, and although no one was seriously injured, it was a scary experience. The driver is handling the situation, but I wanted to report it. | 8 |
4 | I had a ride with a driver who exhibited aggressive and threatening behavior towards me during the trip. I felt genuinely unsafe and want this matter to be taken seriously. | 10 |
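The ratings can then be used to triage the tickets. A minimal sketch, assuming the filter method accepts an expression over answer columns (the threshold is illustrative):

# Keep only the tickets rated 9 or higher (threshold chosen for illustration)
serious = r_rating.filter("safety_rating >= 9")
serious.select("scenario.ticket", "answer.safety_rating")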
Creating new methods
We can use the question prompts to create new methods, such as a translator:
[24]:
def translate_to_german(text):
q = QuestionFreeText(
question_name="deutsch",
question_text="Please translate '{{ text }}' into German",
)
result = q.by(Scenario({"text": text})).run()
return result.select("deutsch").print()
[25]:
translate_to_german("Hello, friend, have you been traveling?")
Job UUID | 6d570ed6-ad70-41dc-9f6e-76a7371211b0 |
Progress Bar URL | https://www.expectedparrot.com/home/remote-job-progress/6d570ed6-ad70-41dc-9f6e-76a7371211b0 |
Error Report URL | None |
Results UUID | d13fdead-d992-45f7-a5d0-45c26d7eba8f |
Results URL | None |
[25]:
answer.deutsch | |
---|---|
0 | The translation of "Hello, friend, have you been traveling?" into German is "Hallo, Freund, bist du gereist?" |
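The same pattern can be generalized by parameterizing the target language as well. A minimal sketch (the function and question names are illustrative):

def translate_to(text, language):
    q = QuestionFreeText(
        question_name = "translation",
        question_text = "Please translate '{{ text }}' into {{ language }}",
    )
    result = q.by(Scenario({"text": text, "language": language})).run()
    return result.select("translation").to_list()[0]

# Example usage:
# translate_to("Hello, friend, have you been traveling?", "French")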
Posting to Coop
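We can post this notebook to Coop, the Expected Parrot platform for creating, storing and sharing EDSL content: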
[27]:
from edsl import Notebook
n = Notebook("research_methods.ipynb")
info = n.push(description = "Using EDSL to create research methods", visibility = "public")
info
[27]:
{'description': 'Using EDSL to create research methods',
'object_type': 'notebook',
'url': 'https://www.expectedparrot.com/content/9e6f1c95-cde7-45f0-9cf1-7aef7a34ba94',
'uuid': '9e6f1c95-cde7-45f0-9cf1-7aef7a34ba94',
'version': '0.1.39.dev2',
'visibility': 'public'}
To update an object at Coop:
[28]:
n = Notebook("research_methods.ipynb") # resave the notebook to capture the latest changes
n.patch(uuid = info["uuid"], value = n)
[28]:
{'status': 'success'}