Results
A Results object represents the outcome of running a Survey. It contains a list of individual Result objects, where each Result corresponds to a response to the survey for a unique combination of Agent, Model, and Scenario objects used with the survey.
For example, if a survey (of one or more questions) is administered to 2 agents and 2 language models (without any scenarios for the questions), the Results will contain 4 Result objects: one for each combination of agent and model used with the survey. If the survey questions are parameterized with 2 scenarios, the Results will expand to include 8 Result objects, accounting for all combinations of agents, models and scenarios.
Generating results
A Results object is not typically instantiated directly, but is returned by calling the run() method of a Survey after any agents, language models and scenarios are added to it.
In order to demonstrate how to access and interact with results, we use the following code to generate results for a simple survey. Note that specifying agent traits, scenarios (question parameter values) and language models is optional, and we include those steps here for illustration purposes. See the Agents, Scenarios and models sections for more details on these components.
Note: You must store API keys for language models in order to generate results. Please see the Managing Keys section for instructions on activating Remote Inference or storing your own API keys for inference service providers.
To construct a survey we start by creating questions:
from edsl import QuestionLinearScale, QuestionMultipleChoice
q1 = QuestionLinearScale(
    question_name = "important",
    question_text = "On a scale from 1 to 5, how important to you is {{ topic }}?",
    question_options = [0, 1, 2, 3, 4, 5],
    option_labels = {0: "Not at all important", 5: "Very important"}
)

q2 = QuestionMultipleChoice(
    question_name = "read",
    question_text = "Have you read any books about {{ topic }}?",
    question_options = ["Yes", "No", "I do not know"]
)
We combine them in a survey to administer them together:
from edsl import Survey
survey = Survey([q1, q2])
We have parameterized our questions, so we can use them with different scenarios:
from edsl import ScenarioList
scenarios = ScenarioList.from_list("topic", ["climate change", "house prices"])
We can optionally create agents with personas or other relevant traits to answer the survey:
from edsl import AgentList, Agent
agents = AgentList(
    Agent(traits = {"persona": p}) for p in ["student", "celebrity"]
)
We can specify the language models that we want to use to generate responses:
from edsl import ModelList, Model
models = ModelList(
    Model(m) for m in ["gemini-1.5-flash", "gpt-4o"]
)
Finally, we generate results by adding the scenarios, agents and models to the survey and calling the run() method:
results = survey.by(scenarios).by(agents).by(models).run()
For more details on each of the above steps, please see the Agents, Scenarios and models sections of the docs.
Result objects
We can check the number of Result objects created by inspecting the length of the Results:
len(results)
This will count 2 (scenarios) x 2 (agents) x 2 (models) = 8 Result objects:
8
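Because a Results object behaves like a list of Result objects, we can also iterate over it directly. The following is a minimal sketch using the convenience properties (agent, scenario, model, answer) described in the Result class section at the end of this page; the exact accessors may vary slightly by version:

for result in results:
    # Each Result bundles one agent/model/scenario combination and its answers
    print(result.agent.traits, result.scenario["topic"], result.answer["important"])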
Generating multiple results
If we want to generate multiple results for a survey (i.e., more than one result for each combination of Agent, Model and Scenario objects used), we can pass the desired number of iterations when calling the run() method. For example, the following code will run our survey 3 times for each combination (n=3):
results = survey.by(scenarios).by(agents).by(models).run(n=3)
We can verify that the number of Result objects created is now 24 = 3 iterations x 2 scenarios x 2 agents x 2 models:
len(results)
24
We can readily inspect a result:
results[0]
Output:
| key | value |
|---|---|
| agent:traits | {'persona': 'student'} |
| scenario:topic | climate change |
| model:model | gemini-1.5-flash |
| model:parameters | {'temperature': 0.5, 'topP': 1, 'topK': 1, 'maxOutputTokens': 2048, 'stopSequences': []} |
| iteration | 0 |
| answer:important | 5 |
| answer:read | Yes |
| prompt:important_user_prompt | {'text': 'On a scale from 1 to 5, how important to you is climate change?\n\n0 : Not at all important\n\n1 : \n\n2 : \n\n3 : \n\n4 : \n\n5 : Very important\n\nOnly 1 option may be selected.\n\nRespond only with the code corresponding to one of the options. E.g., "1" or "5" by itself.\n\nAfter the answer, you can put a comment explaining why you chose that option on the next line.', 'class_name': 'Prompt'} |
| prompt:important_system_prompt | {'text': "You are answering questions as if you were a human. Do not break character. Your traits: {'persona': 'student'}", 'class_name': 'Prompt'} |
| prompt:read_user_prompt | {'text': '\nHave you read any books about climate change?\n\n \nYes\n \nNo\n \nI do not know\n \n\nOnly 1 option may be selected.\n\nRespond only with a string corresponding to one of the options.\n\n\nAfter the answer, you can put a comment explaining why you chose that option on the next line.', 'class_name': 'Prompt'} |
| prompt:read_system_prompt | {'text': "You are answering questions as if you were a human. Do not break character. Your traits: {'persona': 'student'}", 'class_name': 'Prompt'} |
| raw_model_response:important_raw_model_response | {'candidates': [{'content': {'parts': [{'text': "5\n\nIt's, like, a huge deal. The future of the planet is at stake, you know? We're talking about everything from extreme weather to rising sea levels – it affects everyone, and it's something we all need to be seriously concerned about.\n"}], 'role': 'model'}, 'finish_reason': 1, 'safety_ratings': [{'category': 8, 'probability': 1, 'blocked': False}, {'category': 10, 'probability': 1, 'blocked': False}, {'category': 7, 'probability': 1, 'blocked': False}, {'category': 9, 'probability': 1, 'blocked': False}], 'avg_logprobs': -0.19062816490561274, 'token_count': 0, 'grounding_attributions': []}], 'usage_metadata': {'prompt_token_count': 129, 'candidates_token_count': 59, 'total_token_count': 188, 'cached_content_token_count': 0}} |
| raw_model_response:important_cost | 0.000027 |
| raw_model_response:important_one_usd_buys | 36529.685735 |
| raw_model_response:read_raw_model_response | {'candidates': [{'content': {'parts': [{'text': "Yes\n\nI've read a few articles and some chapters from textbooks for my environmental science class, which touched upon climate change. It's not exactly the same as reading a whole book dedicated to the topic, but it counts, right?\n"}], 'role': 'model'}, 'finish_reason': 1, 'safety_ratings': [{'category': 8, 'probability': 1, 'blocked': False}, {'category': 10, 'probability': 1, 'blocked': False}, {'category': 7, 'probability': 1, 'blocked': False}, {'category': 9, 'probability': 1, 'blocked': False}], 'avg_logprobs': -0.13118227790383732, 'token_count': 0, 'grounding_attributions': []}], 'usage_metadata': {'prompt_token_count': 96, 'candidates_token_count': 51, 'total_token_count': 147, 'cached_content_token_count': 0}} |
| raw_model_response:read_cost | 0.000022 |
| raw_model_response:read_one_usd_buys | 44444.451200 |
| question_to_attributes:important | {'question_text': 'On a scale from 1 to 5, how important to you is {{ topic }}?', 'question_type': 'linear_scale', 'question_options': [0, 1, 2, 3, 4, 5]} |
| question_to_attributes:read | {'question_text': 'Have you read any books about {{ topic }}?', 'question_type': 'multiple_choice', 'question_options': ['Yes', 'No', 'I do not know']} |
| generated_tokens:important_generated_tokens | 5 It's, like, a huge deal. The future of the planet is at stake, you know? We're talking about everything from extreme weather to rising sea levels – it affects everyone, and it's something we all need to be seriously concerned about. |
| generated_tokens:read_generated_tokens | Yes I've read a few articles and some chapters from textbooks for my environmental science class, which touched upon climate change. It's not exactly the same as reading a whole book dedicated to the topic, but it counts, right? |
| comments_dict:important_comment | It's, like, a huge deal. The future of the planet is at stake, you know? We're talking about everything from extreme weather to rising sea levels – it affects everyone, and it's something we all need to be seriously concerned about. |
| comments_dict:read_comment | I've read a few articles and some chapters from textbooks for my environmental science class, which touched upon climate change. It's not exactly the same as reading a whole book dedicated to the topic, but it counts, right? |
| cache_keys:important | 98d6961d0529335b74f2363ba9b7a8de |
| cache_keys:read | 12af825953d89c1f776bd3af40e37cfb |
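A single Result also supports dictionary-style access keyed by the components shown above (it implements the UserDict interface; see the Result class section at the end of this page). A brief sketch, assuming these key names:

result = results[0]
result["answer"]["important"]     # e.g., 5
result["scenario"]["topic"]       # e.g., 'climate change'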
Results components
Results contain components that can be accessed and analyzed individually or collectively. We can see a list of these components by calling the columns method:
results.columns
For the results generated by the above code, the columns (described in detail below) include information about each agent, model and the corresponding prompts used to simulate an answer to each question and scenario in the survey, together with each raw model response. If the survey was run multiple times (run(n=<integer>)), the iteration.iteration column shows the iteration number for each result.
Agent information:
agent.instruction: The optional instruction that was passed to the agent when it was created.
agent.agent_name: This field is always included in any Results object. It contains a unique identifier for each Agent, which can be specified when an agent is created (Agent(name=<name>, traits={<traits_dict>})); if not specified, it is added automatically when results are generated (in the form Agent_0, etc.).
agent.persona: Each of the traits that we pass to an agent is represented in a column of the results. Our example code created a "persona" trait for each agent, so our results include a "persona" column for this information. Note that the keys of the traits dictionary should be valid Python identifiers.
Answer information:
answer.important: Agent responses to the linear scale important question.
answer.read: Agent responses to the multiple choice read question.
Cache information:
cache_keys.important_cache_key: The cache key for the important question.
cache_keys.important_cache_used: Whether the existing cache was used for the important question.
cache_keys.read_cache_key: The cache key for the read question.
cache_keys.read_cache_used: Whether the existing cache was used for the read question.
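For example, we can check whether answers were retrieved from an existing cache by selecting these columns (using the select() method described below):

results.select("model", "topic", "cache_keys.important_cache_used")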
Comment information:
A "comment" field is automatically included for every question in a survey other than free text questions, allowing the model to provide additional information about its response. The default instruction to provide a comment is included in the user_prompt for a question, and can be modified or omitted when creating the question (see the Prompts section for details on modifying user and system prompts, and the information about prompts in results below). Comments can also be excluded by passing include_comment=False to a question when it is created.
comment.important_comment: Agent commentary on responses to the important question.
comment.read_comment: Agent commentary on responses to the read question.
Generated tokens information:
generated_tokens.important_generated_tokens: The generated tokens for the important question.
generated_tokens.read_generated_tokens: The generated tokens for the read question.
Iteration information:
The iteration column shows the number of the run (run(n=<integer>)) for the combination of components used (scenarios, agents and models).
Model information:
Each of the model columns corresponds to a modifiable parameter of the models used to generate the responses.
model.frequency_penalty: The frequency penalty for the model.
model.logprobs: The logprobs for the model.
model.maxOutputTokens: The maximum number of output tokens for the model.
model.max_tokens: The maximum number of tokens for the model.
model.model: The name of the model used.
model.presence_penalty: The presence penalty for the model.
model.stopSequences: The stop sequences for the model.
model.temperature: The temperature for the model.
model.topK: The top k for the model.
model.topP: The top p for the model.
model.top_logprobs: The top logprobs for the model.
model.top_p: The top p for the model.
model.use_cache: Whether the model uses cache.
Note: Some of the above fields are particular to specific models, and may have different names (e.g., top_p vs. topP).
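For example, we can compare a parameter across the models by selecting the relevant columns (using the select() method described below); note that some parameters only apply to certain models:

results.select("model.model", "model.temperature")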
Prompt information:
prompt.important_system_prompt: The system prompt for the important question.
prompt.important_user_prompt: The user prompt for the important question.
prompt.read_system_prompt: The system prompt for the read question.
prompt.read_user_prompt: The user prompt for the read question.
For more details about prompts, please see the Prompts section.
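For example, we can inspect the exact prompts used for the important question alongside the model that received them:

results.select("model.model", "prompt.important_user_prompt", "prompt.important_system_prompt")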
Question information:
question_options.important_question_options: The options for the important question, if any.
question_options.read_question_options: The options for the read question, if any.
question_text.important_question_text: The text of the important question.
question_text.read_question_text: The text of the read question.
question_type.important_question_type: The type of the important question.
question_type.read_question_type: The type of the read question.
Raw model response information:
raw_model_response.important_cost: The cost of the result for the important question, applying the token quantities and prices.
raw_model_response.important_one_usd_buys: The number of identical results for the important question that 1USD would cover.
raw_model_response.important_raw_model_response: The raw model response for the important question.
raw_model_response.read_cost: The cost of the result for the read question, applying the token quantities and prices.
raw_model_response.read_one_usd_buys: The number of identical results for the read question that 1USD would cover.
raw_model_response.read_raw_model_response: The raw model response for the read question.
Note that the cost of a result for a question is specific to the components (scenario, agent, and model) used with it.
Scenario information:
scenario.scenario_index: The index of the scenario.
scenario.topic: The values provided for the “topic” scenario for the questions.
Creating tables by selecting columns
Each of these columns can be accessed directly by calling the select() method and passing the column names. Alternatively, we can specify the columns to exclude by calling the drop() method. These methods can be chained together to display the specified columns in a table format.
For example, the following code will print a table showing the answers for read and important together with model, persona and topic columns (because the column names are unique we can drop the model, agent, scenario and answer prefixes when selecting them):
results = survey.by(scenarios).by(agents).by(models).run() # Running the survey once
results.select("model", "persona", "topic", "read", "important")
A table with the selected columns will be printed:
| model.model | agent.persona | scenario.topic | answer.read | answer.important |
|---|---|---|---|---|
| gemini-1.5-flash | student | climate change | Yes | 5 |
| gpt-4o | student | climate change | Yes | 5 |
| gemini-1.5-flash | student | house prices | No | 1 |
| gpt-4o | student | house prices | No | 3 |
| gemini-1.5-flash | celebrity | climate change | Yes | 5 |
| gpt-4o | celebrity | climate change | Yes | 5 |
| gemini-1.5-flash | celebrity | house prices | Yes | 3 |
| gpt-4o | celebrity | house prices | No | 3 |
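The drop() method mentioned above works the other way around: we name columns to exclude rather than include. A minimal sketch, chaining it with select() (the exact signature of drop() may differ by version):

# Display the same table as above, but without the topic column (a sketch)
results.select("model", "persona", "topic", "read", "important").drop("topic")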
Sorting results
We can sort the columns by calling the sort_by method and passing it the column names to sort by:
(
    results
    .sort_by("model", "persona", reverse=False)
    .select("model", "persona", "topic", "read", "important")
)
The following table will be printed:
| model.model | agent.persona | scenario.topic | answer.read | answer.important |
|---|---|---|---|---|
| gemini-1.5-flash | celebrity | climate change | Yes | 5 |
| gemini-1.5-flash | celebrity | house prices | Yes | 3 |
| gemini-1.5-flash | student | climate change | Yes | 5 |
| gemini-1.5-flash | student | house prices | No | 1 |
| gpt-4o | celebrity | climate change | Yes | 5 |
| gpt-4o | celebrity | house prices | No | 3 |
| gpt-4o | student | climate change | Yes | 5 |
| gpt-4o | student | house prices | No | 3 |
Labeling results
We can also add some table labels by passing a dictionary to the pretty_labels argument of the print method (note that we need to include the column prefixes when specifying the table labels, as shown below):
(
    results
    .sort_by("model", "persona", reverse=True)
    .select("model", "persona", "topic", "read", "important")
    .print(pretty_labels={
        "model.model": "LLM",
        "agent.persona": "Agent",
        "scenario.topic": "Topic",
        "answer.read": q2.question_text,
        "answer.important": q1.question_text
    }, format="rich")
)
The following table will be printed:
| LLM | Agent | Topic | Have you read any books about {{ topic }}? | On a scale from 1 to 5, how important to you is {{ topic }}? |
|---|---|---|---|---|
| gpt-4o | student | climate change | Yes | 5 |
| gpt-4o | student | house prices | No | 3 |
| gpt-4o | celebrity | climate change | Yes | 5 |
| gpt-4o | celebrity | house prices | No | 3 |
| gemini-1.5-flash | student | climate change | Yes | 5 |
| gemini-1.5-flash | student | house prices | No | 1 |
| gemini-1.5-flash | celebrity | climate change | Yes | 5 |
| gemini-1.5-flash | celebrity | house prices | Yes | 3 |
Filtering results
Results can be filtered by using the filter method and passing it a logical expression identifying the results that should be selected. For example, the following code will filter results where the answer to important is “5” and then just print the topic and important_comment columns:
(
    results
    .filter("important == 5")
    .select("topic", "important", "important_comment")
)
This will return an abbreviated table:
| scenario.topic | answer.important | comment.important_comment |
|---|---|---|
| climate change | 5 | It's, like, a huge deal. The future of the planet is at stake, and that affects everything – from the environment to the economy to social justice. It's something I worry about a lot. |
| climate change | 5 | As a student, I'm really concerned about climate change because it affects our future and the planet we'll inherit. It's crucial to understand and address it to ensure a sustainable world for generations to come. |
| climate change | 5 | It's a huge issue, you know? We only have one planet, and if we don't take care of it, what kind of world are we leaving for future generations? It's not just about polar bears; it's about everything. It's my responsibility, as someone with a platform, to speak out about it. |
| climate change | 5 | Climate change is a critical issue that affects everyone globally, and as a public figure, I believe it's important to use my platform to raise awareness and advocate for sustainable practices. |
Note: The filter method allows us to pass the unique short names of the columns (without the prefixes) when specifying the logical expression. However, because the model.model column name is also a prefix, we need to include the prefix when filtering by this column, as shown in the example below:
(
    results
    .filter("model.model == 'gpt-4o'")
    .select("model", "persona", "topic", "read", "important")
)
This will return a table of results where the model is “gpt-4o”:
| model.model | agent.persona | scenario.topic | answer.read | answer.important |
|---|---|---|---|---|
| gpt-4o | student | climate change | Yes | 5 |
| gpt-4o | student | house prices | No | 3 |
| gpt-4o | celebrity | climate change | Yes | 5 |
| gpt-4o | celebrity | house prices | No | 3 |
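Filter expressions can also combine multiple conditions. For example, assuming the expression syntax supports standard logical operators, a brief sketch:

(
    results
    .filter("persona == 'student' and important >= 4")
    .select("model", "persona", "topic", "important")
)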
Limiting results
We can select and print a limited number of results by passing the desired number of rows as max_rows to the print() method. This can be useful for quickly checking the first few results:
(
    results
    .select("model", "persona", "topic", "read", "important")
    .print(max_rows=4, format="rich")
)
This will return a table of the selected components of the first 4 results:
| model.model | agent.persona | scenario.topic | answer.read | answer.important |
|---|---|---|---|---|
| gemini-1.5-flash | student | climate change | Yes | 5 |
| gpt-4o | student | climate change | Yes | 5 |
| gemini-1.5-flash | student | house prices | No | 1 |
| gpt-4o | student | house prices | No | 3 |
Sampling results
We can select a random sample of results by passing the desired number to the sample() method. This can be useful for spot-checking a subset of the results:
sample_results = results.sample(2)
(
    sample_results
    .sort_by("model")
    .select("model", "persona", "topic", "read", "important")
)
This will return a table of the specified number of randomly selected results:
| model.model | agent.persona | scenario.topic | answer.read | answer.important |
|---|---|---|---|---|
| gpt-4o | celebrity | house prices | No | 3 |
| gpt-4o | celebrity | climate change | Yes | 5 |
Shuffling results
We can shuffle results by calling the shuffle() method. This randomly reorders the Result objects, which can be useful when we want to inspect results without any systematic ordering:
shuffle_results = results.shuffle()
(
    shuffle_results
    .select("model", "persona", "topic", "read", "important")
)
This will return a table of shuffled results:
| model.model | agent.persona | scenario.topic | answer.read | answer.important |
|---|---|---|---|---|
| gemini-1.5-flash | celebrity | climate change | Yes | 5 |
| gpt-4o | student | house prices | No | 3 |
| gemini-1.5-flash | celebrity | house prices | Yes | 3 |
| gemini-1.5-flash | student | house prices | No | 1 |
| gpt-4o | celebrity | house prices | No | 3 |
| gpt-4o | celebrity | climate change | Yes | 5 |
| gpt-4o | student | climate change | Yes | 5 |
| gemini-1.5-flash | student | climate change | Yes | 5 |
Adding results
We can add results together straightforwardly by using the + operator:
add_results = results + results
We can see that the results have doubled:
len(add_results)
This will return the number of results:
16
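For example, results from separate runs of the same survey can be combined and then analyzed together:

# Run the survey again and append the new results to the existing ones
more_results = survey.by(scenarios).by(agents).by(models).run()
combined = results + more_results
combined.select("model", "persona", "topic", "important")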
Flattening results
If a field of results contains dictionaries, we can flatten them into separate fields by calling the flatten() method. This method takes the field to flatten and a boolean indicating whether to keep the original field in the new Results object that is returned.
For example:
from edsl import QuestionDict, Model
m = Model("gemini-1.5-flash")
q = QuestionDict(
    question_name = "recipe",
    question_text = "Please provide a simple recipe for hot chocolate.",
    answer_keys = ["title", "ingredients", "instructions"]
)
r = q.by(m).run()
r.select("model", "recipe").flatten(field="answer.recipe", keep_original=True)
This will return a table of the flattened results:
| model.model | answer.recipe | answer.recipe.title | answer.recipe.ingredients | answer.recipe.instructions |
|---|---|---|---|---|
| gemini-1.5-flash | {'title': 'Simple Hot Chocolate', 'ingredients': ['1 cup milk (dairy or non-dairy)', '1 tablespoon unsweetened cocoa powder', '1-2 tablespoons sugar (or to taste)', 'Pinch of salt'], 'instructions': ['Combine milk, cocoa powder, sugar, and salt in a small saucepan.', 'Heat over medium heat, stirring constantly, until the mixture is smooth and heated through.', 'Do not boil.', 'Pour into a mug and enjoy!']} | Simple Hot Chocolate | ['1 cup milk (dairy or non-dairy)', '1 tablespoon unsweetened cocoa powder', '1-2 tablespoons sugar (or to taste)', 'Pinch of salt'] | ['Combine milk, cocoa powder, sugar, and salt in a small saucepan.', 'Heat over medium heat, stirring constantly, until the mixture is smooth and heated through.', 'Do not boil.', 'Pour into a mug and enjoy!'] |
Generating a report
We can create a report of the results by calling the report() method and passing the columns to be included (all columns are included by default). This generates a markdown report by iterating through the rows and presenting each row as an observation. You can optionally pass headers, a divider and a limit on the number of observations to include. This can be useful for displaying a sample of larger results in a notebook that you are sharing.
For example, the following code will generate a report of the first 4 results:
from edsl import QuestionFreeText, ScenarioList, Model
m = Model("gemini-1.5-flash")
s = ScenarioList.from_list("language", ["German", "Dutch", "French", "English"])
q = QuestionFreeText(
    question_name = "poem",
    question_text = "Please write me a short poem about winter in {{ language }}."
)
r = q.by(s).by(m).run()
r.select("model", "poem", "language").report(top_n=2, divider=False, return_string=True)
This will return a report of the first 2 results:
Observation: 1
model.model
gemini-1.5-flash
answer.poem
Der Schnee fällt leis', ein weicher Flor, Die Welt in Weiß, ein Zauberchor. Die Bäume stehn, in Stille gehüllt, Der Winterwind, sein Lied erfüllt.
(Translation: The snow falls softly, a gentle veil, / The world in white, a magic choir. / The trees stand, wrapped in silence, / The winter wind, its song fulfilled.)
scenario.language
German
Observation: 2
model.model
gemini-1.5-flash
answer.poem
De winter komt, de dagen kort, De sneeuw valt zacht, een wit decor. De bomen staan, kaal en stil, Een ijzige wind, een koude tril.
(Translation: Winter comes, the days are short, / The snow falls softly, a white décor. / The trees stand, bare and still, / An icy wind, a cold shiver.)
scenario.language
Dutch
"# Observation: 1\n## model.model\ngemini-1.5-flash\n## answer.poem\nDer Schnee fällt leis', ein weicher Flor,\nDie Welt in Weiß, ein Zauberchor.\nDie Bäume stehn, in Stille gehüllt,\nDer Winterwind, sein Lied erfüllt.\n\n(Translation: The snow falls softly, a gentle veil, / The world in white, a magic choir. / The trees stand, wrapped in silence, / The winter wind, its song fulfilled.)\n## scenario.language\nGerman\n\n---\n\n# Observation: 2\n## model.model\ngemini-1.5-flash\n## answer.poem\nDe winter komt, de dagen kort,\nDe sneeuw valt zacht, een wit decor.\nDe bomen staan, kaal en stil,\nEen ijzige wind, een koude tril.\n\n(Translation: Winter comes, the days are short, / The snow falls softly, a white décor. / The trees stand, bare and still, / An icy wind, a cold shiver.)\n## scenario.language\nDutch\n"
Accessing results with SQL
We can interact with results via SQL using the sql method. This is done by passing a SQL query and a shape (“long” or “wide”) for the resulting table, where the table name in the query is “self”.
For example, the following code will return a table showing the model, persona, read and important columns for the first 4 results:
results.sql("select model, persona, read, important from self limit 4")
The following table will be displayed:
| model | persona | read | important |
|---|---|---|---|
| gemini-1.5-flash | student | Yes | 5 |
| gpt-4o | student | Yes | 5 |
| gemini-1.5-flash | student | No | 1 |
| gpt-4o | student | No | 3 |
Dataframes
We can also export results to other formats. The to_pandas method will turn our results into a Pandas dataframe:
results.to_pandas()
For example, here we use it to create a dataframe consisting of the models, personas and the answers to the important question:
results.to_pandas()[["model.model", "agent.persona", "answer.important"]]
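Because to_pandas() returns a standard pandas DataFrame, the usual pandas operations apply. For example, a brief sketch filtering the dataframe to the student persona and selecting a few columns:

df = results.to_pandas()
# Standard pandas filtering and column selection on the exported results
df[df["agent.persona"] == "student"][["model.model", "scenario.topic", "answer.important"]]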
Exporting to CSV or JSON
The to_csv method will write the results to a CSV file:
results.to_pandas().to_csv("results.csv")
The to_json method will write the results to a JSON file:
results.to_pandas().to_json("results.json")
Revising prompts to improve results
If any of your results are missing model responses, you can use the spot_issues() method to help identify the problems and then revise the prompts to improve the results. This method runs a meta-survey of two questions for any prompts that generated a bad or null response, and then returns the results of the meta-survey.
The first question in the survey is a QuestionFreeText question which prompts the model to describe the likely issues with the prompts:
The following prompts generated a bad or null response: '{{ original_prompts }}'
What do you think was the likely issue(s)?
The second question in the survey is a QuestionDict question which prompts the model to return a dictionary consisting of revised user and system prompts:
The following prompts generated a bad or null response: '{{ original_prompts }}'
You identified the issue(s) as '{{ issues.answer }}'.
Please revise the prompts to address the issue(s).
You can optionally pass a list of models to use with the meta-survey, instead of the default model.
Example usage:
# Returns a Results object with the results of the meta-survey
results.spot_issues(models=["gpt-4o"])
# You can inspect the metadata for your original prompts together with the results of the meta-survey
results.select(
    "original_question",        # The name of the question that generated a bad or null response
    "original_agent_index",     # The index of the agent that generated a bad or null response
    "original_scenario_index",  # The index of the scenario that generated a bad or null response
    "original_prompts",         # The original prompts that generated a bad or null response
    "answer.issues",            # Free text description of potential issues in the original prompts
    "answer.revised"            # A dictionary of revised user and system prompts
)
See an example of the method.
Exceptions
If any exceptions are raised when the survey is run, a detailed exceptions report is generated and can be opened in your browser. See the Exceptions & Debugging section for more information on exceptions.
Result class
The Result class captures the complete data from one agent interview.
A Result object stores the agent, scenario, language model, and all answers provided during an interview, along with metadata such as token usage, caching information, and raw model responses. It provides a rich interface for accessing this data and supports serialization for storage and retrieval.
Key features:
Dictionary-like access to all data through the UserDict interface
Properties for convenient access to common attributes (agent, scenario, model, answer)
Rich data structure with sub-dictionaries for organization
Support for scoring results against reference answers
Serialization to/from dictionaries for storage
Results are typically created by the Jobs system when running interviews and collected into a Results collection for analysis. You rarely need to create Result objects manually.
Results class
A collection of Result objects with powerful data analysis capabilities.
The Results class is the primary container for working with data from EDSL surveys. It provides a rich set of methods for data analysis, transformation, and visualization inspired by data manipulation libraries like dplyr and pandas. The Results class implements a functional, fluent interface for data manipulation where each method returns a new Results object, allowing method chaining.
- Attributes:
  survey: The Survey object containing the questions used to generate results.
  data: A list of Result objects containing the responses.
  created_columns: A list of column names created through transformations.
  cache: A Cache object for storing model responses.
  completed: Whether the Results object is ready for use.
  task_history: A TaskHistory object containing information about the tasks.
  known_data_types: List of valid data type strings for accessing data.
- Key features:
List-like interface for accessing individual Result objects
Selection of specific data columns with select()
Filtering results with boolean expressions using filter()
Creating new derived columns with mutate()
Recoding values with recode() and answer_truncate()
Sorting results with order_by()
Converting to other formats (dataset, table, pandas DataFrame)
Serialization for storage and retrieval
Support for remote execution and result retrieval
- Results objects have a hierarchical structure with the following components:
Each Results object contains multiple Result objects
Each Result object contains data organized by type (agent, scenario, model, answer, etc.)
Each data type contains multiple attributes (e.g., “how_feeling” in the answer type)
You can access data in a Results object using dot notation (answer.how_feeling) or using just the attribute name if it’s not ambiguous (how_feeling).
The Results class also tracks “created columns” - new derived values that aren’t part of the original data but were created through transformations.
- Examples:
  >>> # Create a simple Results object from example data
  >>> r = Results.example()
  >>> len(r) > 0  # Contains Result objects
  True
  >>> # Filter and transform data
  >>> filtered = r.filter("how_feeling == 'Great'")
  >>> # Access hierarchical data
  >>> 'agent' in r.known_data_types
  True
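For example, the mutate() method listed above creates such a derived column from an expression over existing columns. A brief sketch, assuming the expression syntax mirrors that used by filter():

# Create a derived column flagging high importance ratings, then display it
results.mutate("is_important = important >= 4").select("topic", "important", "is_important")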