Dataset

The Dataset class is a versatile data container for tabular data with powerful manipulation capabilities. It represents data in a column-oriented format, providing methods for analysis, transformation, and visualization.

Overview

Dataset is a fundamental data structure in EDSL. Its methods for manipulating, analyzing, visualizing, and exporting tabular data resemble those of tools such as pandas or dplyr.

Key features:

  1. Flexible data manipulation (filtering, sorting, transformation)

  2. Visualization capabilities with multiple rendering options

  3. Export to various formats (CSV, Excel, Pandas, etc.)

  4. Integration with other EDSL components

Creating Datasets

Datasets can be created from various sources:

From dictionaries:

from edsl import Dataset

# Create a dataset with two columns
d = Dataset([{'a': [1, 2, 3]}, {'b': [4, 5, 6]}])

From existing EDSL objects and pandas DataFrames:

# From Results object
dataset = results.select('how_feeling', 'agent.status')

# From pandas DataFrame
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
dataset = Dataset.from_pandas_dataframe(df)

Displaying and Visualizing Data

The Dataset class provides multiple ways to display and visualize data:

Basic display:

# Print the dataset
dataset.print()

# Display only the first few rows
dataset.head()

# Display only the last few rows
dataset.tail()

Table Display Options

You can control the table formatting using the tablefmt parameter:

# Display as an ASCII grid
dataset.table(tablefmt="grid")

# Display as a pipe-delimited Markdown table
dataset.table(tablefmt="pipe")

# Display as HTML
dataset.table(tablefmt="html")

# Display as GitHub-flavored Markdown
dataset.table(tablefmt="github")

# Display as LaTeX
dataset.table(tablefmt="latex")

Rich Terminal Output

The Dataset class can render tables through the Rich library, which adds colors and text styles to terminal output:

# Display using Rich formatting
dataset.table(tablefmt="rich")

# Alternative syntax
dataset.print(format="rich")

This creates a nicely formatted table in the terminal with automatically sized columns, bold headers, and grid lines.

Example:

from edsl import Dataset

# Create a dataset
d = Dataset([
    {'name': ['Alice', 'Bob', 'Charlie', 'David']},
    {'age': [25, 32, 45, 19]},
    {'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
])

# Display with rich formatting
d.table(tablefmt="rich")

Data Manipulation

The Dataset class provides numerous methods for data manipulation:

Filtering:

# Filter results for a specific condition
filtered = dataset.filter("how_feeling == 'Great'")

Creating new columns:

# Create a new column
with_sentiment = dataset.mutate("sentiment = 1 if how_feeling == 'Great' else 0")

Sorting:

# Sort by a specific column
sorted_data = dataset.order_by("age", reverse=True)

Reshaping:

# Convert to long format
long_data = dataset.long()

# Convert back to wide format
wide_data = long_data.wide()

Exporting Data

Export to various formats:

# Export to CSV
dataset.to_csv("data.csv")

# Export to pandas DataFrame
df = dataset.to_pandas()

# Export to Word document
dataset.to_docx("data.docx", title="My Dataset")
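
As a rough picture of what a CSV export has to do with column-oriented data, the sketch below writes a list of single-key column dictionaries to CSV text using only the standard library. It is an illustration, not EDSL's implementation; the helper name `columns_to_csv` is hypothetical.

```python
import csv
import io

# Illustration only: plain Python data in the same shape a Dataset holds.
columns = [{"name": ["Alice", "Bob"]}, {"age": [25, 32]}]

def columns_to_csv(cols):
    """Serialize single-key column dictionaries to CSV text."""
    names = [next(iter(col)) for col in cols]
    values = [col[name] for col, name in zip(cols, names)]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(names)          # header row
    writer.writerows(zip(*values))  # one CSV row per observation
    return buffer.getvalue()

csv_text = columns_to_csv(columns)
print(csv_text)
```

The transposition with `zip(*values)` is the key step: the data is stored by column, while CSV is written by row.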

Dataset Methods

class edsl.dataset.Dataset(data: list[dict[str, Any]] = None, print_parameters: dict | None = None)

Bases: UserList, DatasetOperationsMixin, PersistenceMixin, HashingMixin

A versatile data container for tabular data with powerful manipulation capabilities.

The Dataset class is a fundamental data structure in EDSL that represents tabular data in a column-oriented format. It provides a rich set of methods for data manipulation, transformation, analysis, visualization, and export through the DatasetOperationsMixin.

Key features:

  1. Column-oriented data structure optimized for LLM experiment results

  2. Rich data manipulation API similar to dplyr/pandas (filter, select, mutate, etc.)

  3. Visualization capabilities including tables, plots, and reports

  4. Export to various formats (CSV, Excel, SQLite, pandas, etc.)

  5. Serialization for storage and transport

  6. Tree-based data exploration

A Dataset typically contains multiple columns, each represented as a dictionary with a single key-value pair. The key is the column name and the value is a list of values for that column. All columns must have the same length.
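
To make this layout concrete, the sketch below uses plain lists and dictionaries (not the EDSL class itself) to check the two constraints just described and to recover rows from the columns. The helper names `check_columns` and `to_rows` are hypothetical and exist only for illustration.

```python
# Illustration only: plain Python stand-ins, not EDSL's implementation.
columns = [{"a": [1, 2, 3]}, {"b": [4, 5, 6]}]

def check_columns(cols):
    """Return the common column length, or raise if the layout is invalid."""
    lengths = set()
    for col in cols:
        if len(col) != 1:
            raise ValueError("each column must be a single-key dictionary")
        (values,) = col.values()
        lengths.add(len(values))
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    return lengths.pop() if lengths else 0

def to_rows(cols):
    """Turn column-oriented data into a list of row dictionaries."""
    n = check_columns(cols)
    return [
        {name: values[i] for col in cols for name, values in col.items()}
        for i in range(n)
    ]

rows = to_rows(columns)
print(rows)
```

Storing data by column keeps each variable in one list, which is convenient for column-wise operations such as selecting or renaming; row views are derived on demand.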

The Dataset class inherits from:

  - UserList: Provides list-like behavior for storing column data

  - DatasetOperationsMixin: Provides data manipulation methods

  - PersistenceMixin: Provides serialization capabilities

  - HashingMixin: Provides hashing functionality for comparison and storage

Datasets are typically created by transforming other EDSL container types like Results, AgentList, or ScenarioList, but can also be created directly from data.

collapse(field: str, separator: str | None = None) -> Dataset

Collapse multiple values in a field into a single value using a separator.

Args:

field: The name of the field to collapse.

separator: Optional string to use as a separator between values. Defaults to a space if not specified.

Examples:
>>> d = Dataset([{'words': [['hello', 'world'], ['good', 'morning']]}])
>>> d.collapse('words').data
[{'words': [[['hello', 'world'], ['good', 'morning']]]}]
>>> d = Dataset([{'numbers': [1, 2, 3]}])
>>> d.collapse('numbers', separator=',').data
[{'numbers': ['1,2,3']}]

classmethod example(n: int = None) -> Dataset

Return an example dataset.

Examples:
>>> Dataset.example()
Dataset([{'a': [1, 2, 3, 4]}, {'b': [4, 3, 2, 1]}])
>>> Dataset.example(n=2)
Dataset([{'a': [1, 1]}, {'b': [2, 2]}])

expand(field: str, number_field: bool = False) -> Dataset

Expand a field containing lists into multiple rows.

Args:

field: The field containing lists to expand.

number_field: If True, adds a number field indicating the position in the original list.

Returns:

A new Dataset with the expanded rows

Examples:
>>> d = Dataset([{'a': [[1, 2, 3], [4, 5, 6]]}, {'b': ['x', 'y']}])
>>> d.expand('a').data
[{'a': [1, 2, 3, 4, 5, 6]}, {'b': ['x', 'x', 'x', 'y', 'y', 'y']}]
>>> d = Dataset([{'items': [['apple', 'banana'], ['orange']]}, {'id': [1, 2]}])
>>> d.expand('items', number_field=True).data
[{'items': ['apple', 'banana', 'orange']}, {'id': [1, 1, 2]}, {'items_number': [1, 2, 1]}]
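
The expansion shown in these examples can be sketched in plain Python as follows. This is a stand-in written for illustration, operating on bare lists of single-key dictionaries rather than on a Dataset, and it is not taken from the EDSL source; it only mirrors the documented outputs, including 1-based position numbers in the `<field>_number` column.

```python
def expand(cols, field, number_field=False):
    """Expand a list-valued column into one row per list element."""
    names = [next(iter(col)) for col in cols]
    data = {name: col[name] for col, name in zip(cols, names)}
    out = {name: [] for name in names}
    numbers = []
    for i, items in enumerate(data[field]):
        for position, item in enumerate(items, start=1):
            for name in names:
                # The expanded field takes the element; other fields repeat.
                out[name].append(item if name == field else data[name][i])
            numbers.append(position)
    result = [{name: out[name]} for name in names]
    if number_field:
        result.append({f"{field}_number": numbers})
    return result

expanded = expand([{"a": [[1, 2, 3], [4, 5, 6]]}, {"b": ["x", "y"]}], "a")
numbered = expand(
    [{"items": [["apple", "banana"], ["orange"]]}, {"id": [1, 2]}],
    "items",
    number_field=True,
)
print(expanded)
print(numbered)
```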

expand_field(field)

Expand a field in the dataset.

Renamed to avoid conflict with the expand method defined earlier.

filter(expression) -> Dataset

Filter the dataset based on a boolean expression.

Args:
expression: A string expression that evaluates to a boolean value. Can reference column names in the dataset.

Examples:
>>> d = Dataset([{'a': [1, 2, 3, 4]}, {'b': [5, 6, 7, 8]}])
>>> d.filter('a > 2').data
[{'a': [3, 4]}, {'b': [7, 8]}]
>>> d = Dataset([{'x': ['a', 'b', 'c']}, {'y': [1, 2, 3]}])
>>> d.filter('y < 3').data
[{'x': ['a', 'b']}, {'y': [1, 2]}]
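
Filtering column-oriented data means evaluating the condition once per row and then keeping the same row positions in every column. The sketch below shows that idea on plain lists of single-key dictionaries. It is an illustration only: it takes a Python callable in place of the string expression, and the name `filter_rows` is hypothetical.

```python
def filter_rows(cols, predicate):
    """Keep the rows for which predicate(row_dict) is true, in every column."""
    names = [next(iter(col)) for col in cols]
    values = [col[name] for col, name in zip(cols, names)]
    kept = [row for row in zip(*values) if predicate(dict(zip(names, row)))]
    # Rebuild one single-key dictionary per column from the surviving rows.
    return [{name: [row[i] for row in kept]} for i, name in enumerate(names)]

filtered = filter_rows(
    [{"a": [1, 2, 3, 4]}, {"b": [5, 6, 7, 8]}],
    lambda row: row["a"] > 2,
)
print(filtered)
```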

first() -> dict[str, Any]

Get the first value of the first key in the first dictionary.

Examples:
>>> d = Dataset([{'a': [1, 2, 3, 4]}, {'b': [5, 6, 7, 8]}])
>>> d.first()
1
>>> d = Dataset([{'x': ['first', 'second']}])
>>> d.first()
'first'

classmethod from_dict(data: dict) -> Dataset

Convert a dictionary to a dataset.

Examples:
>>> d = Dataset.from_dict({'data': [{'a': [1, 2, 3]}, {'b': [4, 5, 6]}]})
>>> isinstance(d, Dataset)
True
>>> d.data
[{'a': [1, 2, 3]}, {'b': [4, 5, 6]}]
>>> d = Dataset.from_dict({'data': [{'x': ['a', 'b']}]})
>>> d.data
[{'x': ['a', 'b']}]

classmethod from_edsl_object(object)

classmethod from_pandas_dataframe(df)

get_sort_indices(lst: list[Any], reverse: bool = False) -> list[int]

Return the indices that would sort the list, using either numpy or pure Python. None values are placed at the end of the sorted list.

Args:

lst: The list to be sorted.

reverse: Whether to sort in descending order.

use_numpy: Whether to use numpy implementation (falls back to pure Python if numpy is unavailable).

Returns:

A list of indices that would sort the list
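
A pure-Python version of this idea might look like the sketch below. It is not the library's code: the function name `sort_indices` is hypothetical, and placing None values last even when `reverse=True` is an assumption, since the text only states that None values go at the end.

```python
def sort_indices(values, reverse=False):
    """Return indices that sort `values`, with None entries placed last."""
    present = [i for i, v in enumerate(values) if v is not None]
    missing = [i for i, v in enumerate(values) if v is None]
    # list.sort is stable, so ties keep their original relative order.
    present.sort(key=lambda i: values[i], reverse=reverse)
    return present + missing

a = [3, None, 1, 4, None]
b = ["x", "y", "z", "w", "v"]
order = sort_indices(a)
sorted_a = [a[i] for i in order]
sorted_b = [b[i] for i in order]
print(order, sorted_a, sorted_b)
```

Computing the indices once and applying them to every column keeps rows aligned, which is why a sort on column-oriented data is naturally expressed this way.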

head(n: int = 5) Dataset[source]

Return the first n observations in the dataset.

>>> d = Dataset([{'a.b':[1,2,3,4]}])
>>> d.head(2)
Dataset([{'a.b': [1, 2]}])

keys() -> list[str]

Return the keys of the dataset.

Examples:
>>> d = Dataset([{'a': [1, 2, 3, 4]}, {'b': [5, 6, 7, 8]}])
>>> d.keys()
['a', 'b']
>>> d = Dataset([{'x.y': [1, 2]}, {'z.w': [3, 4]}])
>>> d.keys()
['x.y', 'z.w']

latex(**kwargs)

Return a LaTeX representation of the dataset.

Args:

**kwargs: Additional arguments to pass to the table formatter.

long(exclude_fields: list[str] = None) -> Dataset

Convert the dataset from wide to long format.

Examples:
>>> d = Dataset([{'a': [1, 2], 'b': [3, 4]}])
>>> d.long().data
[{'row': [0, 0, 1, 1]}, {'key': ['a', 'b', 'a', 'b']}, {'value': [1, 3, 2, 4]}]
>>> d = Dataset([{'x': [1, 2], 'y': [3, 4], 'z': [5, 6]}])
>>> d.long(exclude_fields=['z']).data
[{'row': [0, 0, 1, 1]}, {'key': ['x', 'y', 'x', 'y']}, {'value': [1, 3, 2, 4]}, {'z': [5, 5, 6, 6]}]
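
The wide-to-long reshaping in these examples can be sketched in plain Python as below. This is an illustration under stated assumptions, not EDSL's implementation: the function `to_long` is hypothetical and takes an ordinary `{column_name: values}` mapping, emitting one (row, key, value) triple per cell and repeating any excluded columns alongside.

```python
def to_long(columns, exclude_fields=()):
    """Reshape {name: values} into row/key/value columns (long format)."""
    keys = [k for k in columns if k not in exclude_fields]
    n = len(columns[keys[0]]) if keys else 0
    rows, names, values = [], [], []
    excluded = {e: [] for e in exclude_fields}
    for i in range(n):
        for k in keys:
            rows.append(i)
            names.append(k)
            values.append(columns[k][i])
            # Excluded fields are carried along, repeated once per key.
            for e in exclude_fields:
                excluded[e].append(columns[e][i])
    result = [{"row": rows}, {"key": names}, {"value": values}]
    result.extend({e: excluded[e]} for e in exclude_fields)
    return result

long_basic = to_long({"a": [1, 2], "b": [3, 4]})
long_excluding = to_long({"x": [1, 2], "y": [3, 4], "z": [5, 6]}, exclude_fields=["z"])
print(long_basic)
print(long_excluding)
```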

merge(other: Dataset, by_x, by_y) -> Dataset

Merge the dataset with another dataset on the given keys.

Examples:
>>> d1 = Dataset([{'key': [1, 2, 3]}, {'value1': ['a', 'b', 'c']}])
>>> d2 = Dataset([{'key': [2, 3, 4]}, {'value2': ['x', 'y', 'z']}])
>>> merged = d1.merge(d2, 'key', 'key')
>>> len(merged.data[0]['key'])
3
>>> d1 = Dataset([{'id': [1, 2]}, {'name': ['Alice', 'Bob']}])
>>> d2 = Dataset([{'id': [2, 3]}, {'age': [25, 30]}])
>>> merged = d1.merge(d2, 'id', 'id')
>>> len(merged.data[0]['id'])
2
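
The row counts in these examples (3 and 2, matching the left-hand dataset) are consistent with a left join, although the text does not state the join type. Under that assumption, a left join on plain `{column_name: values}` mappings could be sketched as follows. The function name `left_merge` and the use of None for unmatched rows are illustrative choices, not documented behavior.

```python
def left_merge(left, right, by_x, by_y):
    """Left-join two {column_name: values} mappings on left[by_x] == right[by_y]."""
    # Remember the first row position of each key in the right-hand table.
    lookup = {}
    for j, key in enumerate(right[by_y]):
        lookup.setdefault(key, j)
    merged = {name: list(values) for name, values in left.items()}
    for name, values in right.items():
        if name == by_y:
            continue  # the join key is already present from the left side
        merged[name] = [
            values[lookup[k]] if k in lookup else None for k in left[by_x]
        ]
    return merged

merged = left_merge(
    {"key": [1, 2, 3], "value1": ["a", "b", "c"]},
    {"key": [2, 3, 4], "value2": ["x", "y", "z"]},
    "key",
    "key",
)
print(merged)
```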

mutate(new_var_string: str, functions_dict: dict[str, Callable] | None = None) -> Dataset

Create new columns by applying functions to existing columns.

Args:
new_var_string: A string expression defining the new variable. Can reference existing column names.

functions_dict: Optional dictionary of custom functions to use in the expression.

Examples:
>>> d = Dataset([{'a': [1, 2, 3]}, {'b': [4, 5, 6]}])
>>> d.mutate('c = a + b').data
[{'a': [1, 2, 3]}, {'b': [4, 5, 6]}, {'c': [5, 7, 9]}]
>>> d = Dataset([{'x': [1, 2, 3]}])
>>> d.mutate('y = x * 2').data
[{'x': [1, 2, 3]}, {'y': [2, 4, 6]}]
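
One way to picture what such an expression does is to split it at the first `=` into a new column name and a formula, then evaluate the formula once per row with the row's values bound to the column names. The sketch below does this on plain lists of single-key dictionaries. It is an assumption-laden illustration: it relies on Python's built-in `eval`, which may differ from the evaluator EDSL actually uses, and should not be fed untrusted input.

```python
def mutate(cols, new_var_string, functions_dict=None):
    """Append a column computed row by row from 'name = expression'."""
    new_name, expression = (part.strip() for part in new_var_string.split("=", 1))
    names = [next(iter(col)) for col in cols]
    values = [col[name] for col, name in zip(cols, names)]
    helpers = dict(functions_dict or {})
    new_values = []
    for row in zip(*values):
        # Column values shadow helper functions that share the same name.
        scope = {**helpers, **dict(zip(names, row))}
        # eval keeps the sketch short; builtins are disabled for the expression.
        new_values.append(eval(expression, {"__builtins__": {}}, scope))
    return [*cols, {new_name: new_values}]

summed = mutate([{"a": [1, 2, 3]}, {"b": [4, 5, 6]}], "c = a + b")
doubled = mutate([{"x": [1, 2, 3]}], "y = double(x)", {"double": lambda v: v * 2})
print(summed)
print(doubled)
```

Splitting only at the first `=` leaves comparisons such as `==` inside the formula intact.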

order_by(sort_key: str, reverse: bool = False) -> Dataset

Return a new dataset with the observations sorted by the given key.

Args:

sort_key: The key to sort the observations by.

reverse: Whether to sort in reverse order.

Examples:
>>> d = Dataset([{'a': [3, 1, 4, 1, 5]}, {'b': ['x', 'y', 'z', 'w', 'v']}])
>>> sorted_d = d.order_by('a')
>>> sorted_d.data
[{'a': [1, 1, 3, 4, 5]}, {'b': ['y', 'w', 'x', 'z', 'v']}]
>>> d = Dataset([{'a': [3, 1, 4, 1, 5]}, {'b': ['x', 'y', 'z', 'w', 'v']}])
>>> sorted_d = d.order_by('a', reverse=True)
>>> sorted_d.data
[{'a': [5, 4, 3, 1, 1]}, {'b': ['v', 'z', 'x', 'y', 'w']}]
>>> d = Dataset([{'a': [3, None, 1, 4, None]}, {'b': ['x', 'y', 'z', 'w', 'v']}])
>>> sorted_d = d.order_by('a')
>>> sorted_d.data
[{'a': [1, 3, 4, None, None]}, {'b': ['z', 'x', 'w', 'y', 'v']}]

print(pretty_labels=None, **kwargs)

Print the dataset in a formatted way.

Args:

pretty_labels: A dictionary mapping column names to their display names.

**kwargs: Additional arguments.

format: The output format ("html", "markdown", "rich", "latex").

Returns:

TableDisplay object

Examples:
>>> d = Dataset([{'a': [1, 2, 3]}, {'b': [4, 5, 6]}])
>>> display = d.print(format='rich')
>>> display is not None
True
>>> d = Dataset([{'long_column_name': [1, 2]}])
>>> display = d.print(pretty_labels={'long_column_name': 'Short'})
>>> display is not None
True

remove_prefix() -> Dataset

Remove the prefix from column names that contain dots.

Examples:
>>> d = Dataset([{'a.b': [1, 2, 3]}, {'c.d': [4, 5, 6]}])
>>> d.remove_prefix().data
[{'b': [1, 2, 3]}, {'d': [4, 5, 6]}]
>>> d = Dataset([{'x.y.z': [1, 2]}, {'a.b.c': [3, 4]}])
>>> d.remove_prefix().data
[{'y': [1, 2]}, {'b': [3, 4]}]

rename(rename_dic) -> Dataset

Rename columns in the dataset according to the provided dictionary.

Examples:
>>> d = Dataset([{'a': [1, 2, 3]}, {'b': [4, 5, 6]}])
>>> d.rename({'a': 'x', 'b': 'y'}).data
[{'x': [1, 2, 3]}, {'y': [4, 5, 6]}]
>>> d = Dataset([{'old_name': [1, 2]}])
>>> d.rename({'old_name': 'new_name'}).data
[{'new_name': [1, 2]}]

sample(n: int = None, frac: float = None, with_replacement: bool = True, seed: str | int | float = None) -> Dataset

Return a new dataset with a sample of the observations.

Examples:
>>> d = Dataset([{'a': [1, 2, 3, 4, 5]}, {'b': [6, 7, 8, 9, 10]}])
>>> sampled = d.sample(n=3, seed=42)
>>> len(sampled.data[0]['a'])
3
>>> d = Dataset([{'x': ['a', 'b', 'c', 'd']}])
>>> sampled = d.sample(frac=0.5, seed=123)
>>> len(sampled.data[0]['x'])
2
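
Sampling observations from column-oriented data requires drawing row positions once and applying the same positions to every column, so that rows stay aligned. The sketch below illustrates this with the standard `random` module on plain lists of single-key dictionaries. The function name `sample_rows` and the truncating conversion of `frac` to a row count are assumptions, not documented EDSL behavior.

```python
import random

def sample_rows(cols, n=None, frac=None, with_replacement=True, seed=None):
    """Draw the same random row positions from every column."""
    names = [next(iter(col)) for col in cols]
    total = len(cols[0][names[0]]) if cols else 0
    if n is None:
        if frac is None:
            raise ValueError("provide either n or frac")
        n = int(total * frac)  # assumption: truncate fractional counts
    rng = random.Random(seed)  # a local generator makes the draw reproducible
    if with_replacement:
        picks = [rng.randrange(total) for _ in range(n)]
    else:
        picks = rng.sample(range(total), n)
    return [{name: [col[name][i] for i in picks]} for col, name in zip(cols, names)]

data = [{"a": [1, 2, 3, 4, 5]}, {"b": [6, 7, 8, 9, 10]}]
sampled = sample_rows(data, n=3, seed=42)
print(sampled)
```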

select(*keys) -> Dataset

Return a new dataset with only the selected keys.

Examples:
>>> d = Dataset([{'a': [1, 2, 3, 4]}, {'b': [5, 6, 7, 8]}, {'c': [9, 10, 11, 12]}])
>>> d.select('a', 'c').data
[{'a': [1, 2, 3, 4]}, {'c': [9, 10, 11, 12]}]
>>> d = Dataset([{'x': [1, 2]}, {'y': [3, 4]}])
>>> d.select('x').data
[{'x': [1, 2]}]

shuffle(seed=None) -> Dataset

Return a new dataset with the observations shuffled.

Examples:
>>> d = Dataset([{'a': [1, 2, 3, 4]}, {'b': [5, 6, 7, 8]}])
>>> shuffled = d.shuffle(seed=42)
>>> len(shuffled.data[0]['a']) == len(d.data[0]['a'])
True
>>> d = Dataset([{'x': ['a', 'b', 'c']}])
>>> shuffled = d.shuffle(seed=123)
>>> set(shuffled.data[0]['x']) == set(d.data[0]['x'])
True

summary() -> Dataset

Return a summary of the dataset.

Examples:
>>> d = Dataset([{'a': [1, 2, 3]}, {'b': [4, 5, 6]}])
>>> d.summary().data
[{'num_observations': [3]}, {'keys': [['a', 'b']]}]

table(*fields, tablefmt: str | None = 'rich', max_rows: int | None = None, pretty_labels=None, print_parameters: dict | None = None)

tail(n: int = 5) -> Dataset

Return the last n observations in the dataset.

>>> d = Dataset([{'a.b':[1,2,3,4]}])
>>> d.tail(2)
Dataset([{'a.b': [3, 4]}])

to(survey_or_question: 'Survey' | 'QuestionBase') -> Job

Transform the dataset using a survey or question.

Args:

survey_or_question: Either a Survey or QuestionBase object to apply to the dataset.

Examples:
>>> from edsl import QuestionFreeText
>>> from edsl.jobs import Jobs
>>> d = Dataset([{'name': ['Alice', 'Bob']}])
>>> q = QuestionFreeText(question_text="How are you, {{ name }}?", question_name="how_feeling")
>>> job = d.to(q)
>>> isinstance(job, Jobs)
True

to_dict() -> dict

Convert the dataset to a dictionary.

Examples:
>>> d = Dataset([{'a': [1, 2, 3]}, {'b': [4, 5, 6]}])
>>> d.to_dict()
{'data': [{'a': [1, 2, 3]}, {'b': [4, 5, 6]}]}
>>> d = Dataset([{'x': ['a', 'b']}])
>>> d.to_dict()
{'data': [{'x': ['a', 'b']}]}

to_docx(output_file: str, title: str = None) -> None

Convert the dataset to a Word document.

Args:

output_file (str): Path to save the Word document.

title (str, optional): Title for the document.

Examples:
>>> import tempfile
>>> d = Dataset([{'a': [1, 2, 3]}, {'b': [4, 5, 6]}])
>>> with tempfile.NamedTemporaryFile(suffix='.docx') as tmp:
...     d.to_docx(tmp.name, title='Test Document')
...     import os
...     os.path.exists(tmp.name)
True

to_json()

Return a JSON representation of the dataset.

Examples:
>>> d = Dataset([{'a': [1, 2, 3]}, {'b': [4, 5, 6]}])
>>> d.to_json()
[{'a': [1, 2, 3]}, {'b': [4, 5, 6]}]
>>> d = Dataset([{'x': ['a', 'b']}])
>>> d.to_json()
[{'x': ['a', 'b']}]

tree(node_order: list[str] | None = None) -> Tree

Return a tree representation of the dataset.

>>> d = Dataset([{'a':[1,2,3,4]}, {'b':[4,3,2,1]}])
>>> d.tree()
Tree(Dataset({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]}), node_order=['a', 'b'])

wide() -> Dataset

Convert a long-format dataset (with row, key, value columns) to wide format.

Examples:
>>> d = Dataset([{'row': [0, 0, 1, 1]}, {'key': ['a', 'b', 'a', 'b']}, {'value': [1, 3, 2, 4]}])
>>> d.wide().data
[{'a': [1, 2]}, {'b': [3, 4]}]
>>> d = Dataset([{'row': [0, 0, 1, 1]}, {'key': ['x', 'y', 'x', 'y']}, {'value': [1, 3, 2, 4]}, {'z': [5, 5, 6, 6]}])
>>> d.wide().data
[{'x': [1, 2]}, {'y': [3, 4]}, {'z': [5, 6]}]
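
The reverse reshaping can be sketched as a pivot over the row, key, and value lists. The version below is an illustration in plain Python, not EDSL's implementation: the function `to_wide` is hypothetical, handles only the three core columns (extra columns such as `z` above are left out), and fills cells with None when a row has no value for a key.

```python
def to_wide(rows, keys, values):
    """Pivot parallel row/key/value lists into one column per distinct key."""
    row_ids = sorted(set(rows))
    position = {r: i for i, r in enumerate(row_ids)}
    wide = {}
    for r, k, v in zip(rows, keys, values):
        # Create the column on first sight of a key, pre-filled with None.
        column = wide.setdefault(k, [None] * len(row_ids))
        column[position[r]] = v
    return [{k: column} for k, column in wide.items()]

wide_cols = to_wide([0, 0, 1, 1], ["a", "b", "a", "b"], [1, 3, 2, 4])
print(wide_cols)
```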

write(filename: str, tablefmt: str | None = None) -> None

Write the dataset to a file in the specified format.

Args:

filename: The name of the file to write to.

tablefmt: Optional format for the table (e.g., 'csv', 'html', 'latex').