Dataset
The Dataset class is a versatile data container for tabular data with powerful manipulation capabilities. It represents data in a column-oriented format and provides methods for analysis, transformation, and visualization.
Overview
Dataset is a fundamental data structure in EDSL that provides a column-oriented representation of tabular data. It offers methods for manipulating, analyzing, visualizing, and exporting data, similar to tools like pandas or dplyr.
Key features:
Flexible data manipulation (filtering, sorting, transformation)
Visualization capabilities with multiple rendering options
Export to various formats (CSV, Excel, Pandas, etc.)
Integration with other EDSL components
Creating Datasets
Datasets can be created from various sources:
From dictionaries:
from edsl import Dataset
# Create a dataset with two columns
d = Dataset([{'a': [1, 2, 3]}, {'b': [4, 5, 6]}])
From existing EDSL objects:
# From Results object
dataset = results.select('how_feeling', 'agent.status')
# From pandas DataFrame
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
dataset = Dataset.from_pandas_dataframe(df)
Displaying and Visualizing Data
The Dataset class provides multiple ways to display and visualize data:
Basic display:
# Print the dataset
dataset.print()
# Display only the first few rows
dataset.head()
# Display only the last few rows
dataset.tail()
Table Display Options
You can control the table formatting using the tablefmt parameter:
# Display as an ASCII grid
dataset.table(tablefmt="grid")
# Display as pipe-separated values
dataset.table(tablefmt="pipe")
# Display as HTML
dataset.table(tablefmt="html")
# Display as Markdown
dataset.table(tablefmt="github")
# Display as LaTeX
dataset.table(tablefmt="latex")
Rich Terminal Output
The Dataset class supports displaying tables with enhanced formatting using the Rich library, which provides beautiful terminal formatting with colors, styles, and more:
# Display using Rich formatting
dataset.table(tablefmt="rich")
# Alternative syntax
dataset.print(format="rich")
This creates a nicely formatted table in the terminal with automatically sized columns, bold headers, and grid lines.
Example:
from edsl import Dataset
# Create a dataset
d = Dataset([
{'name': ['Alice', 'Bob', 'Charlie', 'David']},
{'age': [25, 32, 45, 19]},
{'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
])
# Display with rich formatting
d.table(tablefmt="rich")
Data Manipulation
The Dataset class provides numerous methods for data manipulation:
Filtering:
# Filter results for a specific condition
filtered = dataset.filter("how_feeling == 'Great'")
Creating new columns:
# Create a new column
with_sentiment = dataset.mutate("sentiment = 1 if how_feeling == 'Great' else 0")
Sorting:
# Sort by a specific column
sorted_data = dataset.order_by("age", reverse=True)
Reshaping:
# Convert to long format
long_data = dataset.long()
# Convert back to wide format
wide_data = long_data.wide()
Exporting Data
Export to various formats:
# Export to CSV
dataset.to_csv("data.csv")
# Export to pandas DataFrame
df = dataset.to_pandas()
# Export to Word document
dataset.to_docx("data.docx", title="My Dataset")
Dataset Methods
- class edsl.dataset.Dataset(data: list[dict[str, Any]] = None, print_parameters: dict | None = None)[source]
Bases: UserList, DatasetOperationsMixin, PersistenceMixin, HashingMixin
A versatile data container for tabular data with powerful manipulation capabilities.
The Dataset class is a fundamental data structure in EDSL that represents tabular data in a column-oriented format. It provides a rich set of methods for data manipulation, transformation, analysis, visualization, and export through the DatasetOperationsMixin.
Key features:
Column-oriented data structure optimized for LLM experiment results
Rich data manipulation API similar to dplyr/pandas (filter, select, mutate, etc.)
Visualization capabilities including tables, plots, and reports
Export to various formats (CSV, Excel, SQLite, pandas, etc.)
Serialization for storage and transport
Tree-based data exploration
A Dataset typically contains multiple columns, each represented as a dictionary with a single key-value pair. The key is the column name and the value is a list of values for that column. All columns must have the same length.
The Dataset class inherits from:
UserList: Provides list-like behavior for storing column data
DatasetOperationsMixin: Provides data manipulation methods
PersistenceMixin: Provides serialization capabilities
HashingMixin: Provides hashing functionality for comparison and storage
Datasets are typically created by transforming other EDSL container types like Results, AgentList, or ScenarioList, but can also be created directly from data.
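A minimal sketch of this column-oriented structure (the column names and values are illustrative):
from edsl import Dataset
# Each column is a single-key dictionary; all columns must have the same length
d = Dataset([{'fruit': ['apple', 'banana', 'cherry']}, {'price': [1.2, 0.5, 3.0]}])
d.keys()   # ['fruit', 'price']
d.first()  # 'apple': the first value of the first column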
- classmethod example(n: int = None)[source]
Return an example dataset.
>>> Dataset.example()
Dataset([{'a': [1, 2, 3, 4]}, {'b': [4, 3, 2, 1]}])
- first() dict[str, Any] [source]
Get the first value of the first key in the first dictionary.
>>> d = Dataset([{'a.b':[1,2,3,4]}])
>>> d.first()
1
- head(n: int = 5) Dataset [source]
Return the first n observations in the dataset.
>>> d = Dataset([{'a.b':[1,2,3,4]}])
>>> d.head(2)
Dataset([{'a.b': [1, 2]}])
- keys() list[str] [source]
Return the keys of the dataset.
>>> d = Dataset([{'a.b':[1,2,3,4]}])
>>> d.keys()
['a.b']
>>> d = Dataset([{'a.b':[1,2,3,4]}, {'c.d':[5,6,7,8]}])
>>> d.keys()
['a.b', 'c.d']
- merge(other: Dataset, by_x, by_y) Dataset [source]
Merge the dataset with another dataset on the given keys.
merged_df = df1.merge(df2, how="left", on=["key1", "key2"])
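A hedged usage sketch, assuming the merge joins on the given keys in the manner of the pandas call above (the column names 'id', 'user_id', 'score', and 'name' are illustrative):
from edsl import Dataset
d1 = Dataset([{'id': [1, 2, 3]}, {'score': [10, 20, 30]}])
d2 = Dataset([{'user_id': [1, 2, 3]}, {'name': ['Alice', 'Bob', 'Carol']}])
# Join d1 to d2 where d1's 'id' matches d2's 'user_id'
merged = d1.merge(d2, by_x='id', by_y='user_id')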
- order_by(sort_key: str, reverse: bool = False) Dataset [source]
Return a new dataset with the observations sorted by the given key.
- Parameters:
sort_key – The key to sort the observations by.
reverse – Whether to sort in reverse order.
>>> d = Dataset([{'a':[1,2,3,4]}, {'b':[4,3,2,1]}])
>>> d.order_by('a')
Dataset([{'a': [1, 2, 3, 4]}, {'b': [4, 3, 2, 1]}])
>>> d.order_by('a', reverse=True)
Dataset([{'a': [4, 3, 2, 1]}, {'b': [1, 2, 3, 4]}])
>>> d = Dataset([{'X.a':[1,2,3,4]}, {'X.b':[4,3,2,1]}])
>>> d.order_by('a')
Dataset([{'X.a': [1, 2, 3, 4]}, {'X.b': [4, 3, 2, 1]}])
- print(pretty_labels=None, **kwargs)[source]
Print the dataset in a formatted way.
- Args:
pretty_labels: A dictionary mapping column names to their display names
**kwargs: Additional arguments
format: The output format ("html", "markdown", "rich", "latex")
- Returns:
TableDisplay object
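A usage sketch based on the arguments above (the column name being relabeled is illustrative):
# Print with a relabeled column header using rich formatting
dataset.print(pretty_labels={'answer.how_feeling': 'Feeling'}, format="rich")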
- remove_prefix() Dataset [source]
Returns a new Dataset with the prefix removed from all column names.
The prefix is defined as everything before the first dot (.) in the column name. If removing prefixes would result in duplicate column names, an exception is raised.
- Returns:
Dataset: A new Dataset with prefixes removed from column names
- Raises:
ValueError: If removing prefixes would result in duplicate column names
- Examples:
>>> from edsl.results import Results
>>> r = Results.example()
>>> r.select('how_feeling', 'how_feeling_yesterday').relevant_columns()
['answer.how_feeling', 'answer.how_feeling_yesterday']
>>> r.select('how_feeling', 'how_feeling_yesterday').remove_prefix().relevant_columns()
['how_feeling', 'how_feeling_yesterday']
>>> from edsl.dataset import Dataset
>>> d = Dataset([{'a.x': [1, 2, 3]}, {'b.x': [4, 5, 6]}])
>>> d.remove_prefix()
Traceback (most recent call last):
...
ValueError: Removing prefixes would result in duplicate column names: ['x']
- sample(n: int = None, frac: float = None, with_replacement: bool = True, seed: str | int | float = None) Dataset [source]
Return a new dataset with a sample of the observations.
- Parameters:
n – The number of samples to take.
frac – The fraction of samples to take.
with_replacement – Whether to sample with replacement.
seed – The seed for the random number generator.
>>> d = Dataset([{'a.b':[1,2,3,4]}])
>>> d.sample(n=2, seed=0, with_replacement=True)
Dataset([{'a.b': [4, 4]}])
>>> d.sample(n=10, seed=0, with_replacement=False)
Traceback (most recent call last):
...
ValueError: Sample size cannot be greater than the number of available elements when sampling without replacement.
- select(*keys) Dataset [source]
Return a new dataset with only the selected keys.
- Parameters:
keys – The keys to select.
>>> d = Dataset([{'a.b':[1,2,3,4]}, {'c.d':[5,6,7,8]}])
>>> d.select('a.b')
Dataset([{'a.b': [1, 2, 3, 4]}])
>>> d.select('a.b', 'c.d')
Dataset([{'a.b': [1, 2, 3, 4]}, {'c.d': [5, 6, 7, 8]}])
- shuffle(seed=None) Dataset [source]
Return a new dataset with the observations shuffled.
>>> d = Dataset([{'a.b':[1,2,3,4]}])
>>> d.shuffle(seed=0)
Dataset([{'a.b': [3, 1, 2, 4]}])
- table(*fields, tablefmt: str | None = 'rich', max_rows: int | None = None, pretty_labels=None, print_parameters: dict | None = None)[source]
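A usage sketch based on the signature above (the field names are illustrative):
# Show only two columns as a Markdown-style table, limited to 10 rows
dataset.table('name', 'age', tablefmt="github", max_rows=10)
# Relabel a column header in the rendered table
dataset.table('name', pretty_labels={'name': 'Participant'})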
- tail(n: int = 5) Dataset [source]
Return the last n observations in the dataset.
>>> d = Dataset([{'a.b':[1,2,3,4]}])
>>> d.tail(2)
Dataset([{'a.b': [3, 4]}])
- to(survey_or_question: 'Survey' | 'QuestionBase') Jobs [source]
Return a Jobs object that uses the dataset's observations as inputs to the given survey or question.
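A sketch of how this might be used, assuming the dataset's column names become scenario fields that the question text can reference (the question and the 'city' column are illustrative):
from edsl import QuestionFreeText
# Assumed: each dataset row supplies a 'city' value for the {{ city }} placeholder
q = QuestionFreeText(
    question_name="describe_city",
    question_text="Describe {{ city }} in one sentence."
)
jobs = dataset.to(q)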
- to_docx(output_file: str, title: str = None) None [source]
Convert the dataset to a Word document.
- Args:
output_file (str): Path to save the Word document title (str, optional): Title for the document
- to_json()[source]
Return a JSON representation of the dataset.
>>> d = Dataset([{'a.b':[1,2,3,4]}])
>>> d.to_json()
[{'a.b': [1, 2, 3, 4]}]
- tree(node_order: list[str] | None = None) Tree [source]
Return a tree representation of the dataset.
>>> d = Dataset([{'a':[1,2,3,4]}, {'b':[4,3,2,1]}])
>>> d.tree()
Tree(Dataset({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]}), node_order=['a', 'b'])
- wide() Dataset [source]
Convert a long-format dataset (with row, key, value columns) to wide format.
Expected input format: a dataset with three columns:
row: list of row indices
key: list of column names
value: list of values
- Returns:
Dataset: A new dataset with columns corresponding to unique keys
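A sketch of the long-to-wide round trip, assuming long() produces the row/key/value columns described above:
from edsl import Dataset
d = Dataset([{'a': [1, 2]}, {'b': [3, 4]}])
long_d = d.long()      # columns: row, key, value
wide_d = long_d.wide() # back to columns 'a' and 'b'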