Overview
Dataset is a fundamental data structure in EDSL that provides a column-oriented representation of tabular data. It offers methods for manipulating, analyzing, visualizing, and exporting data, similar to tools like pandas or dplyr.
Key features:
- Flexible data manipulation (filtering, sorting, transformation)
- Visualization capabilities with multiple rendering options
- Export to various formats (CSV, Excel, Pandas, etc.)
- Integration with other EDSL components
Creating Datasets
Datasets can be created from various sources:
From dictionaries:
Displaying and Visualizing Data
The Dataset class provides multiple ways to display and visualize data:
Basic display:
Table Display Options
You can control the table formatting using the tablefmt parameter:
Rich Terminal Output
The Dataset class supports displaying tables with enhanced formatting using the Rich library, which adds colors and styles to terminal output:
Data Manipulation
The Dataset class provides numerous methods for data manipulation:
Filtering:
Exporting Data
Export to various formats:
Dataset Methods
class edsl.dataset.Dataset(data: list[dict[str, Any]] = None, print_parameters: dict | None = None)
Bases: UserList, DatasetOperationsMixin, PersistenceMixin, HashingMixin
A versatile data container for tabular data with powerful manipulation capabilities.
The Dataset class is a fundamental data structure in EDSL that represents tabular data in a column-oriented format. It provides a rich set of methods for data manipulation, transformation, analysis, visualization, and export through the DatasetOperationsMixin.
Key features:
- Column-oriented data structure optimized for LLM experiment results
- Rich data manipulation API similar to dplyr/pandas (filter, select, mutate, etc.)
- Visualization capabilities including tables, plots, and reports
- Export to various formats (CSV, Excel, SQLite, pandas, etc.)
- Serialization for storage and transport
- Tree-based data exploration
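To make the column-oriented layout concrete, the sketch below uses plain Python rather than EDSL itself. It assumes that each column is stored as a single-key dictionary mapping the column name to a list of values, which is consistent with the `list[dict[str, Any]]` constructor argument. The helper names `to_rows` and `to_columns` are hypothetical and are not part of the library.

```python
from typing import Any

# Assumed layout: one single-key dict per column, each holding a list of values.
columns: list[dict[str, list[Any]]] = [
    {"model": ["a", "b", "c"]},
    {"score": [0.9, 0.7, 0.8]},
]


def to_rows(cols: list[dict[str, list[Any]]]) -> list[dict[str, Any]]:
    """Convert column-oriented data into a list of row dictionaries."""
    names = [next(iter(col)) for col in cols]
    values = [col[name] for col, name in zip(cols, names)]
    return [dict(zip(names, row)) for row in zip(*values)]


def to_columns(rows: list[dict[str, Any]]) -> list[dict[str, list[Any]]]:
    """Convert row dictionaries back into the column-oriented layout."""
    if not rows:
        return []
    return [{name: [row[name] for row in rows]} for name in rows[0]]


rows = to_rows(columns)
print(rows)
```

Storing whole columns together makes per-column operations (selecting, renaming, dropping) cheap, while row-wise operations such as filtering need the kind of transposition shown in `to_rows`.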
collapse(field: str, separator: str | None = None) → Dataset
Collapse multiple values in a field into a single value using a separator.
Args:
- field: The name of the field to collapse.
- separator: Optional string to use as a separator between values. Defaults to a space if not specified.
Examples:
drop(field_name)
Returns a new Dataset with the specified field removed.
Args:
- field_name (str): The name of the field to remove.
Returns:
- Dataset: A new Dataset instance without the specified field.
Raises:
- KeyError: If the field_name doesn’t exist in the dataset.
Examples:
classmethod example(n: int = None) → Dataset
Return an example dataset.
Examples:
expand(field: str, number_field: bool = False) → Dataset
Expand a field containing lists into multiple rows.
Args:
- field: The field containing lists to expand.
- number_field: If True, adds a number field indicating the position in the original list.
Returns:
- A new Dataset with the expanded rows.
Examples:
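As an illustration of the expansion described above, here is a plain-Python sketch that does not use EDSL. It assumes the single-key-dict-per-column layout, that values in the other columns are repeated for each expanded item, and that the position column is named `<field>_number` and counts from 1. Those three details are assumptions made for the sketch, not documented behavior.

```python
from typing import Any


def expand(cols: list[dict[str, list[Any]]], field: str,
           number_field: bool = False) -> list[dict[str, list[Any]]]:
    """Turn each list item in `field` into its own row."""
    table = {name: values for col in cols for name, values in col.items()}
    if field not in table:
        raise KeyError(field)

    out: dict[str, list[Any]] = {name: [] for name in table}
    number_name = f"{field}_number"  # hypothetical column name
    if number_field:
        out[number_name] = []

    for i, items in enumerate(table[field]):
        for position, item in enumerate(items, start=1):
            for name in table:
                # The expanded field gets the item; other fields repeat their value.
                out[name].append(item if name == field else table[name][i])
            if number_field:
                out[number_name].append(position)

    return [{name: values} for name, values in out.items()]


data = [{"a": [[1, 2], [3]]}, {"b": ["x", "y"]}]
expanded = expand(data, "a", number_field=True)
print(expanded)
```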
expand_field(field)
Expand a field in the dataset. Renamed to avoid conflict with the expand method defined earlier.
filter(expression) → Dataset
Filter the dataset based on a boolean expression.
Args:
- expression: A string expression that evaluates to a boolean value. Can reference column names in the dataset.
Examples:
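The idea of evaluating a string expression once per row can be sketched in plain Python as follows. This is not EDSL's implementation: the function name `filter_rows` is hypothetical, the single-key-dict-per-column layout is assumed, and the built-in `eval` is used only for brevity. `eval` is unsafe on untrusted input, so a real implementation would likely use a restricted expression evaluator.

```python
from typing import Any


def filter_rows(cols: list[dict[str, list[Any]]],
                expression: str) -> list[dict[str, list[Any]]]:
    """Keep the rows for which `expression` evaluates to a truthy value."""
    names = [next(iter(col)) for col in cols]
    values = [col[name] for col, name in zip(cols, names)]
    code = compile(expression, "<filter>", "eval")

    kept = []
    for row in zip(*values):
        # Column names become local variables for this row; builtins are disabled.
        if eval(code, {"__builtins__": {}}, dict(zip(names, row))):
            kept.append(row)

    return [{name: [row[i] for row in kept]} for i, name in enumerate(names)]


data = [{"a": [1, 2, 3]}, {"b": ["x", "y", "z"]}]
filtered = filter_rows(data, "a > 1 and b != 'z'")
print(filtered)
```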
first() → dict[str, Any]
Get the first value of the first key in the first dictionary.
Examples:
classmethod from_dict(data: dict) → Dataset
Convert a dictionary to a dataset.
Examples:
classmethod from_edsl_object(object)
classmethod from_pandas_dataframe(df)
get_sort_indices(lst: list[Any], reverse: bool = False) → list[int]
Return the indices that would sort the list, using either numpy or pure Python. None values are placed at the end of the sorted list.
Args:
- lst: The list to be sorted.
- reverse: Whether to sort in descending order.
- use_numpy: Whether to use numpy implementation (falls back to pure Python if numpy is unavailable).
Returns:
- A list of indices that would sort the list.
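A pure-Python sketch of this index-sorting behavior is shown below. It is written independently of EDSL and assumes that None values stay at the end regardless of the `reverse` flag, which the description above does not state explicitly.

```python
from typing import Any


def get_sort_indices(lst: list[Any], reverse: bool = False) -> list[int]:
    """Return indices that sort `lst`, with None values placed last."""
    present = [i for i, value in enumerate(lst) if value is not None]
    missing = [i for i, value in enumerate(lst) if value is None]
    # Sort only the non-None positions, then append the None positions.
    present.sort(key=lambda i: lst[i], reverse=reverse)
    return present + missing


values = [3, None, 1, 2]
ascending = get_sort_indices(values)
descending = get_sort_indices(values, reverse=True)
print(ascending, descending)
```

Applying the same index list to every column keeps rows aligned, which is one way a sort such as order_by can be built on a column-oriented structure.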
head(n: int = 5) → Dataset
Return the first n observations in the dataset.
keys() → list[str]
Return the keys of the dataset.
Examples:
latex(**kwargs)
Return a LaTeX representation of the dataset.
Args:
- **kwargs: Additional arguments to pass to the table formatter.
long(exclude_fields: list[str] = None) → Dataset
Convert the dataset from wide to long format.
Examples:
merge(other: Dataset, by_x, by_y) → Dataset
Merge the dataset with another dataset on the given keys.
Examples:
order_by(sort_key: str, reverse: bool = False) → Dataset
Return a new dataset with the observations sorted by the given key.
Args:
- sort_key: The key to sort the observations by.
- reverse: Whether to sort in reverse order.
Examples:
print(pretty_labels=None, **kwargs)
Print the dataset in a formatted way.
Args:
- pretty_labels: A dictionary mapping column names to their display names.
- **kwargs: Additional arguments.
- format: The output format ("html", "markdown", "rich", "latex").
Returns:
- TableDisplay object
Examples:
remove_prefix() → Dataset
Remove the prefix from column names that contain dots.
Examples:
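The renaming can be sketched in plain Python as below. This is an illustration rather than EDSL's code: it assumes the single-key-dict-per-column layout and that only the text up to the first dot is treated as the prefix. The column names used here are made up for the example.

```python
from typing import Any


def remove_prefix(cols: list[dict[str, list[Any]]]) -> list[dict[str, list[Any]]]:
    """Strip everything up to and including the first dot from each column name."""
    result = []
    for col in cols:
        name, values = next(iter(col.items()))
        # Names without a dot are returned unchanged.
        result.append({name.split(".", 1)[-1]: values})
    return result


data = [{"answer.mood": ["OK", "Good"]}, {"agent.status": ["Joyful", "Sad"]}, {"id": [1, 2]}]
stripped = remove_prefix(data)
print(stripped)
```

Note that if two columns share the same name after stripping, the result contains duplicate column names; how the library handles that case is not covered by this sketch.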
rename(rename_dic) → Dataset
Rename columns in the dataset according to the provided dictionary.
Examples:
sample(n: int = None, frac: float = None, with_replacement: bool = True, seed: str | int | float = None) → Dataset
Return a new dataset with a sample of the observations.
Examples:
select(*keys) → Dataset
Return a new dataset with only the selected keys.
Examples:
shuffle(seed=None) → Dataset
Return a new dataset with the observations shuffled.
Examples:
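To show how a seeded shuffle keeps rows intact in a column-oriented structure, here is a plain-Python sketch. It is not the library's implementation. It assumes the single-key-dict-per-column layout and uses the standard library's `random.Random` so that the same seed reproduces the same order.

```python
import random
from typing import Any


def shuffle(cols: list[dict[str, list[Any]]], seed=None) -> list[dict[str, list[Any]]]:
    """Shuffle rows by permuting every column with the same index order."""
    names = [next(iter(col)) for col in cols]
    values = [col[name] for col, name in zip(cols, names)]
    n_rows = len(values[0]) if values else 0

    order = list(range(n_rows))
    random.Random(seed).shuffle(order)  # local generator; global state is untouched

    return [{name: [column[i] for i in order]}
            for name, column in zip(names, values)]


data = [{"a": [1, 2, 3, 4]}, {"b": ["w", "x", "y", "z"]}]
shuffled = shuffle(data, seed=42)
print(shuffled)
```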
summary() → Dataset
Return a summary of the dataset.
Examples:
table(*fields, tablefmt: str | None = 'rich', max_rows: int | None = None, pretty_labels=None, print_parameters: dict | None = None)
tail(n: int = 5) → Dataset
Return the last n observations in the dataset.
to(survey_or_question: 'Survey' | 'QuestionBase') → Job
Transform the dataset using a survey or question.
Args:
- survey_or_question: Either a Survey or QuestionBase object to apply to the dataset.
Examples:
to_dict() → dict
Convert the dataset to a dictionary.
Examples:
to_docx(output_file: str, title: str = None) → None
Convert the dataset to a Word document.
Args:
- output_file (str): Path to save the Word document.
- title (str, optional): Title for the document.
Examples:
to_json()
Return a JSON representation of the dataset.
Examples:
tree(node_order: list[str] | None = None) → Tree
Return a tree representation of the dataset.
unique() → Dataset
Remove duplicate rows from the dataset.
Returns:
- A new Dataset with duplicate rows removed.
Examples:
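Row-level de-duplication on column-oriented data can be sketched as follows in plain Python. The sketch assumes the single-key-dict-per-column layout and that the first occurrence of each row is kept in its original order. It compares rows by their `repr` so that unhashable values such as lists are handled; that choice belongs to the sketch and is not a statement about the library.

```python
from typing import Any


def unique(cols: list[dict[str, list[Any]]]) -> list[dict[str, list[Any]]]:
    """Drop repeated rows, keeping the first occurrence of each."""
    names = [next(iter(col)) for col in cols]
    values = [col[name] for col, name in zip(cols, names)]

    seen: set[str] = set()
    kept = []
    for row in zip(*values):
        fingerprint = repr(row)  # works even when a cell holds a list or dict
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(row)

    return [{name: [row[i] for row in kept]} for i, name in enumerate(names)]


data = [{"a": [1, 1, 2, 1]}, {"b": ["x", "x", "y", "z"]}]
deduplicated = unique(data)
print(deduplicated)
```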
wide() → Dataset
Convert a long-format dataset (with row, key, value columns) to wide format.
Examples:
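The relationship between the wide layout and the long layout with row, key, and value columns can be sketched in plain Python. The functions `to_long` and `to_wide` are hypothetical stand-ins written for this illustration. They assume the single-key-dict-per-column layout and zero-based row numbers, neither of which is confirmed by the descriptions above.

```python
from typing import Any


def to_long(cols: list[dict[str, list[Any]]]) -> list[dict[str, list[Any]]]:
    """Emit one (row, key, value) triple per cell of a wide dataset."""
    table = {name: values for col in cols for name, values in col.items()}
    n_rows = len(next(iter(table.values()), []))
    rows, keys, values = [], [], []
    for i in range(n_rows):
        for name, column in table.items():
            rows.append(i)
            keys.append(name)
            values.append(column[i])
    return [{"row": rows}, {"key": keys}, {"value": values}]


def to_wide(long_cols: list[dict[str, list[Any]]]) -> list[dict[str, list[Any]]]:
    """Rebuild one column per distinct key from (row, key, value) triples."""
    table = {name: values for col in long_cols for name, values in col.items()}
    n_rows = max(table["row"]) + 1 if table["row"] else 0
    wide: dict[str, list[Any]] = {}
    for row, key, value in zip(table["row"], table["key"], table["value"]):
        # Cells missing from the long data stay as None.
        wide.setdefault(key, [None] * n_rows)[row] = value
    return [{key: column} for key, column in wide.items()]


data = [{"a": [1, 2]}, {"b": [3, 4]}]
long_form = to_long(data)
print(long_form)
print(to_wide(long_form))
```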
write(filename: str, tablefmt: str | None = None) → None
Write the dataset to a file in the specified format.
Args:
- filename: The name of the file to write to.
- tablefmt: Optional format for the table (e.g., 'csv', 'html', 'latex').