Datasets

Auto csv

Based on Pandas

Loads a CSV file that contains numeric, categorical, and predictive data columns. This automatically detects the characteristics of the dataset being loaded, namely the delimiter that separates the columns, and whether each column contains numeric or categorical data. A pandas CSV reader is employed internally. The last categorical column is used as the dataset label. To load the file using different options (e.g., a subset of columns, a different label column) use the custom csv loader instead.

If you want to train a model while using the same loading mechanism as this dataset, run the following Python script. This uses supporting methods from the lightweight mammoth-commons core to retrieve numpy arrays X,y of dataset features and of categorical labels respectively.

% pip install --upgrade pandas
% pip install --upgrade mammoth_commons
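import numpy as np
import pandas as pd
from mammoth_commons.externals import pd_read_csv
from mammoth_commons.datasets import CSV

# set parameters and load data (modify max_discrete as needed)
path = ...
max_discrete = 10
df = pd_read_csv(path, on_bad_lines="skip")

# identify numeric and categorical columns
# (numeric columns with at most max_discrete distinct values count as categorical)
num = [col for col in df if pd.api.types.is_any_real_numeric_dtype(df[col])]
num = [col for col in num if len(set(df[col])) > max_discrete]
num_set = set(num)
cat = [col for col in df if col not in num_set]

# convert to numpy data; the last categorical column is the label
csv_dataset = CSV(df, num=num, cat=cat[:-1], labels=cat[-1])
# assumption: to_features([]) returns the feature matrix with no columns excluded;
# check this against your installed mammoth_commons version
X = csv_dataset.to_features([]).astype(np.float32)
y = df[cat[-1]].to_numpy()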
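
The script above leaves delimiter detection to pd_read_csv. As a rough illustration of how such detection can work, the sketch below applies Python's standard csv.Sniffer to a small in-memory sample. This is a stand-in under stated assumptions and not necessarily what mammoth-commons does internally; guess_delimiter and its list of candidate delimiters are hypothetical names chosen for this example.

```python
import csv


def guess_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the column delimiter of a CSV text sample (hypothetical helper)."""
    # Sniffer looks for a candidate character that occurs consistently on each line
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# a tiny semicolon-separated sample standing in for the first lines of a file
sample = "age;city;income\n34;Athens;41000\n29;Berlin;38000\n"
delimiter = guess_delimiter(sample)
```

In practice the sample would be the first few kilobytes of the file; detection can fail on very short or irregular samples, in which case csv.Sniffer raises csv.Error.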

Parameters

Uci

Based on UCI

Loads a dataset from the UCI Machine Learning Repository (archive.ics.uci.edu) containing numeric, categorical, and predictive data columns. The dataset is automatically downloaded from the repository, and basic preprocessing is applied to identify the column types. The specified target column is treated as the predictive label. To customize the loading process (e.g., use a different target column, load a subset of features, or handle missing data differently), additional parameters or a custom loader can be used.

Parameters

Read any

Based on MMM-Fair

Loads a dataset for analysis from either a pre-loaded pandas DataFrame or a file in one of the supported formats: .csv, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods, .json, .html, or .htm. The module accepts either a raw DataFrame or a file path (local or URL). If a file path is provided, the data is automatically loaded using the appropriate pandas function based on the file extension. Basic preprocessing is applied to infer column types, and the specified target column is treated as the predictive label.

To customize the loading process (e.g., load a subset of columns, handle missing values, or change column type inference), additional parameters or a custom loader function may be provided. The Data loader module is also recommended for loading and processing local data when training models that are intended to be tested with the ONNXEnsemble module.
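
The extension-based dispatch described above can be sketched roughly as follows. The mapping from extensions to pandas readers reflects what pandas itself supports, but READERS, pick_reader_name, and load_any are hypothetical names for this example, and the actual module may organize things differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# file extension -> name of the pandas reader function that handles it
READERS = {
    ".csv": "read_csv",
    ".xls": "read_excel", ".xlsx": "read_excel", ".xlsm": "read_excel",
    ".xlsb": "read_excel", ".odf": "read_excel", ".ods": "read_excel",
    ".json": "read_json",
    ".html": "read_html", ".htm": "read_html",
}


def pick_reader_name(source: str) -> str:
    """Return the pandas reader name for a local path or URL (hypothetical helper)."""
    # use only the path component so URL query strings do not hide the extension
    path = urlparse(source).path or source
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported file extension: {suffix!r}")
    return READERS[suffix]


def load_any(source):
    """Return a DataFrame from a DataFrame, a local path, or a URL."""
    import pandas as pd  # imported lazily so pick_reader_name works without pandas

    if isinstance(source, pd.DataFrame):
        return source
    loaded = getattr(pd, pick_reader_name(source))(source)
    # read_html returns a list of tables; this sketch keeps only the first one
    return loaded[0] if isinstance(loaded, list) else loaded
```

Some readers need optional dependencies (for example, an Excel engine for .xlsx or .xlsb files), so a format being listed here does not guarantee it loads in every environment.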

Parameters

Custom csv

Based on Pandas

Loads a CSV file that contains numeric, categorical, and predictive data columns separated by a user-defined delimiter. Each row corresponds to a different data sample; the first row may hold column names, which is detected automatically. To use all data in the file and automate discovery of numerical and categorical columns, as well as of delimiters, use the auto csv loader instead. Otherwise, set all loading parameters here. A pandas CSV reader is employed internally.
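
As a minimal sketch of what loading with explicit parameters might look like in plain pandas, the example below reads a pipe-delimited sample, keeps a chosen subset of columns, and names the numeric, categorical, and label columns by hand. The column names and sample data are made up for illustration, and the loader's own parameter names are not shown here.

```python
import io

import pandas as pd

# in-memory stand-in for a pipe-delimited file (illustrative data only)
raw = "age|city|income|outcome\n34|Athens|41000|accept\n29|Berlin|38000|reject\n"

# explicit delimiter and an explicit subset of columns
df = pd.read_csv(io.StringIO(raw), sep="|", usecols=["age", "city", "outcome"])

# column roles are declared manually instead of being inferred
numeric = ["age"]
categorical = ["city"]
label = "outcome"

X_num = df[numeric].to_numpy(dtype="float32")
# one-hot encode categorical columns so they can sit next to numeric ones
X_cat = pd.get_dummies(df[categorical]).to_numpy(dtype="float32")
y = df[label].to_numpy()
```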

Parameters

Csv rankings

This loader reads .csv files with information about researchers. The Path should be given relative to your locally running instance (e.g., ./data/researchers/Top_researchers.csv). The Delimiter should match the one used in your CSV file (e.g., '|').

Parameters
Path Delimiter

Researchers

This loader reads .csv files with information about researchers. The paperspath and papersaffiliations should be given relative to your locally running instance (e.g., ./data/researchers/Top_researchers.csv). The Delimiter should match the one used in your CSV file (e.g., '|').

Parameters

Graph

Loads the edges of a graph organized as rows of a comma-delimited file.
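
A minimal sketch of reading such an edge list with the standard library is shown below. It assumes each row holds two node identifiers and treats edges as undirected; both are assumptions of this example, and read_edges and to_adjacency are hypothetical helper names.

```python
import csv
import io
from collections import defaultdict


def read_edges(handle):
    """Read (source, target) pairs from comma-delimited rows."""
    edges = []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        edges.append((row[0].strip(), row[1].strip()))
    return edges


def to_adjacency(edges):
    """Build an undirected adjacency mapping from an edge list."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return dict(adjacency)


# in-memory stand-in for a file on disk
edges = read_edges(io.StringIO("a,b\nb,c\n\na,c\n"))
adjacency = to_adjacency(edges)
```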

Parameters

Images

Loads image data from a CSV file holding their sensitive and predictive attribute data, as well as paths relative to a root directory. Loaded images are subjected to a Python transformation.

Parameters

Image pairs

Loads image pairs declared in a CSV file. The expected format has the first image's identifier in the first column and the second image's identifier in the second column. Sensitive attributes can be selected from the rest of the columns. The image identifiers read from the columns are transformed to loading paths by string specifications that can contain the symbols: {root} to refer to the root directory, {col} to refer to the column name, and {id} to refer to the column entry.
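
The path specification described above behaves much like Python string formatting. The sketch below shows one way the three symbols could be expanded; the template, root directory, column names, and resolve_path helper are all made up for illustration.

```python
import csv
import io


def resolve_path(template: str, root: str, col: str, entry: str) -> str:
    """Expand {root}, {col}, and {id} in a path template (hypothetical helper)."""
    return template.format(root=root, col=col, id=entry)


# in-memory stand-in for the pairs file: two identifier columns, then attributes
csv_text = "img1,img2,gender\n0001,0002,f\n0003,0004,m\n"
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

template = "{root}/{col}/{id}.jpg"
root = "./data/faces"
pairs = [
    (
        resolve_path(template, root, header[0], record[0]),
        resolve_path(template, root, header[1], record[1]),
    )
    for record in records
]
```

With this template, the first row resolves to ./data/faces/img1/0001.jpg and ./data/faces/img2/0002.jpg.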

Parameters

Free text

Sets a free text that can be used by text-based AI to perform various kinds of analysis, such as detecting biases and sentiment. Some modules may also use this text as a prompt to feed into large language models (LLMs). You may optionally provide a website's URL (starting with http: or https:) to retrieve its textual contents.
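
One way to handle the two input kinds described above is to check the prefix and, for URLs, strip the page down to its visible text. The sketch below does this with the standard library only; it is an illustration under assumptions (UTF-8 pages, script and style contents ignored), not the module's actual retrieval logic, and html_to_text and resolve_text are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce an HTML document to its text fragments joined by spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def resolve_text(text_or_url: str) -> str:
    """Return the text itself, or the textual contents of the page at a URL."""
    if text_or_url.startswith(("http:", "https:")):
        with urlopen(text_or_url) as response:
            return html_to_text(response.read().decode("utf-8", errors="replace"))
    return text_or_url
```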

Parameters