Datasets
Auto csv
tabular data with common formatting
Uses pandas to load a CSV file that contains numeric, categorical, and predictive data columns. The loader automatically detects the characteristics of the dataset, namely the delimiter that separates the columns and whether each column contains numeric or categorical data. The last categorical column is used as the dataset label. For more control over loading options (e.g., a subset of columns or a different label column), use the custom csv loader instead.
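The source does not say how the delimiter is detected. As a minimal sketch of one possible approach, the standard library's `csv.Sniffer` can guess it from a text sample. The helper name `sniff_delimiter` and the candidate delimiter set are assumptions made for illustration, not part of the loader.

```python
import csv


def sniff_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the column delimiter of a CSV text sample (hypothetical helper)."""
    # Sniffer looks for a candidate character that occurs a consistent
    # number of times on every line of the sample.
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


sample = "age;income;label\n31;42000;yes\n45;38000;no\n"
print(sniff_delimiter(sample))
```

The detected delimiter could then be passed on as the `sep` argument of `pandas.read_csv`.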
How to replicate this data loader during AI creation?
If you want to train a model while using the same loading mechanism as this dataset,
run the following Python script. This uses supporting methods from the lightweight
mammoth-commons core to retrieve numpy
arrays X and y, which hold the dataset features and categorical labels respectively.

    % pip install --upgrade pandas
    % pip install --upgrade mammoth_commons
    import numpy as np
    import pandas as pd
    from mammoth_commons.externals import pd_read_csv
    from mammoth_commons.datasets import CSV

    # set parameters and load data (modify max_discrete as needed)
    path = ...
    max_discrete = 10
    df = pd_read_csv(path, on_bad_lines="skip")

    # identify numeric and categorical columns
    num = [col for col in df if pd.api.types.is_any_real_numeric_dtype(df[col])]
    num = [col for col in num if len(set(df[col])) > max_discrete]
    num_set = set(num)
    cat = [col for col in df if col not in num_set]

    # convert to numpy data
    csv_dataset = CSV(df, num=num, cat=cat[:-1], labels=cat[-1])
    X = ...  # feature matrix taken from csv_dataset; the original snippet does not show this step
    X = X.astype(np.float32)
    y = df[cat[-1]]

Uci

tabular datasets from UCI
Loads a dataset from the UCI Machine Learning Repository. Each dataset comes with pre-specified numeric, categorical, and predictive data columns, as well as preprocessing. The available datasets are commonly used in the algorithmic fairness literature to test new approaches.
Custom csv
tabular data with custom formatting
Uses pandas to load a CSV file with a custom specification of numeric, categorical, and predictive data columns. Each row corresponds to a different data sample. The first row may hold column names, which is detected automatically.
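To make the idea of a custom column specification concrete, here is a minimal sketch that splits rows into numeric features, categorical features, and labels. It uses the standard `csv` module so that it runs without extra dependencies, whereas the loader itself uses pandas. The function name `load_custom_csv`, its arguments, and the sample data are hypothetical, and the sketch assumes a header row is present.

```python
import csv
import io


def load_custom_csv(text, numeric, categorical, label, delimiter=","):
    """Split CSV rows into numeric features, categorical features, and labels."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    x_num, x_cat, y = [], [], []
    for row in reader:
        # numeric columns are parsed as floats, categorical ones kept as strings
        x_num.append([float(row[col]) for col in numeric])
        x_cat.append([row[col] for col in categorical])
        y.append(row[label])
    return x_num, x_cat, y


text = "age,city,income,hired\n31,Athens,42000,yes\n45,Berlin,38000,no\n"
x_num, x_cat, y = load_custom_csv(
    text, numeric=["age", "income"], categorical=["city"], label="hired"
)
print(x_num, x_cat, y)
```

Columns that are not named in any of the three arguments are simply ignored, which is how a subset of columns can be selected.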
Csv rankings
anonymized researcher characteristics
Uses pandas to load a CSV file with information about researcher citations, productivity, gender, nationality, country region, and income.
Researchers
researcher papers and affiliations
Loads CSV files from URLs that contain information about citations between researchers, as well as their affiliations.
Graph
graph
Loads the edges of a graph organized as rows of a comma-delimited file.
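As an illustration of the file layout, the sketch below reads one edge per row and builds an adjacency structure. The source does not specify the column layout or whether edges are directed. Treating the first two columns as endpoints and the graph as undirected are assumptions, and `load_edges` and `to_adjacency` are hypothetical names.

```python
import csv
import io
from collections import defaultdict


def load_edges(text, delimiter=","):
    """Read one edge per row, taking the first two columns as endpoints."""
    edges = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) < 2:  # skip blank or malformed rows
            continue
        edges.append((row[0].strip(), row[1].strip()))
    return edges


def to_adjacency(edges):
    """Build an undirected adjacency mapping (an assumption of this sketch)."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


edges = load_edges("a,b\nb,c\n\nc,a\n")
adjacency = to_adjacency(edges)
print(edges)
```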
Images
images with metadata
Loads image data from a CSV file that contains their sensitive and predictive attribute data, as well as paths relative to a root directory. Loaded images are accompanied by a preprocessing transformation.
Image pairs
pairs of images with metadata
Loads image pairs and tabular metadata declared in a CSV file. Images are stored in an independent location (to avoid moving large amounts of data around), and must be accompanied by their preprocessing transformation. Metadata describe the pair, such as prediction targets or whether at least one of the images exhibits a sensitive attribute. For example, in face verification the prediction target can be whether both images of the pair refer to the same person or not.
How to construct an image pair file?
The expected format has the first image's identifier in the first column
and the second image's identifier in the second column. Sensitive attributes
can be selected from the rest of the columns. The image identifiers read from the columns
are transformed to loading paths by string specifications that can contain the
symbols: {root} to refer to the root directory, {col} to refer to the column name, and {id}
to refer to the column entry.
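The placeholder substitution described above can be sketched with Python's `str.format`. This is only an illustration of how such a specification might expand. The helper name `resolve_path`, the example specification, and the column names `img1` and `img2` are hypothetical.

```python
def resolve_path(spec, root, col, entry):
    """Expand {root}, {col}, and {id} in a path specification."""
    return spec.format(root=root, col=col, id=entry)


spec = "{root}/{col}/{id}.jpg"
row = {"img1": "0042", "img2": "0107", "same_person": "1"}

# the first two columns hold the identifiers of the paired images
paths = [resolve_path(spec, "data/faces", col, row[col]) for col in ("img1", "img2")]
print(paths)
```

A specification that omits a symbol, such as `{root}/{id}.png`, still works because unused keyword arguments are ignored by `str.format`.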
Free text
input text or document URL
Sets a free text that can be used by text-based AI to perform various kinds of analysis, such as detecting biases and sentiment. Some modules may also use this text as a prompt to feed into large language models (LLMs). You may optionally provide a website's URL (starting with http: or https:) to retrieve its textual contents.
Read any

flexible tabular data loading
Uses pandas to load a file stored in one of the formats:
.csv, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods, .json, .html, or .htm.
The module is derived from MMM-fair
to support several data formats. Basic preprocessing is applied
to infer column types, and the specified target column is treated as the predictive label.
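One way to support the listed formats is to choose a pandas reader based on the file extension. The sketch below is an assumption about how such a dispatch could look, not the module's or MMM-fair's actual code. `READERS`, `reader_name`, and `read_any` are hypothetical names. Some formats need optional pandas dependencies (for example an Excel or OpenDocument engine).

```python
from pathlib import Path

# extension -> name of the pandas reader function (assumed mapping)
READERS = {
    ".csv": "read_csv",
    ".xls": "read_excel",
    ".xlsx": "read_excel",
    ".xlsm": "read_excel",
    ".xlsb": "read_excel",
    ".odf": "read_excel",
    ".ods": "read_excel",
    ".json": "read_json",
    ".html": "read_html",
    ".htm": "read_html",
}


def reader_name(path):
    """Return the name of the pandas reader that matches the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix!r}")
    return READERS[suffix]


def read_any(path):
    """Load a supported file into a DataFrame."""
    import pandas as pd  # imported lazily so the mapping works without pandas

    result = getattr(pd, reader_name(path))(path)
    # read_html returns a list of DataFrames; keep the first table
    return result[0] if isinstance(result, list) else result


print(reader_name("data/adult.CSV"))
```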