Datasets

Auto csv

csv

tabular data with common formatting

Uses pandas to load a CSV file that contains numeric, categorical, and predictive data columns. The loader automatically detects the characteristics of the dataset, namely the delimiter that separates the columns and whether each column contains numeric or categorical data. The last categorical column is used as the dataset label. For more control over loading options (e.g., a subset of columns or a different label column), use the custom csv loader instead.
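
To illustrate what delimiter detection involves, here is a minimal sketch based on the standard library's csv.Sniffer. It is an assumption that this resembles the loader's behavior; the actual detection inside mammoth-commons may work differently, and the function name detect_delimiter is hypothetical.

```python
import csv


def detect_delimiter(sample, candidates=",;\t|"):
    """Guess the column delimiter from a text sample (hypothetical helper)."""
    # csv.Sniffer looks for a candidate character that occurs
    # the same number of times on every sampled line.
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


sample = "a;b;c\n1;2;3\n4;5;6\n"
delimiter = detect_delimiter(sample)
```

The detected character could then be passed to pandas through the sep argument of read_csv.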

How to replicate this data loader during AI creation? If you want to train a model with the same loading mechanism as this dataset, run the following Python script. It uses supporting methods from the lightweight mammoth-commons core to retrieve numpy arrays X and y, which hold the dataset features and categorical labels respectively.

# install the dependencies first, e.g. from a shell:
#   pip install --upgrade pandas
#   pip install --upgrade mammoth_commons
import numpy as np
import pandas as pd
from mammoth_commons.externals import pd_read_csv
from mammoth_commons.datasets import CSV

# set parameters and load data (modify max_discrete as needed)
path = ...
max_discrete = 10
df = pd_read_csv(path, on_bad_lines="skip")

# identify numeric and categorical columns
num = [col for col in df if pd.api.types.is_any_real_numeric_dtype(df[col])]
num = [col for col in num if len(set(df[col])) > max_discrete]
num_set = set(num)
cat = [col for col in df if col not in num_set]

# convert to numpy data
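csv_dataset = CSV(df, num=num, cat=cat[:-1], labels=cat[-1])
# assumption: to_pred([]) returns the feature matrix with no columns excluded;
# verify the accessor name against your mammoth_commons version
X = csv_dataset.to_pred([])
X = X.astype(np.float32)
y = df[cat[-1]]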

Parameters

Uci

UCI

tabular datasets from UCI

Loads a dataset from the UCI Machine Learning Repository. Each dataset comes with pre-specified numeric, categorical, and predictive data columns, as well as its own preprocessing. The available datasets are commonly used in the algorithmic fairness literature to test new approaches.

Parameters

Custom csv

csv

tabular data with custom formatting

Uses pandas to load a CSV file with a custom specification of its numeric, categorical, and predictive data columns. Each row corresponds to a different data sample; the first row may hold column names, which is detected automatically.
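
As a rough illustration of automatic header detection, the sketch below uses the heuristic in the standard library's csv.Sniffer.has_header. This is an assumption for illustration only; the loader may rely on a different rule, and the function name first_row_is_header is hypothetical.

```python
import csv


def first_row_is_header(sample):
    """Heuristically decide whether the first row holds column names."""
    # has_header compares the first row against the types and lengths
    # of the values found in the rows that follow it.
    return csv.Sniffer().has_header(sample)


with_names = "name,age\nalice,30\nbob,25\ncarol,41\n"
without_names = "1,2\n3,4\n5,6\n"
```

With pandas, the result could decide between header=0 and header=None when calling read_csv.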

Parameters

Csv rankings

csv

anonymized researcher characteristics

Uses pandas to load a CSV file with information about researcher citations, productivity, gender, nationality, country region, and income.

Parameters

Researchers

graph

researcher papers and affiliations

Loads CSV files from URLs that contain information about citations between researchers, as well as their affiliations.

Parameters

Graph

graph

graph

Loads the edges of a graph organized as rows of a comma-delimited file.
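
A minimal sketch of reading such an edge file with the standard library follows. It assumes that the first two columns of each row are the edge endpoints and that edges are directed; the source does not state either, and the helper names are hypothetical.

```python
import csv
from collections import defaultdict
from io import StringIO


def read_edges(handle):
    """Return (source, target) pairs from the first two columns of each row."""
    edges = []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        edges.append((row[0].strip(), row[1].strip()))
    return edges


def to_adjacency(edges):
    """Group edge targets by their source node."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    return dict(adjacency)


# an in-memory stand-in for a comma-delimited edge file
sample = StringIO("a,b\nb,c\n\na,c\n")
edges = read_edges(sample)
adjacency = to_adjacency(edges)
```

For a file on disk, open it with newline="" and pass the handle to read_edges.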

Parameters

Images

image

images with metadata

Loads image data from a CSV file that contains their sensitive and predictive attribute data, as well as paths relative to a root directory. Loaded images are accompanied by a preprocessing transformation.

Parameters

Image pairs

images

pairs of images with metadata

Loads image pairs and tabular metadata declared in a CSV file. Images are stored in an independent location (to avoid moving around large amounts of data), and must be accompanied by their preprocessing transformation. Metadata include information such as prediction targets for the pair, or whether at least one of the images exhibits a sensitive attribute. For example, in face verification the prediction target can be whether both images of the pair refer to the same person or not.

How to construct an image pair file? The expected format has the first image's identifier in the first column and the second image's identifier in the second column. Sensitive attributes can be selected from the rest of the columns. The image identifiers read from the columns are transformed to loading paths by string specifications that can contain the symbols: {root} to refer to the root directory, {col} to refer to the column name, and {id} to refer to the column entry.
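
The path specification can be illustrated with Python's str.format. This is a sketch under the assumption that the symbols are substituted literally; the function name build_path, the column names, and the example paths are hypothetical.

```python
def build_path(spec, root, col, entry):
    """Fill {root}, {col}, and {id} in a path specification."""
    return spec.format(root=root, col=col, id=entry)


# one row of a hypothetical pair file with columns img1 and img2
row = {"img1": "0042", "img2": "0077"}
spec = "{root}/{col}/{id}.jpg"
paths = [build_path(spec, "/data/faces", col, row[col]) for col in ("img1", "img2")]
```

Here each identifier is resolved under a subdirectory named after its column; a specification such as "{root}/{id}.png" would ignore the column name instead.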

Parameters

Free text

text

input text or document URL

Sets free text that text-based AI can use to perform various kinds of analysis, such as detecting biases and sentiment. Some modules may also use this text as a prompt for large language models (LLMs). You may optionally provide a website's URL (starting with http: or https:) to retrieve its textual contents.
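
The distinction between plain text and a URL can be sketched with the standard library as follows. This is an illustration, not the module's implementation: how the page is fetched and how text is extracted from it are assumptions, and the names resolve_text and html_to_text are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script and style contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Reduce an HTML document to its space-joined text fragments."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def resolve_text(value):
    """Return the text itself, or the page text if value is a URL."""
    if value.startswith(("http:", "https:")):
        with urlopen(value) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return html_to_text(response.read().decode(charset, errors="replace"))
    return value
```

Anything that does not start with http: or https: is passed through unchanged, so ordinary prompts are never sent over the network.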

Parameters

Read any

MMM-Fair

flexible tabular data loading

Uses pandas to load a file stored in one of the following formats: .csv, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods, .json, .html, or .htm. The module is derived from MMM-Fair to support several data formats. Basic preprocessing is applied to infer column types, and the specified target column is treated as the predictive label.
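
One way to picture format-based loading is a lookup from file extension to the matching pandas reader. The mapping below is a sketch that assumes the module dispatches on the extension; the function name reader_name_for is hypothetical.

```python
from pathlib import Path

# extension -> name of the pandas function that reads this format
READERS = {
    ".csv": "read_csv",
    ".xls": "read_excel",
    ".xlsx": "read_excel",
    ".xlsm": "read_excel",
    ".xlsb": "read_excel",
    ".odf": "read_excel",
    ".ods": "read_excel",
    ".json": "read_json",
    ".html": "read_html",
    ".htm": "read_html",
}


def reader_name_for(path):
    """Pick the pandas reader name for a path based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported file format: {suffix or path}")
    return READERS[suffix]
```

With pandas imported as pd, getattr(pd, reader_name_for(path))(path) would then load the file. Note that read_html returns a list of DataFrames rather than a single one, and that some spreadsheet formats need an extra engine installed (for example pyxlsb for .xlsb or odfpy for OpenDocument files).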

Parameters