We used MMM-fair to create and compare intersectional distributions. The prediction target occupies the inner disk, and each ring represents a sensitive attribute. Segments correspond to intersectional subgroups formed by combining each attribute's values with the coarser (sub)groups of the rings inside it. Click on an inner-disk partition, then the first ring, and so on, to progressively focus on subgroups of deeper intersections; click the inner disk while it is focused on a specific partition to step back. Hover over a segment to see the intersectional group it represents and the proportion of samples in the population that fall into that group.
Fairness work does not end once an AI system produces outputs.
💡 Continue interacting with stakeholders to verify that their notion of fairness has been implemented correctly. Keep a balance between justifying outputs as part of a fair process and accommodating constructive criticism. Do not over-rely on technical justification, and ensure meaningful human oversight whenever AI systems are deployed in decision-making, high-stakes, or rights-impacting contexts. Human oversight prevents overreliance on imperfect models, catches context-specific errors, and enables ethical judgment, accountability, and recourse for affected people.
An interactive sunburst chart visualizes how subgroups form and how large or small they are relative to the total dataset, summarizing the distribution of data across the protected sensitive attributes and the prediction target. Sensitive attributes are represented as concentric rings, where each segment corresponds to an intersectional subgroup. Hover over a segment to view its subgroup path and its proportion of the dataset, and click on it to focus on that particular intersection.
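The proportions shown in the chart are simply the shares of samples falling into each intersection of the target and the sensitive attributes. A minimal sketch of that computation, using a hypothetical toy dataset with two sensitive attributes (`sex`, `race`) and a binary `target`:

```python
import pandas as pd

# Hypothetical toy dataset: two sensitive attributes plus a binary target.
df = pd.DataFrame({
    "sex":    ["F", "F", "M", "M", "M", "F"],
    "race":   ["A", "B", "A", "A", "B", "A"],
    "target": [1, 0, 1, 0, 0, 1],
})

# Each (target, sex, race) combination is one intersectional subgroup;
# its proportion is the share of all samples that fall into that group.
proportions = df.groupby(["target", "sex", "race"]).size() / len(df)
print(proportions)
```

Each level of the resulting index corresponds to one ring of the sunburst, and the proportions over all leaf segments sum to 1.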
This report also contains bar charts that compare the original and augmented distributions for each strategy, as well as references and research findings that you can consult. The annotation r_aug indicates the fraction of synthetic samples added to the dataset under that strategy.
Sampling strategies dictate how many synthetic samples to generate for each subgroup when creating the final augmented dataset. The following strategies are compared:
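To make the idea concrete, here is a minimal sketch of one such strategy; full balancing, which oversamples every subgroup up to the size of the largest one. This is an illustrative example, not necessarily one of the strategies compared in the report:

```python
from collections import Counter

def samples_to_generate(group_sizes):
    """For a full-balancing strategy (illustrative example): generate
    enough synthetic samples to bring every subgroup up to the size
    of the largest subgroup."""
    largest = max(group_sizes.values())
    return {group: largest - n for group, n in group_sizes.items()}

# Hypothetical subgroup sizes keyed by a "sex/race" intersection label.
counts = Counter({"F/A": 50, "F/B": 10, "M/A": 30, "M/B": 60})

# The smallest subgroup (F/B) receives the most synthetic samples;
# the largest (M/B) receives none.
print(samples_to_generate(counts))
```

Other strategies differ only in how they map subgroup sizes to per-subgroup generation counts.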
For more information, refer to our full paper: "Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study"
The following plots visualize the impact of these strategies on the data distribution. In them, r_aug denotes the fraction of synthetic samples in the final dataset, indicating how strongly the dataset has been augmented.
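Given the definition above, r_aug follows directly from the original and synthetic sample counts. A one-line sketch:

```python
def r_aug(n_original, n_synthetic):
    """Fraction of synthetic samples in the final augmented dataset."""
    return n_synthetic / (n_original + n_synthetic)

# e.g., adding 200 synthetic rows to 800 original rows
print(r_aug(800, 200))  # → 0.2
```

An r_aug of 0 means no augmentation was applied; values approaching 1 mean the final dataset is dominated by synthetic samples.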
Our study compared five state-of-the-art generative methods for synthetic tabular data generation:
The experiments across four real-world datasets (Adult, German credit, Dutch census, and Credit card clients) revealed that:
[1] Patki, N., Wedge, R., Veeramachaneni, K.: The synthetic data vault. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 399-410. IEEE (2016)
[2] Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using Conditional GAN. Advances in Neural Information Processing Systems 32 (2019)
[3] Reiter, J.P.: Using CART to generate partially synthetic public use microdata. Journal of Official Statistics 21(3), 441 (2005)
[4] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321-357 (2002)