We used MMM-fair to create and compare intersectional distributions. The prediction target occupies the inner disk, and each ring represents a sensitive attribute. Segments correspond to intersectional subgroups formed by combining each attribute's values with the coarser (sub)groups of the rings inside it. Click on an inner-disk partition, then the first ring, and so on, to progressively focus on subgroups of deeper intersections; click the inner disk while it is focused on a specific partition to step back. Hover over a segment to see the intersectional group it represents and the proportion of samples in the population that fall into that group.
Fairness work does not end once an AI system produces outputs.
💡 Continue interacting with stakeholders to verify that their notion of fairness has been implemented correctly. Keep a balance between justifying outputs as part of a fair process and accommodating constructive criticism. Do not over-rely on technical justification, and ensure meaningful human oversight whenever AI systems are deployed in decision-making, high-stakes, or rights-impacting contexts. Human oversight prevents overreliance on imperfect models, catches context-specific errors, and enables ethical judgment, accountability, and recourse for affected people.
An interactive sunburst chart visualizes how subgroups form and how large or small they are relative to the total dataset, summarizing the distribution of data across the protected sensitive attributes and the prediction target. Sensitive attributes are represented as concentric rings, where each segment corresponds to an intersectional subgroup. Hover over a segment to view its subgroup path and its proportion of the dataset, and click on it to focus on that particular intersection.
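The proportions shown in the chart are simply the shares of samples falling into each intersection of the target and the sensitive attributes. A minimal sketch of that computation, using a hypothetical toy dataset with two sensitive attributes (`sex`, `race`) and a binary `target`:

```python
import pandas as pd

# Hypothetical toy dataset: two sensitive attributes plus a binary target.
df = pd.DataFrame({
    "sex":    ["F", "F", "M", "M", "M", "F"],
    "race":   ["A", "B", "A", "A", "B", "A"],
    "target": [1, 0, 1, 0, 0, 1],
})

# Each (target, sex, race) combination is one intersectional subgroup;
# its proportion is the share of all samples that fall into that group.
proportions = df.groupby(["target", "sex", "race"]).size() / len(df)
print(proportions)
```

Each level of the resulting index corresponds to one ring of the sunburst, and the proportions over all leaf segments sum to 1.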
This report also contains bar charts that compare the original and augmented distributions for each strategy, as well as references and research findings that you can consult. The annotation r_aug indicates the fraction of synthetic samples added to the dataset under that strategy.
Sampling strategies dictate how many synthetic samples to generate for each subgroup when creating the final augmented dataset. The following strategies are compared:
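To make the idea concrete, here is a minimal sketch of one such strategy; full balancing, which oversamples every subgroup up to the size of the largest one. This is an illustrative example, not necessarily one of the strategies compared in the report:

```python
from collections import Counter

def samples_to_generate(group_sizes):
    """For a full-balancing strategy (illustrative example): generate
    enough synthetic samples to bring every subgroup up to the size
    of the largest subgroup."""
    largest = max(group_sizes.values())
    return {group: largest - n for group, n in group_sizes.items()}

# Hypothetical subgroup sizes keyed by a "sex/race" intersection label.
counts = Counter({"F/A": 50, "F/B": 10, "M/A": 30, "M/B": 60})

# The smallest subgroup (F/B) receives the most synthetic samples;
# the largest (M/B) receives none.
print(samples_to_generate(counts))
```

Other strategies differ only in how they map subgroup sizes to per-subgroup generation counts.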
For more information, refer to our full paper: "Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study"
The following plots visualize the impact of these strategies on the data distribution. In them, r_aug denotes the fraction of synthetic samples in the final dataset, indicating how strongly the dataset has been augmented.
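Given the definition above, r_aug follows directly from the original and synthetic sample counts. A one-line sketch:

```python
def r_aug(n_original, n_synthetic):
    """Fraction of synthetic samples in the final augmented dataset."""
    return n_synthetic / (n_original + n_synthetic)

# e.g., adding 200 synthetic rows to 800 original rows
print(r_aug(800, 200))  # → 0.2
```

An r_aug of 0 means no augmentation was applied; values approaching 1 mean the final dataset is dominated by synthetic samples.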
Our study compared five state-of-the-art generative methods for synthetic tabular data generation:
The experiments across four real-world datasets (Adult, German credit, Dutch census, and Credit card clients) revealed that:
[1] Patki, N., Wedge, R., Veeramachaneni, K.: The synthetic data vault. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 399-410. IEEE (2016)
[2] Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using Conditional GAN. Advances in Neural Information Processing Systems 32 (2019)
[3] Reiter, J.P.: Using CART to generate partially synthetic public use microdata. Journal of Official Statistics 21(3), 441 (2005)
[4] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321-357 (2002)