Synthetic Tabular Data Generation for Privacy-Preserving Machine Learning

Emory Callahan
Liora MacNeill

Abstract

The growing use of machine learning in sensitive domains such as finance and healthcare has raised significant privacy concerns about training on real-world data. Synthetic tabular data generation offers a promising solution: it creates artificial datasets that preserve the statistical properties of the original data while mitigating privacy risks. In this paper, we present a comprehensive experimental study of privacy-preserving synthetic tabular data generation using three state-of-the-art generative models: CTGAN, TVAE, and Gaussian Copula. Using real-world datasets, including UCI Adult Income and the U.S. Medical Cost dataset, we compare the generated synthetic data on three criteria: utility (measured by downstream task performance), fidelity (statistical similarity to the original data), and privacy risk (susceptibility to membership inference attacks). Our results show that CTGAN achieves the highest utility in classification tasks, while Gaussian Copula offers greater privacy robustness. We also propose a hybrid generation-evaluation pipeline that balances data utility and privacy. These findings offer practical guidance for practitioners seeking to deploy synthetic data in regulated environments.
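
The three evaluation criteria named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`ks_statistic`, `tstr_accuracy`, `mia_auc`) are hypothetical, the nearest-centroid classifier stands in for whatever downstream model the study used, and the distance-based attack is one simple form of membership inference chosen here for brevity. All features are assumed to be numeric.

```python
import math
from bisect import bisect_right


def ks_statistic(real_col, synth_col):
    """Fidelity: two-sample Kolmogorov-Smirnov statistic for one numeric column.
    0 means identical empirical CDFs; 1 means completely separated samples."""
    r, s = sorted(real_col), sorted(synth_col)
    return max(
        abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
        for x in set(r) | set(s)
    )


def tstr_accuracy(synth_X, synth_y, real_X, real_y):
    """Utility: train on synthetic rows, test on real rows (TSTR).
    Assumption: a nearest-centroid classifier replaces the study's model."""
    centroids = {}
    for label in set(synth_y):
        rows = [x for x, y in zip(synth_X, synth_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    hits = 0
    for x, y in zip(real_X, real_y):
        pred = min(centroids, key=lambda c: math.dist(x, centroids[c]))
        hits += pred == y
    return hits / len(real_y)


def mia_auc(members, non_members, synth_X):
    """Privacy: AUC of a distance-based membership inference attack.
    The attacker scores a record as more likely a training member the closer
    it lies to some synthetic row. 0.5 is chance; 1.0 is a perfect attack."""
    def score(x):
        return -min(math.dist(x, s) for s in synth_X)

    m = [score(x) for x in members]
    n = [score(x) for x in non_members]
    # Fraction of (member, non-member) pairs ranked correctly; ties count half.
    wins = sum((a > b) + 0.5 * (a == b) for a in m for b in n)
    return wins / (len(m) * len(n))
```

In use, `members` would be rows from the generator's training set and `non_members` rows from a held-out split of the same source; an AUC well above 0.5 suggests the synthetic data leaks information about which records were used for training.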

Article Details

How to Cite
Callahan, E., & MacNeill, L. (2025). Synthetic Tabular Data Generation for Privacy-Preserving Machine Learning. Journal of Computer Science and Software Applications, 5(7). Retrieved from https://mfacademia.org/index.php/jcssa/article/view/236