Synthetic Tabular Data Generation for Privacy-Preserving Machine Learning
Abstract
The increasing demand for machine learning models in sensitive domains such as finance and healthcare has raised significant privacy concerns about training on real-world data. Synthetic tabular data generation offers a promising solution by creating artificial datasets that preserve the statistical properties of the original data while mitigating privacy risks. In this paper, we present a comprehensive experimental study on generating privacy-preserving synthetic tabular data using three state-of-the-art generative models: CTGAN, TVAE, and Gaussian Copula. Using real-world datasets, including UCI Adult Income and U.S. Medical Cost, we compare the generated synthetic data on three key metrics: utility (measured by downstream task performance), fidelity (statistical similarity to the original data), and privacy risk (susceptibility to membership inference attacks). Our results show that CTGAN achieves superior utility in classification tasks, while Gaussian Copula offers higher privacy robustness. We also propose a hybrid generation-evaluation pipeline that balances data utility and privacy. These findings provide critical insights for practitioners seeking to deploy synthetic data in regulated environments.
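To make the fidelity and privacy-risk metrics concrete, the following minimal Python sketch shows one way such quantities could be computed. It is illustrative only: the abstract does not state which fidelity statistic or which membership inference attack the study used, so the total variation distance, the nearest-record score, and the function names `total_variation_distance` and `nearest_synthetic_distance` are assumptions made for this example, not the authors' method.

```python
from collections import Counter
import math


def total_variation_distance(real_values, synthetic_values):
    """Fidelity proxy (assumed here, not taken from the paper).

    Total variation distance between the empirical distributions of one
    categorical column: 0.0 means identical marginals, 1.0 means the two
    columns share no categories at all.
    """
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    n_real, n_synth = len(real_values), len(synthetic_values)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


def nearest_synthetic_distance(candidate, synthetic_rows):
    """Privacy-risk proxy (assumed here, not taken from the paper).

    Euclidean distance from a candidate record to its closest synthetic
    record. A simple distance-based membership inference attack would
    guess "member of the training set" when this distance is unusually
    small compared with records known to be outside the training set.
    """
    return min(math.dist(candidate, row) for row in synthetic_rows)


# Toy values, invented purely to exercise the functions.
real_column = ["a", "a", "b", "b"]
synthetic_column = ["a", "b", "b", "b"]
tvd = total_variation_distance(real_column, synthetic_column)

synthetic_rows = [(3.0, 4.0), (10.0, 10.0)]
distance = nearest_synthetic_distance((0.0, 0.0), synthetic_rows)
```

Utility, the third metric, is measured by downstream task performance, which would require training a classifier and is omitted here to keep the sketch free of dependencies outside the standard library.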
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.
Mind forge Academia also operates under the Creative Commons Licence CC-BY 4.0. This licence allows anyone to copy and redistribute the material in any medium or format for any purpose, even commercially, provided that appropriate citation information is given.