Supervisor

Dr. Muhammad Iqbal

Programme

MSc in Data Analytics

Subject

Computer Science

Abstract

Constraints on the use of Real World Data (RWD) are growing. Privacy regulations, ethical concerns, and the cost of data collection have driven interest in AI-generated synthetic data as a potential alternative for predictive analytics. This project examines whether synthetic data can act as a reliable substitute for RWD in predictive modelling. It uses Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP) to generate synthetic reproductive health data and evaluates its predictive performance against RWD using linear regression and key metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and R².
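As an illustration of the evaluation step described above, the following is a minimal sketch of comparing a linear regression fitted on real data with one fitted on synthetic data, using MAE, MSE, and R². It is not the project's code. The toy arrays, the helper names (`fit_linear`, `predict`, `regression_metrics`), and the choice to score both models on the same held-out real data are assumptions made for the example, and the "synthetic" set here is a noisier random draw standing in for WGAN-GP output.

```python
import numpy as np


def fit_linear(X, y):
    """Fit ordinary least squares with an intercept; return the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients (intercept first) to a feature matrix."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, and R² for a set of predictions."""
    err = y_true - y_pred
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "MAE": float(np.mean(np.abs(err))),
        "MSE": float(np.mean(err ** 2)),
        "R2": 1.0 - ss_res / ss_tot,
    }


rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0, 0.5])  # hypothetical ground-truth weights

# Toy "real" data (assumption: stands in for the RWD used in the project).
X_real = rng.normal(size=(500, 3))
y_real = X_real @ true_coef + 1.0 + rng.normal(scale=0.1, size=500)

# Toy "synthetic" data (assumption: a noisier draw, NOT actual WGAN-GP output).
X_syn = rng.normal(size=(500, 3))
y_syn = X_syn @ true_coef + 1.0 + rng.normal(scale=0.3, size=500)

# Hold out part of the real data so both models are scored on the same rows.
X_train, y_train = X_real[:400], y_real[:400]
X_test, y_test = X_real[400:], y_real[400:]

results = {
    "real": regression_metrics(
        y_test, predict(fit_linear(X_train, y_train), X_test)
    ),
    "synthetic": regression_metrics(
        y_test, predict(fit_linear(X_syn, y_syn), X_test)
    ),
}

for name, metrics in results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

Scoring both models on the same held-out real rows keeps the two sets of metrics directly comparable. Other protocols, such as scoring each model on a held-out split of its own data, are equally plausible readings of the abstract.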

Through a comparative analysis, this research assesses the viability of synthetic data in predictive forecasting, drawing on both its strengths and limitations. The findings show that synthetic data created using machine learning, deep learning, and statistical validation can achieve predictive performance (97.8%) comparable to RWD (99.7%) under certain conditions (such as the number of epochs and the number of samples), offering an avenue for enhancing data accessibility while mitigating privacy risks. This research also investigates mode collapse and the parameters that can help mitigate it, with an R² result of 10% increased to 97.8% by altering the WGAN-GP generator.

By integrating privacy evaluations and ethical considerations, this study contributes to the ongoing discourse on synthetic data applications, offering insights into its role in advancing data-driven decision-making across healthcare.

Date of Award

2025

Full Publication Date

2025

Access Rights

open access

Document Type

Capstone Project

Resource Type

thesis

Included in

Data Science Commons
