Supervisor

Dr. Malika Benedechache

Programme

MSc in Data Analytics

Abstract

This research investigates the viability of anonymization and synthetic data generation for big data, so that data can be shared across borders outside the constraints of privacy laws. Such laws are being adopted around the world to protect individual identity and prevent open sharing of private data, and each sets out guidance on how data may be shared and the strict conditions under which that may occur. Two methods growing in popularity are anonymization of data, specifically k-Anonymity, l-Diversity and t-Closeness, and generation of synthetic data from a real dataset using machine learning techniques. This paper explores some of these techniques and aims to measure them effectively as a way for organizations to share big data outside the constraints of privacy laws. The areas of measurement addressed are risk, utility, and usability. A number of measurements are discussed in the paper and implemented in the artifact to allow comparative testing of different datasets. The focus of this paper is on healthcare and financial data. For anonymization, it was important to understand the quasi-identifiers within the datasets and the sensitive attributes that needed to be considered; these details were used to conduct the risk and utility measurements. Synthetic data needed to be measured to understand how similar it was to the real data and whether any of the real data leaked into it. The two approaches were measured separately, but for usability they were tested together across several machine learning models. Across both experiments, in healthcare and in finance, the results showed that anonymized data retained minimal utility while still introducing risk, whereas the synthetic data performed well, retained utility and demonstrated very low risk. That said, the usability measure showed that synthetic data, while close, does not perform exactly the same as the real data, which could be an issue depending on the use case. In conclusion, the synthetic version of the anonymized data appears to be a viable option that could be shared with low risk, good utility and potentially good usability.

Keywords:

Date of Award

2025

Full Publication Date

2025

Access Rights

open access

Document Type

Capstone Project

Resource Type

thesis

Included in

Data Science Commons
