Supervisor

Dr. Muhammad Iqbal

Programme

MSc in Data Analytics

Subject

Computer Science

Abstract

Policy lapses are a problem across the insurance industry. They affect a company’s profitability, cash flows and solvency. High levels of lapses can cause reputational damage, which could in turn provoke a cycle of even more lapses. It is therefore incumbent on a company to do its utmost to retain the business it has written for its full term.

If a company could predict which of its policies were about to lapse, it could proactively attempt to prevent those lapses by contacting the policyholder and engaging in a discussion to ascertain the likelihood of their choosing to leave. In this paper, various machine learning tools are employed on a set of life company policy and client data. The tools include sentiment analysis, Random Forests, Artificial Neural Networks (ANN), k-Nearest Neighbour (kNN) and Support Vector Machine (SVM) classification. The metrics examined include sentiment over time, and the confusion matrices and accuracy of the Random Forest, ANN, kNN and SVM models.
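As an illustration of the confusion matrix and accuracy metrics mentioned above, the following is a minimal sketch for a binary lapse classifier. The labels "lapsed" and "in_force", the function names and the sample data are all hypothetical and are not taken from the thesis.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive="lapsed"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return {key: counts[key] for key in ("tp", "fp", "fn", "tn")}


def accuracy(cm):
    """Share of all cases that were classified correctly."""
    total = sum(cm.values())
    return (cm["tp"] + cm["tn"]) / total if total else 0.0


# Made-up labels for six policies (illustrative only).
actual = ["lapsed", "in_force", "in_force", "lapsed", "in_force", "in_force"]
predicted = ["lapsed", "in_force", "lapsed", "in_force", "in_force", "in_force"]

cm = confusion_matrix(actual, predicted)
acc = accuracy(cm)
print(cm, round(acc, 4))
```

Accuracy alone can hide poor performance on the minority class, which is one reason to inspect the full confusion matrix when lapsed policies are much rarer than in-force ones.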

The data is derived from one company’s life assurance policy and policyholder data as stored on its administration systems and extracted to a SQL database. As there are two different systems, the data from both had to be transformed into a canonical format. In addition, because the models used work best with numerical data, some categorical data had to be transformed into numerical data. Separately, data reflecting policyholder sentiment was also captured.
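The abstract does not say which encoding was used to turn categorical fields into numbers. One common option is one-hot encoding, sketched below under that assumption. The column name payment_frequency, the function name and the sample rows are hypothetical.

```python
def one_hot_encode(rows, column):
    """Replace one categorical column with a 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded, categories


# Made-up policy rows (illustrative only).
policies = [
    {"policy_id": 1, "premium": 50.0, "payment_frequency": "monthly"},
    {"policy_id": 2, "premium": 600.0, "payment_frequency": "annual"},
    {"policy_id": 3, "premium": 45.0, "payment_frequency": "monthly"},
]

encoded, categories = one_hot_encode(policies, "payment_frequency")
print(categories)
print(encoded[0])
```

One-hot encoding avoids implying an order between categories, which matters for distance-based models such as kNN and SVM.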

The modelling found that the Random Forest model was the most accurate, with accuracies of 88.18% for single life and 83.42% for joint life data. The kNN model ranked second, with accuracies of 87.53% and 81.38%, followed by the ANN, with accuracies of 87.83% and 79.69% respectively. (The kNN is ranked above the ANN, despite its slightly lower single life accuracy, because of its better overall performance.) A Support Vector Machine (trained on the optimal parameters found by a 5-fold cross-validation on 12 combinations of parameters) correctly identified 84.99% of cases.
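The parameter search described above (5-fold cross-validation over 12 parameter combinations) can be sketched as follows. The thesis tuned an SVM. To keep the sketch dependency-free, a small kNN classifier is swapped in as a stand-in, and the grid of 4 values of k by 3 distance metrics is a hypothetical choice made only to give 12 combinations. The data is synthetic.

```python
import itertools
import random

DISTANCES = {
    "manhattan": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    "euclidean": lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5,
    "chebyshev": lambda a, b: max(abs(x - y) for x, y in zip(a, b)),
}


def k_fold_indices(n, n_folds, seed=0):
    """Shuffle indices 0..n-1 and deal them into n_folds roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def knn_predict(train_X, train_y, x, k, metric):
    """Majority vote among the k nearest training points (stand-in model)."""
    dist = DISTANCES[metric]
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)


def cross_val_accuracy(X, y, folds, k, metric):
    """Mean held-out accuracy across the folds for one parameter combination."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k, metric) == y[i] for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def grid_search(X, y, grid, n_folds=5):
    """Score every combination in the grid and return the best one."""
    folds = k_fold_indices(len(X), n_folds)
    combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
    results = [(cross_val_accuracy(X, y, folds, **params), params) for params in combos]
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results


# Synthetic two-class data: two well-separated clusters (illustrative only).
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
X += [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(30)]
y = [0] * 30 + [1] * 30

grid = {"k": [1, 3, 5, 7], "metric": ["manhattan", "euclidean", "chebyshev"]}
best_params, best_score, results = grid_search(X, y, grid)
print(best_params, round(best_score, 4))
```

The same folds are reused for every combination so that the scores are compared on identical splits. After the search, the chosen parameters would normally be used to retrain on the full training set.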

A market basket analysis was carried out (using the apriori algorithm) to see what combinations of benefits were present in the customer base and the results are summarised below in section 4.4 and detailed in the appendix. The results can be used to aid customers in adding benefits to their policies (along with the appropriate checks from a qualified intermediary).
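To illustrate how the Apriori algorithm finds frequent benefit combinations, here is a minimal sketch in plain Python. The benefit names, the sample policies and the 40% minimum support threshold are hypothetical. The thesis's own tooling and thresholds are not stated in the abstract.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(1 for b in baskets if itemset <= b) / n

    frequent = {}
    size = 1
    candidates = {frozenset([item]) for b in baskets for item in b}
    while candidates:
        level = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        size += 1
        # Join step: combine frequent itemsets into candidates one item larger.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, size - 1))
        }
    return frequent


# Made-up benefit sets, one per policy (illustrative only).
policies = [
    {"life_cover", "serious_illness"},
    {"life_cover", "serious_illness", "income_protection"},
    {"life_cover"},
    {"life_cover", "income_protection"},
    {"serious_illness", "hospital_cash"},
]

frequent = apriori(policies, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what makes Apriori efficient: an itemset can only be frequent if all of its subsets are frequent, so most large combinations are never counted.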

A possible future extension of this work is to split the modelling across multiple PCs or run it on a distributed system, so that the training could be run for longer (for example, more decision trees in the Random Forest, more training epochs for the neural networks or more parameters in the grid search on the SVM). Other future work could consist of explicit modelling by product type, which would produce several smaller but better trained models. The input data could be enhanced if data that is on the admin systems but not currently in the extracts were added. This could provide more granular policy, life and transaction history information, which could refine the models.

Date of Award

2025

Full Publication Date

2025

Access Rights

open access

Document Type

Capstone Project

Resource Type

thesis

Included in

Data Science Commons
