Date of Award


Document Type

Capstone Project


MSc in Data Analytics


Muhammad Iqbal


Data analytics tools are becoming increasingly common in the life insurance industry. This research considers two use cases for predictive analytics in a life insurance company based in Ireland. The first case study uses time series models to forecast the seasonality of death claim notifications. The baseline model predicted no seasonal variation in death claim notifications over a calendar year. This reflects the life insurance company’s current approach, which assumes that claims are notified evenly over a calendar year. More accurate forecasting of death claim seasonality would enhance the life insurance company’s cashflow planning and analysis of financial results. The performance of five time series models was compared against the baseline model: a simple historical average model, a classical SARIMA model, the Random Forest Regressor and Prophet machine learning models, and the LSTM deep learning model. The models were trained on both the life insurance company’s historical death claims data and on Irish population deaths data for the 25-74 age cohort over the same observation periods. The results demonstrated that machine learning time series models were generally more effective than the baseline model in forecasting death claim seasonality. They also showed that models trained on either Irish population deaths or the life insurance company’s historical death claims could outperform the baseline model. The best forecaster was Facebook’s Prophet model, trained on the life insurance company’s claims data. Each of the models trained on Irish population deaths data outperformed the baseline model. The SARIMA and LSTM models consistently underperformed the baseline model when trained on death claims data. All models performed better when claims directly related to Covid-19 were removed from the testing data.
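To make the comparison between the flat baseline and a seasonal forecaster concrete, the sketch below contrasts the two simplest models described above: a baseline that spreads the annual total evenly across twelve months, and a historical average model that uses each month's average share of annual claims. It is a minimal illustration, not the project's implementation. The monthly counts are invented, the annual total for the test year is assumed to be known, and all function names are hypothetical.

```python
from statistics import mean

# Hypothetical monthly claim counts (Jan..Dec). These figures are invented
# purely for illustration; they are not data from the project.
train_years = [
    [120, 110, 105, 95, 90, 85, 80, 82, 88, 95, 100, 115],
    [125, 112, 100, 98, 88, 84, 82, 80, 90, 96, 104, 118],
    [118, 108, 102, 94, 92, 86, 78, 84, 86, 98, 102, 112],
]
test_year = [122, 111, 103, 96, 89, 85, 81, 83, 87, 97, 101, 116]


def flat_shares():
    """Baseline: no seasonality, so every month receives 1/12 of the total."""
    return [1 / 12] * 12


def historical_average_shares(years):
    """Average each calendar month's share of its year's total claims."""
    shares_by_year = [[m / sum(year) for m in year] for year in years]
    return [mean(month) for month in zip(*shares_by_year)]


def forecast(shares, annual_total):
    """Allocate an annual total to months according to the given shares."""
    return [share * annual_total for share in shares]


def mean_absolute_error(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Assumption: the annual total is taken as known, so only the seasonal
# pattern (not the overall claim level) is being evaluated.
annual_total = sum(test_year)
seasonal_shares = historical_average_shares(train_years)

flat_mae = mean_absolute_error(test_year, forecast(flat_shares(), annual_total))
seasonal_mae = mean_absolute_error(
    test_year, forecast(seasonal_shares, annual_total)
)

print(f"Flat baseline MAE:      {flat_mae:.2f}")
print(f"Historical average MAE: {seasonal_mae:.2f}")
```

Working with monthly shares rather than raw counts keeps the seasonal pattern separate from year-to-year changes in overall claim volume, which is one plausible way to compare models trained on company claims with models trained on population deaths.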
The second case study uses classification models to predict protection policy lapse behaviour following a policy review. The life insurance company currently has no method of predicting individual policy lapses, so the baseline model assumed that all policies had an equal probability of lapsing. More accurate prediction of policy review lapse outcomes would enhance the life insurance company’s profit forecasting ability. It would also give the company the opportunity to reduce lapse rates at policy review by tailoring alternative options for certain groups of policyholders. The performance of 12 classification models was assessed against the baseline model, including KNN, Naïve Bayes, Support Vector Machine, Decision Tree, Random Forest, Extra Trees, XGBoost, LightGBM, AdaBoost and Multi-Layer Perceptron (MLP). To address class imbalance in the data, 11 rebalancing techniques were assessed. These included cost-sensitive algorithms (Class Weight Balancing), oversampling (Random Oversampling, ADASYN, SMOTE, Borderline SMOTE), undersampling (Random Undersampling, and Near Miss versions 1 to 3) and combinations of oversampling and undersampling (SMOTETomek and SMOTEENN). When combined with rebalancing methods, the classification models outperformed the baseline model in almost every case. However, results varied by train/test split and by evaluation metric. Oversampling models performed best on F1 Score and ROC-AUC, while SMOTEENN and the undersampling models generated the highest levels of Recall. The top F1 Score was generated by the Naïve Bayes model when combined with SMOTE. The MLP model generated the highest ROC-AUC when combined with Borderline SMOTE. The results of both case studies demonstrate that data analytics techniques can enhance a life insurance company’s predictive toolkit.
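Two of the simpler rebalancing ideas named above, class weight balancing and random oversampling, can be sketched in a few lines, together with the Recall and F1 metrics used for evaluation. This is an illustrative sketch under stated assumptions, not the project's code. The function names are hypothetical, the label 1 is assumed to mean "lapsed", and the weight formula is the common "balanced" heuristic of n_samples / (n_classes * class_count).

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency ("balanced" heuristic)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - cnt):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def recall_and_f1(actual, predicted, positive=1):
    """Return (recall, F1) for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Hypothetical toy data: 8 retained policies (0) and 2 lapsed policies (1).
labels = [0] * 8 + [1] * 2
rows = [[i] for i in range(len(labels))]

weights = balanced_class_weights(labels)
resampled_rows, resampled_labels = random_oversample(rows, labels)

print(weights)
print(Counter(resampled_labels))
```

Resampling of this kind is normally applied to the training split only, so that the test split keeps the original class balance and metrics such as Recall and F1 reflect realistic conditions.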
It is recommended that further opportunities to enhance the predictive ability of the time series and classification models be explored.