The thesis examined how effectively synthetic data can improve the predictive performance of churn prediction models trained on imbalanced datasets, particularly in the telecommunications industry. Because imbalanced datasets are a significant obstacle in the telecommunications sector, the study assessed whether adding synthetic data can mitigate the imbalance. To this end, various synthetic data generation methods, including SMOTENC, ADASYN, TVAE, and CTGAN, were applied to a real-world dataset. The goal was to determine the extent to which synthetic data can help overcome data imbalance and enhance the predictive capabilities of classification models. Although no significant improvement in the lift score was achieved, the study yielded insights into the challenges of using synthetically generated data. The research highlighted the importance of a consistent and transparent data-cleaning strategy and the need for customized approaches to synthetic data models. The limitations of the study were also discussed, including the small number of synthetic data models tested and the dependence of synthetic data quality on the quality of the original data. Finally, the thesis offered directions for future research and for the practical application of common synthetic data methods to imbalanced real-world datasets in the telco industry.
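The abstract reports results in terms of the lift score but does not define it. As an illustration only, the sketch below assumes the common "lift at top fraction" definition: the churn rate among the highest-scored customers divided by the overall churn rate. The function name `lift_at_fraction`, the default fraction of 0.1, and the toy data are hypothetical and are not taken from the thesis.

```python
def lift_at_fraction(y_true, scores, fraction=0.1):
    """Lift at the top `fraction` of customers ranked by predicted churn score.

    Assumed definition (not confirmed by the thesis):
        lift = churn rate in the top-scored group / overall churn rate
    """
    if len(y_true) != len(scores) or len(y_true) == 0:
        raise ValueError("y_true and scores must be non-empty and equally long")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # Size of the top-scored group; always keep at least one customer.
    n_top = max(1, int(len(y_true) * fraction))

    # Rank customers from highest to lowest predicted churn score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no churners")
    return top_rate / base_rate


# Toy example: 10 customers, 2 churners (label 1), scores from a hypothetical model.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(lift_at_fraction(labels, model_scores, fraction=0.2))
```

A lift of 1 means the model ranks churners no better than random selection. On an imbalanced dataset the base churn rate is small, so the metric shows how strongly churners are concentrated at the top of the ranking.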
| Field | Value |
|---|---|
| Date of Award | 8 May 2024 |
| Original language | English |
| Awarding Institution | Universidade Católica Portuguesa |
| Supervisor | Nuno Filipe Loureiro Paiva (Supervisor) |
- Synthetic data
- Churn prediction
- Imbalanced datasets
- Telecommunication industry
- SMOTENC
- ADASYN
- TVAE
- CTGAN
- Lift score
- Data quality
- Mestrado em Análise de Dados para Gestão
How can synthetic data generation techniques, enhance the lift accuracy of churn prediction models in imbalanced datasets from the telecommunications sector?
Salii, J. (Student). 8 May 2024
Student thesis: Master's Thesis