Exploring the potential of probabilistic record linkage in healthcare: a study on matching national provider identifier records with social network profiles

  • Florian Jürgen Pullem (Student)

Student thesis: Master's Thesis


In the digital era, a wealth of heterogeneous data is collected globally about entities such as individuals, professionals, and companies. Extracting value from this data requires linking the records that describe the same entity, but the diversity of sources and the absence of a shared unique identifier complicate this process. This study addresses the challenge by exploring the potential of probabilistic record linkage techniques to associate entries in the National Provider Identifier (NPI) database with physicians' social network profiles. The research was conducted in collaboration with Alpha Sophia, a startup aiming to build a leading commercial intelligence platform for the US healthcare market. The thesis proposes a strategy for generating labeled data that combines deterministic record linkage with noise injection. The labeled data makes it possible to train supervised learning models, such as random forest, and to compare them against the benchmark, the Fellegi-Sunter model. The primary finding is that the supervised models outperform the benchmark, which demonstrates the advantage of the proposed approach. Over 142 thousand new matches were identified while maintaining a minimal false positive rate, an increase of approximately 64% in the total number of linked records relative to the matches discovered through traditional methods. Moreover, cost savings exceeding 68 thousand euros were realized. The methodologies and models presented can be tailored to other linkage challenges that Alpha Sophia and other companies encounter, and it is recommended to apply the outlined techniques in other contexts and to other datasets in the future.
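The labeling strategy named in the abstract (deterministic linkage followed by noise injection) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the thesis: the deterministic rule (exact agreement on all compared fields after lowercasing), the typo-style noise model, the string similarity measure, and the names inject_noise, similarity, and make_labeled_pairs.

```python
import random
from difflib import SequenceMatcher


def inject_noise(value, rng, p=0.5):
    """With probability p, apply one typo-style edit (hypothetical noise model)."""
    if not value or rng.random() > p:
        return value
    i = rng.randrange(len(value))
    op = rng.choice(["delete", "swap", "replace"])
    if op == "delete":
        return value[:i] + value[i + 1:]
    if op == "swap" and i + 1 < len(value):
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]
    # "replace", and also the fallback when a swap is requested at the last character
    return value[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[i + 1:]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def make_labeled_pairs(records, profiles, fields, seed=0):
    """Build (feature_vector, label) rows for a supervised linkage model.

    Assumed deterministic rule: a record and a profile match when all
    `fields` agree exactly after lowercasing.
    """
    rng = random.Random(seed)

    def key(row):
        return tuple(row[f].lower() for f in fields)

    by_key = {key(p): p for p in profiles}
    matched = [(r, by_key[key(r)]) for r in records if key(r) in by_key]

    rows = []
    for rec, prof in matched:
        # Positive example: corrupt the known match so it no longer agrees exactly.
        noisy = {f: inject_noise(prof[f], rng) for f in fields}
        rows.append(([similarity(rec[f], noisy[f]) for f in fields], 1))
        # Negative example: pair the record with a different matched profile.
        others = [p for _, p in matched if p is not prof]
        if others:
            other = rng.choice(others)
            rows.append(([similarity(rec[f], other[f]) for f in fields], 0))
    return rows
```

The resulting feature vectors and labels could then be passed to any supervised classifier, for example a random forest or a logistic regression. Corrupting the known matches is what lets such a model learn to recognize pairs that a purely exact rule would miss.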
Date of Award: 23 Jan 2024
Original language: English
Awarding Institution
  • Universidade Católica Portuguesa
Supervisor: Nicolò Bertani


  • Probabilistic record linkage
  • Noise injection
  • National Provider Identifier (NPI)
  • Fellegi-Sunter model
  • Logistic regression
  • Random forest


  • Mestrado em Análise de Dados para Gestão
