A trainable model to assess the accuracy of probabilistic record linkage

Pita, Robespierre; Mendonça, Everton; Reis, Sandra; Barreto, Marcos; Denaxas, Spiros

Please use this identifier to cite or link to this item: https://repositorio.ufba.br/handle/ri/24738

Type:	Conference paper
Title :	A trainable model to assess the accuracy of probabilistic record linkage
Author :	Pita, Robespierre; Mendonça, Everton; Reis, Sandra; Barreto, Marcos; Denaxas, Spiros
Abstract :	Record linkage (RL) is the process of identifying and linking data that relate to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common, uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment, the process of manually verifying and validating matched pairs to further refine linkage parameters and increase overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer's intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from large Brazilian socioeconomic and public health care data sources. These models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from a 10-fold cross-validation method. Results show that logistic regression outperforms the other classifiers and enables the creation of a generalized, highly accurate model to validate linkage results.
Keywords :	Data linkage; Machine learning
Publisher country:	Brazil
Publisher :	Springer, Cham
Citation :	Pita R., Mendonça E., Reis S., Barreto M., Denaxas S. (2017) A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage. In: Bellatreche L., Chakravarthy S. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2017. Lecture Notes in Computer Science, vol 10440. Springer, Cham
Rights:	Open Access
URI :	http://repositorio.ufba.br/ri/handle/ri/24738
Issue date :	3-Aug-2017
Appears in collections:	Trabalho Apresentado em Evento (PGCOMP)
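
The evaluation measures named in the abstract (sensitivity, specificity and positive predictive value, collected over 10-fold cross-validation) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' pipeline, which this record does not show; the function names `linkage_metrics` and `k_fold_indices` are hypothetical, the labels are assumed to be 1 for a true match and 0 for a non-match, and the folds are assumed to be contiguous and unshuffled.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def linkage_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute sensitivity, specificity and PPV for binary labels.

    Assumption: 1 marks a matched pair, 0 a non-matched pair.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num: int, den: int) -> float:
        # Return NaN when a measure is undefined (empty denominator).
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true matches recovered
        "specificity": ratio(tn, tn + fp),  # share of non-matches rejected
        "ppv": ratio(tp, tp + fp),          # share of predicted matches that are correct
    }


def k_fold_indices(n: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous folds over n items.

    Assumption: no shuffling or stratification; the first n % k folds
    receive one extra item so that every index is tested exactly once.
    """
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

In a sketch like this, a classifier would be fitted on each `train` split, its predictions on the matching `test` split passed to `linkage_metrics`, and the per-fold values averaged. In practice the data would usually be shuffled or stratified before splitting, since match and non-match pairs are rarely balanced.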

Files in this item:

File	Description	Size	Format
DaWaK2017_vFinal_104400016.pdf		1.31 MB	Adobe PDF
