Analysis of a Probabilistic Record Linkage Technique without Human Review

From Clinfowiki

Article by: Grannis, S. J., Overhage, J. M., Hui, S., & McDonald, C. J. (2003)


Record linkage is the process of combining information about an individual, family, or entity from two or more databases. One method is probabilistic linkage without human intervention. With this method, an algorithm generates a match likelihood score for each pair of records and compares it to a predetermined threshold: a pair scoring above the threshold is treated as a link, and a pair scoring below it as a non-link. [1]
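The threshold rule can be illustrated with a minimal Python sketch. The function name `classify_pairs`, the example scores, and the threshold value are hypothetical and do not come from the article. The sketch also assumes that a score exactly equal to the threshold counts as a non-link, a case the summary does not specify.

```python
from typing import Iterable, List


def classify_pairs(scores: Iterable[float], threshold: float) -> List[str]:
    """Label each match likelihood score as 'link' or 'non-link'.

    Assumption: a score equal to the threshold is treated as a non-link.
    """
    return ["link" if score > threshold else "non-link" for score in scores]


# Hypothetical scores for four record pairs and a hypothetical threshold.
example_scores = [12.4, -3.1, 7.9, 0.5]
labels = classify_pairs(example_scores, threshold=5.0)
```

Because no pair is routed to a clerical-review band, every pair receives a final label from the score alone, which is what makes the approach usable without human review.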


The authors compared the performance of a deterministic method (from a previous study) with that of an unsupervised probabilistic method, using the same gold-standard datasets for two hospital registries. In this study, the authors generated a match likelihood score for each record pair using the Fellegi-Sunter model, which sums the component weights of each identifier in the record pair. Each pair was then labeled as linked or non-linked. To avoid human review, the authors estimated the linkage parameters with the Expectation Maximization (EM) algorithm. [2]
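The following Python sketch shows one common textbook formulation of these two steps; it is not the authors' implementation. It assumes binary agree/disagree comparisons and conditional independence between identifiers. The names `em_estimate` and `match_score`, the starting values, and the toy agreement patterns are all hypothetical. Here m is the probability that an identifier agrees given a true link, u is the probability that it agrees given a non-link, and the component weights are log2(m/u) for agreement and log2((1-m)/(1-u)) for disagreement.

```python
import math
from typing import List, Sequence, Tuple

Pattern = Tuple[int, ...]  # 1 = identifier agrees, 0 = identifier disagrees
EPS = 1e-6  # keeps probabilities away from 0 and 1 so the logs stay finite


def clamp(x: float) -> float:
    return min(max(x, EPS), 1.0 - EPS)


def em_estimate(patterns: Sequence[Pattern], n_iter: int = 50, p: float = 0.1,
                m_init: float = 0.9, u_init: float = 0.1
                ) -> Tuple[float, List[float], List[float]]:
    """Estimate p (share of true links), m and u without labeled pairs."""
    n_fields = len(patterns[0])
    m = [m_init] * n_fields
    u = [u_init] * n_fields
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true link.
        g = []
        for gamma in patterns:
            pm, pu = p, 1.0 - p
            for i, agrees in enumerate(gamma):
                pm *= m[i] if agrees else 1.0 - m[i]
                pu *= u[i] if agrees else 1.0 - u[i]
            g.append(pm / (pm + pu))
        # M-step: re-estimate the parameters from the soft assignments.
        total_link = max(sum(g), EPS)
        total_non = max(len(g) - sum(g), EPS)
        for i in range(n_fields):
            agree_link = sum(gj for gj, gamma in zip(g, patterns) if gamma[i])
            agree_non = sum(1.0 - gj for gj, gamma in zip(g, patterns) if gamma[i])
            m[i] = clamp(agree_link / total_link)
            u[i] = clamp(agree_non / total_non)
        p = clamp(sum(g) / len(g))
    return p, m, u


def match_score(gamma: Pattern, m: Sequence[float], u: Sequence[float]) -> float:
    """Sum the agreement or disagreement weight of each identifier."""
    score = 0.0
    for agrees, mi, ui in zip(gamma, m, u):
        if agrees:
            score += math.log2(mi / ui)
        else:
            score += math.log2((1.0 - mi) / (1.0 - ui))
    return score


# Hypothetical agreement patterns over three identifiers (not study data).
toy_pairs = (
    [(1, 1, 1)] * 24 + [(1, 1, 0)] * 3 + [(1, 0, 1)] * 2 + [(0, 1, 1)] * 1
    + [(0, 0, 0)] * 240 + [(1, 0, 0)] * 20 + [(0, 1, 0)] * 20 + [(0, 0, 1)] * 20
)
p_hat, m_hat, u_hat = em_estimate(toy_pairs)
full_agreement = match_score((1, 1, 1), m_hat, u_hat)
full_disagreement = match_score((0, 0, 0), m_hat, u_hat)
```

In this sketch the EM loop never sees a manually labeled pair. It only needs the observed agreement patterns, which is why such an estimator can stand in for manual review when setting the weights.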


For registry A, the authors reported 99.98% and 99.80% true link and identifier agreement for manual review and the EM estimator, respectively. For registry B, the corresponding figures were 99.99% and 99.89%. The authors also reported that the probabilistic method improved on the deterministic method: sensitivity rose by about 6 to 7 percent, with a minimal decrease in specificity.
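For readers less familiar with these measures, the short Python sketch below shows how sensitivity and specificity are computed from linkage outcomes scored against a gold standard. The function names and the counts are hypothetical and are not taken from the study.

```python
def sensitivity(true_links_found: int, true_links_missed: int) -> float:
    """Share of gold-standard links that the method also linked."""
    return true_links_found / (true_links_found + true_links_missed)


def specificity(non_links_rejected: int, false_links: int) -> float:
    """Share of gold-standard non-links that the method left unlinked."""
    return non_links_rejected / (non_links_rejected + false_links)


# Hypothetical counts for illustration only (not results from the article).
sens = sensitivity(true_links_found=90, true_links_missed=10)
spec = specificity(non_links_rejected=950, false_links=50)
```

A method can raise sensitivity by linking more aggressively, but usually at some cost in specificity, which is the trade-off the reported comparison describes.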


Where human intervention in record linkage is not practical or possible, the EM algorithm accurately estimated the linkage parameters.

Remarks about the article

The methodology used in this study is limited to small datasets. It is also limited in that the authors did not take into consideration minor spelling variations and typographical errors in the data. It would also have been helpful for the authors to provide a website where reviewers and critics could run the algorithm on sample datasets and test the reported accuracy.

In addition, this article was published in 2003, when it was more common for several departments in a hospital to each assign their own unique patient identifiers. In Radiology, for example, an "imaging number" was assigned to each patient in addition to their medical record number. The effort to make sure each patient has only one record containing all of their history remains as front and center today as it was in 2003.

Related Topics

Master patient index

Performance of probabilistic method to detect duplicate individual case safety reports

Matching identifiers in electronic health records: implications for duplicate records and patient safety

Improving record linkage performance in the presence of missing linkage data


  1. Grannis, S. J., Overhage, J. M., Hui, S., & McDonald, C. J. (2003). Analysis of a Probabilistic Record Linkage Technique without Human Review.
  2. Borman, S. (2009, January 9). The Expectation Maximization Algorithm -- A short tutorial. Retrieved October 22, 2015, from