Improving record linkage performance in the presence of missing linkage data

From Clinfowiki

Article by: Ong, T. C., Mannino, M. V., Schilling, L. M., & Kahn, M. G. (2014)


In the absence of accurate and universal patient identifiers, record linkage methods use non-unique fields to link two or more records belonging to the same individual. These quasi-identifiers (e.g., date of birth, last name, place of birth) are fields that, when combined, may uniquely identify an individual. In the healthcare setting, such identifiers may be missing for a multitude of reasons, making record linkage very difficult. In the relational databases used to store medical records, records are linked by a primary key that must be unique and must not be missing, but missing data are a fact of life in healthcare research. [1]
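The limitation above can be illustrated with a minimal sketch (not from the article; field names and records are hypothetical): treating a combination of quasi-identifiers as a primary-key-style composite key fails as soon as any one field is missing.

```python
# Sketch: exact linkage on a composite key of quasi-identifiers,
# and how a single missing field makes the key unusable.

def composite_key(record, fields):
    """Build a linkage key from quasi-identifier fields; None if any field is missing."""
    values = [record.get(f) for f in fields]
    if any(v is None for v in values):
        return None  # missing data: no usable composite key
    return tuple(values)

QUASI_IDS = ["last_name", "date_of_birth", "place_of_birth"]

a = {"last_name": "Smith", "date_of_birth": "1980-01-02", "place_of_birth": "Denver"}
b = {"last_name": "Smith", "date_of_birth": "1980-01-02", "place_of_birth": None}

# Same person, but the missing place_of_birth in record b prevents an exact link.
print(composite_key(a, QUASI_IDS))  # -> ('Smith', '1980-01-02', 'Denver')
print(composite_key(b, QUASI_IDS))  # -> None
```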


There are two main approaches to matching records using identifiers: deterministic and probabilistic. Deterministic methods link records based on exact agreement or disagreement of a combination of quasi-identifiers; they are unable to match records containing typographical or phonetic errors. Probabilistic methods calculate a likelihood score to determine whether two records refer to the same person. The most common is the Fellegi-Sunter (FS) method, which classifies each quasi-identifier in a record pair as a match or non-match and combines the assigned matching weights into an overall score. [1]
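A minimal sketch of the FS scoring idea, with made-up m/u probabilities (not the article's values): each field contributes an agreement weight log2(m/u) or a disagreement weight log2((1-m)/(1-u)), and the summed score is compared against a threshold.

```python
import math

# Hypothetical field-level probabilities:
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
M_U = {
    "last_name":      (0.95, 0.01),
    "date_of_birth":  (0.98, 0.003),
    "place_of_birth": (0.90, 0.05),
}

def fs_score(rec_a, rec_b):
    """Sum per-field agreement/disagreement weights (Fellegi-Sunter style)."""
    score = 0.0
    for field, (m, u) in M_U.items():
        if rec_a[field] == rec_b[field]:
            score += math.log2(m / u)          # agreement weight (positive)
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return score

a = {"last_name": "Smith", "date_of_birth": "1980-01-02", "place_of_birth": "Denver"}
b = {"last_name": "Smith", "date_of_birth": "1980-01-02", "place_of_birth": "Boulder"}

print(fs_score(a, b))  # compared against a chosen threshold to classify the pair
```

Note that unlike a deterministic rule, the pair above can still score as a likely match despite the disagreeing place of birth, because the two agreeing fields carry large positive weights.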


The authors extended the distance algorithms and FS scoring methods of the open-source Fine-grained Records Integration and Linkage (FRIL) software to develop three methods: Weight Redistribution, Distance Imputation, and Linkage Expansion. Weight Redistribution removes fields with missing data and redistributes their weights across the remaining available linkage fields. Distance Imputation imputes the distance for missing data fields. Linkage Expansion adds previously non-linkage fields to the linkage field set to compensate for the missing information in linkage fields. To test the methodology, the authors created two paired datasets, each initially containing 5,000 records with 9 fields of simulated values at varying corruption rates. [1]
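The Weight Redistribution idea can be sketched as follows. This is a hypothetical illustration, not the article's implementation: fields missing in either record are dropped, and the remaining fields' weights are rescaled so the maximum attainable score is preserved.

```python
# Hypothetical field weights (illustrative values only).
WEIGHTS = {"last_name": 6.6, "date_of_birth": 8.4, "place_of_birth": 4.2}

def redistribute(rec_a, rec_b, weights):
    """Drop fields missing in either record; rescale the rest to the original total."""
    present = [f for f in weights
               if rec_a.get(f) is not None and rec_b.get(f) is not None]
    if not present:
        return {}
    total = sum(weights.values())
    avail = sum(weights[f] for f in present)
    # Scale remaining weights so they sum to the original total weight.
    return {f: weights[f] * total / avail for f in present}

a = {"last_name": "Smith", "date_of_birth": "1980-01-02", "place_of_birth": "Denver"}
b = {"last_name": "Smith", "date_of_birth": "1980-01-02", "place_of_birth": None}

print(redistribute(a, b, WEIGHTS))  # place_of_birth dropped, others scaled up
```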


The methods developed in this research outperformed previous record linkage methods. In datasets with a low corruption rate, sensitivity ranged from 0.895 to 0.992 and positive predictive value (PPV) ranged from 0.865 to 1.00. The authors also found that increased corruption rates lead to decreased sensitivity for all methods. [1]
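For readers unfamiliar with these metrics, sensitivity and PPV are computed from link-classification counts as below (the counts here are made up for illustration, not the article's data):

```python
def sensitivity(tp, fn):
    """True links found / all true links present."""
    return tp / (tp + fn)

def ppv(tp, fp):
    """True links found / all links the method declared."""
    return tp / (tp + fp)

tp, fp, fn = 992, 8, 8  # hypothetical counts
print(round(sensitivity(tp, fn), 3), round(ppv(tp, fp), 3))  # -> 0.992 0.992
```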


Depending on the performance goal of the record linkage process, the three new methods performed well on large datasets with missing values in some fields, but none achieved 100% sensitivity with 100% specificity. [1]

Remarks about the article

The methodology used in this study addresses many of the issues raised in previously reviewed similar studies. First, it can be used with missing values in some fields. Second, the methods are a hybrid of deterministic and probabilistic approaches. The authors provided access to the source code, datasets, and documentation used in the research, which is helpful for reproducing their results or applying the methods to other similar datasets. [1]

Related Topics

Master patient index

Performance of probabilistic method to detect duplicate individual case safety reports

Matching identifiers in electronic health records: implications for duplicate records and patient safety


  1. Ong, T. C., Mannino, M. V., Schilling, L. M., & Kahn, M. G. (2014). Improving record linkage performance in the presence of missing linkage data.