Improving record linkage performance in the presence of missing linkage data


Revision as of 22:49, 21 October 2015

Article by: Ong, T. C., Mannino, M. V., Schilling, L. M., & Kahn, M. G. (2014)


Introduction

In the absence of accurate and universal patient identifiers, record linkage methods use non-unique fields to link two or more records belonging to the same individual. These quasi-identifiers (e.g. date of birth, last name, place of birth) are fields that, when combined, may uniquely identify an individual. In the healthcare setting, such identifiers may be missing for many reasons, which makes record linkage very difficult. In relational databases used to store medical records, two or more records are linked using a primary key that must be unique and should not be missing, but missing data are a fact of life in healthcare research. [1]

Background

There are two main approaches to matching two or more records using identifiers: deterministic and probabilistic. Deterministic methods link records based on exact agreement or disagreement on a combination of quasi-identifiers, so they are unable to match records that contain typographical or phonetic errors. Probabilistic methods calculate a likelihood score to determine whether two records refer to the same person. The most common is the Fellegi-Sunter (FS) method, which treats each quasi-identifier in a record pair as either a match or a non-match and scores the pair using the assigned matching weights. [1]
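The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration of textbook Fellegi-Sunter scoring, not code from the article: the field names, the m- and u-probabilities, and the helper names `deterministic_match` and `fs_score` are all assumptions chosen for the example.

```python
import math

# Hypothetical m- and u-probabilities per field (illustrative values only):
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "date_of_birth": (0.97, 0.003),
    "place_of_birth": (0.90, 0.05),
}


def deterministic_match(rec_a, rec_b, fields):
    """Link only if every listed field agrees exactly."""
    return all(rec_a[f] == rec_b[f] for f in fields)


def fs_score(rec_a, rec_b, probs=FIELD_PROBS):
    """Sum the log2 agreement or disagreement weight of each field."""
    score = 0.0
    for field, (m, u) in probs.items():
        if rec_a[field] == rec_b[field]:
            score += math.log2(m / u)              # agreement weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight
    return score


# Two records that differ only by a typographical error in the last name.
a = {"last_name": "Smith", "date_of_birth": "1980-02-01", "place_of_birth": "Denver"}
b = {"last_name": "Smyth", "date_of_birth": "1980-02-01", "place_of_birth": "Denver"}

exact = deterministic_match(a, b, FIELD_PROBS)  # False: one field disagrees
score = fs_score(a, b)                          # still positive overall
```

With these assumed probabilities the deterministic rule rejects the pair, while the summed weights remain positive because two of the three fields agree. Whether the pair is finally linked depends on a score threshold, which is not shown here.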

Method

The authors used the open-source Fine-grained Records Integration and Linkage (FRIL) software, which extends the FS scoring method with distance algorithms, to develop three methods: Weight Redistribution, Distance Imputation and Linkage Expansion. Weight Redistribution removes fields with missing data from the comparison and redistributes their weights among the remaining available linkage fields. Distance Imputation imputes the distance between fields whose data are missing. Linkage Expansion adds fields that were previously considered non-linkage fields to the linkage field set to compensate for the information missing from the linkage fields. To test this methodology, the authors created two paired datasets, each initially containing 5000 records with 9 fields of simulated values, and applied varying corruption rates. [1]
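The three strategies can be sketched roughly as follows. This is an illustrative toy, not the authors' FRIL implementation: the field names, the weights, the exact-match similarity function, the proportional rescaling rule, and the imputed similarity of 0.5 are all assumptions made for the example.

```python
# Hypothetical field weights that sum to 1.0 (illustrative values only).
WEIGHTS = {"last_name": 0.4, "date_of_birth": 0.4, "zip_code": 0.2}


def field_similarity(a, b):
    """Toy similarity: 1.0 for exact agreement, 0.0 otherwise.
    Returns None when either value is missing."""
    if a is None or b is None:
        return None
    return 1.0 if a == b else 0.0


def score_weight_redistribution(rec_a, rec_b, weights=WEIGHTS):
    """Drop fields with missing values and rescale the remaining weights."""
    sims = {f: field_similarity(rec_a.get(f), rec_b.get(f)) for f in weights}
    available = {f: s for f, s in sims.items() if s is not None}
    total = sum(weights[f] for f in available)
    if total == 0:
        return 0.0  # nothing left to compare
    return sum(weights[f] / total * s for f, s in available.items())


def score_distance_imputation(rec_a, rec_b, weights=WEIGHTS, imputed=0.5):
    """Keep all weights; substitute an assumed similarity for missing fields."""
    score = 0.0
    for f, w in weights.items():
        s = field_similarity(rec_a.get(f), rec_b.get(f))
        score += w * (imputed if s is None else s)
    return score


def expand_linkage_fields(weights, extra_field, extra_weight):
    """Add a former non-linkage field, then renormalize the weights."""
    expanded = dict(weights)
    expanded[extra_field] = extra_weight
    total = sum(expanded.values())
    return {f: w / total for f, w in expanded.items()}


# The zip code is missing in record a.
a = {"last_name": "Smith", "date_of_birth": "1980-02-01", "zip_code": None}
b = {"last_name": "Smith", "date_of_birth": "1980-02-01", "zip_code": "80202"}

wr = score_weight_redistribution(a, b)
di = score_distance_imputation(a, b)
expanded = expand_linkage_fields(WEIGHTS, "phone", 0.25)
```

In this toy example the redistributed score is 1.0 because the missing field is ignored entirely, while the imputed score is about 0.9 because the missing field contributes only half of its weight. The difference shows why the choice of strategy shifts the balance between sensitivity and positive predictive value.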

Results

The methods developed in this research performed better than previous record linkage methods. In data sets with a low corruption rate, sensitivity ranged from 0.895 to 0.992 and positive predictive value (PPV) ranged from 0.865 to 1.00. The authors also found that increased corruption rates led to decreased sensitivity in all methods. [1]
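For readers unfamiliar with the two metrics, the sketch below shows how sensitivity and PPV are computed from a set of predicted links and a set of true links. The record identifiers and the helper name `linkage_metrics` are made up for illustration and do not come from the study's data.

```python
def linkage_metrics(predicted_links, true_links):
    """Return (sensitivity, PPV) for sets of (id_a, id_b) record pairs."""
    tp = len(predicted_links & true_links)  # correctly linked pairs
    fp = len(predicted_links - true_links)  # linked, but not a true match
    fn = len(true_links - predicted_links)  # true matches that were missed
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sens, ppv


# Hypothetical example: four true matches, three found, one wrong link.
true_links = {(1, 101), (2, 102), (3, 103), (4, 104)}
predicted_links = {(1, 101), (2, 102), (3, 103), (5, 104)}

sens, ppv = linkage_metrics(predicted_links, true_links)
```

Here three of the four true matches are found and three of the four predicted links are correct, so both metrics equal 0.75.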

Conclusion

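Depending on the performance goal of the record linkage process, the three new methods performed well on large data sets with missing values in some fields, but none achieved 100% sensitivity together with 100% specificity. [1]

Remarks about the article

The methodology used in this study addresses many of the issues raised in the previously reviewed similar studies. First, it can be used when some fields have missing values. Second, the methods are a hybrid of deterministic and probabilistic methods. The authors provided access to the source code, datasets, and documentation used in the research (https://github.com/recordlinkagerep/missingdataproject). This is helpful for reproducing their results with their datasets or for applying the methods to other similar data sets. [1]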

Related Topics

Master patient index

Performance of probabilistic method to detect duplicate individual case safety reports

Matching identifiers in electronic health records: implications for duplicate records and patient safety

Reference

  1. Ong, 2014. Improving record linkage performance in the presence of missing linkage data. http://www.j-biomed-inform.com/article/S1532-0464(14)00019-7/pdf/