Health Record De-identification and Anonymization

The secondary use of health data is central to biomedical research. For instance, clinical trial datasets made available to the wider scientific community after the end of a trial contain considerable amounts of data not analyzed as part of the published results and can be used for meta-analyses[1]. Given how time- and resource-intensive clinical data collection is, leaving such data underused is wasteful. In another example, there is an ever-growing need for health records to train deep learning health AI models. However, data availability is the first and foremost barrier to developing higher-performing models that require large, annotated training datasets[2]. Health data is not only expensive, proprietary, and siloed, but also heavily regulated to protect patient privacy. One way to circumvent the data availability issue is to generate high-fidelity synthetic electronic health records, but the scope of the latest research is limited to structured data[3]. Unstructured data is rich with insights, captures granular details, and comprises the majority (around 80%) of EHR data[4]. De-identification or anonymization of health data is therefore a key and rapidly evolving sector of health data analytics. Conversely, enabling and supporting biomedical research has been cited most frequently by researchers in the field as the purpose of de-identification and anonymization[5].

== De-identification versus anonymization ==

The terms “de-identification” and “anonymization” are often vaguely and inconsistently defined in the research literature[https://www.jmir.org/2019/5/e13484/]. Proponents of differentiating the two terms argue that de-identification refers to the removal or replacement of personal identifiers so that reestablishing a link between the individual and the data is difficult, whereas anonymization refers to the irreversible removal of that link so that reestablishing it is virtually impossible. While the de-identification required to comply with regulations such as HIPAA is relatively easy to achieve by masking well-defined categories of data, it also carries a higher risk of re-identification[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6502465/]. Anonymization, on the other hand, requires further data manipulation and is shaped by the regional laws of the jurisdiction where it takes place, which tend to be vague in their definitions[https://www.jmir.org/2019/5/e13484/].
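
The contrast can be made concrete with a small sketch. The Python example below is purely illustrative (the record fields, the key-table layout, and the surrogate-ID scheme are invented for this example, not drawn from any cited system): a de-identified record keeps a surrogate key that could still be re-linked through a separately held key table, while an anonymized record has that link destroyed.

<syntaxhighlight lang="python">
import secrets

# A toy patient record with direct identifiers (illustrative field names only).
record = {"name": "Jane Doe", "mrn": "123-45-678", "dx": "Type 2 diabetes"}

# --- De-identification: replace identifiers with a surrogate key. ---
# The mapping from surrogate back to the original identifiers is kept in a
# separate, access-controlled key table, so re-linkage remains possible
# (and therefore the re-identification risk is not zero).
key_table = {}                      # held separately from the released data
surrogate = secrets.token_hex(8)    # random surrogate patient ID
key_table[surrogate] = {"name": record["name"], "mrn": record["mrn"]}
deidentified = {"patient_id": surrogate, "dx": record["dx"]}

# --- Anonymization: irreversibly remove the link. ---
# No key table is retained (and, in practice, quasi-identifiers would also be
# generalized or suppressed), so the link to the individual cannot be
# re-established from the released data alone.
anonymized = {"dx": record["dx"]}

print(deidentified)  # {'patient_id': '…', 'dx': 'Type 2 diabetes'}
print(anonymized)    # {'dx': 'Type 2 diabetes'}
</syntaxhighlight>

The point of the contrast is only that de-identification leaves a protected path back to the individual, while anonymization is meant to destroy that path entirely; real-world anonymization additionally manipulates quasi-identifiers such as dates and geographic codes, which is the “further data manipulation” mentioned above.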

== Two strategies of de-identification ==

There are two general approaches to locating patient identifiers: lexical, pattern-based systems and machine learning-based systems, each with its own advantages and disadvantages. Studies evaluating pattern-based systems have reported good performance (especially precision), but at the cost of months of work by domain experts and limited generalizability[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6502465/]. While this approach requires little to no annotated training data and can easily be modified to improve performance by adding rules, terms, or expressions, such complex rules are tailored to a particular dataset and rarely carry over to different datasets[https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-10-70]. More recently, various machine learning algorithms combined with natural language processing tasks such as Named-Entity Recognition (NER) and Part-of-Speech (POS) tagging have gained traction[https://www.nature.com/articles/s41598-020-75544-1]. The main advantages of machine learning de-identification methods are that they can be used “out of the box” with minimal development time and with little knowledge of the PHI patterns specific to a given document type or domain. The main disadvantage of machine learning methods is the need for large amounts of annotated training data, although software tools can support this process[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6502465/]. Another disadvantage is that, owing to the non-traceable nature of these models, it is difficult to know precisely why the application committed an error and how it can be debugged[https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-10-70].
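
As an illustration of the lexical, pattern-based approach, the minimal Python sketch below masks a few well-defined PHI categories (dates, phone numbers, medical record numbers) with regular expressions. The patterns and the sample note are invented for this example and would need to be tuned to a real dataset, which is exactly the generalizability limitation described above.

<syntaxhighlight lang="python">
import re

# Illustrative patterns for a few well-defined PHI categories.
# Real rule-based systems use far larger dictionaries plus context rules.
PHI_PATTERNS = {
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN":   re.compile(r"\bMRN[:# ]?\d{6,10}\b", re.IGNORECASE),
}

def deidentify(text: str) -> str:
    """Replace each matched identifier with its category placeholder."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = ("Seen on 03/14/2023, MRN 00123456. "
        "Call 555-123-4567 to schedule follow-up.")
print(deidentify(note))
# Seen on [DATE], [MRN]. Call [PHONE] to schedule follow-up.
</syntaxhighlight>

A machine learning pipeline would instead run a named-entity recognizer (for example, an off-the-shelf or fine-tuned NER model) over the note and mask the predicted PHI spans; the trade-off, as noted above, is the need for annotated training data.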


== Balance between data utility and data anonymity ==