Health Record De-identification and Anonymization

The secondary use of health data is central to biomedical research. For instance, clinical trial datasets made available to the wider scientific community after the end of a trial contain considerable amounts of data not analyzed as part of the published results and can be used for meta-analyses[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9373195/]. Given how time- and resource-intensive clinical data collection is, not using the data fully is wasteful. In another example, there is an ever-growing need for health records to train deep learning health AI models. However, data availability is the foremost barrier to developing higher-performing models, which require large, annotated training datasets[https://pubmed.ncbi.nlm.nih.gov/34405854/]. Health data is not only expensive, proprietary, and siloed, but also heavily regulated to protect patient privacy. One way to circumvent the data availability issue is to generate high-fidelity synthetic electronic health records, but the scope of the latest research is limited to structured data[https://www.nature.com/articles/s41746-023-00888-7]. Unstructured data is rich with insights, captures granular details, and comprises the majority (around 80%) of EHR data[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10001457/]. De-identification and anonymization of health data are therefore a key and rapidly evolving sector of health data analytics. Indeed, enabling and supporting biomedical research is the purpose of de-identification and anonymization most frequently cited by researchers in the field[https://www.jmir.org/2019/5/e13484/].
 
== De-identification versus anonymization ==
The terms “de-identification” and “anonymization” are often vaguely and inconsistently defined in the research literature[https://www.jmir.org/2019/5/e13484/]. Proponents of differentiating the two terms argue that de-identification refers to the removal or replacement of personal identifiers so that reestablishing a link between the individual and the data is difficult, whereas anonymization refers to the irreversible removal of that link so that reestablishing it is virtually impossible. The de-identification required by regulations such as HIPAA is relatively easy to achieve by masking well-defined categories of data, but it carries a higher risk of re-identification[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6502465/]. Anonymization, on the other hand, requires further data manipulation and is shaped by the laws of the jurisdiction in which it takes place, which tend to define it vaguely[https://www.jmir.org/2019/5/e13484/].
== Two strategies of de-identification ==
There are two general approaches to locating patient identifiers: lexical, pattern-based systems and machine learning-based systems, each with advantages and disadvantages. Studies evaluating pattern-based systems have reported good performance (especially precision), but at the cost of months of work by domain experts and limited generalizability[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6502465/]. This method requires little to no annotated training data and can easily be modified to improve performance by adding rules, terms, or expressions; however, such complex rules are tailored to a particular dataset and rarely carry over to different datasets[https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-10-70]. More recently, machine learning algorithms combined with natural language processing techniques such as Named-Entity Recognition (NER) and Part-of-Speech (POS) tagging have gained traction[https://www.nature.com/articles/s41598-020-75544-1]. The main advantages of machine learning de-identification methods are that they can be used “out of the box” with minimal development time and with little knowledge of the PHI patterns specific to a document type or domain. The main disadvantage of machine learning methods is the need for large amounts of annotated training data, although software tools can ease this process[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6502465/]. Another disadvantage is that, because such models are not readily traceable, it is difficult to know precisely why the application committed an error and how it can be debugged[https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-10-70].
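For illustration, below is a minimal sketch of the pattern-based approach in Python; the regular expressions, tag names, and example note are hypothetical and not drawn from any cited system. A real system would need far more exhaustive, dataset-specific rules, which is exactly the maintenance burden noted above.

<pre>
import re

# Illustrative PHI patterns only; production systems require many more
# rules, tuned by domain experts to the target document set.
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace every matched identifier with its category placeholder."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Pt seen 03/14/2024, MRN: 00123456. Call 503-555-0199 to follow up."
print(scrub(note))
# -> Pt seen [DATE], [MRN]. Call [PHONE] to follow up.
</pre>

An NER-based machine learning system would replace the regular-expression pass with a model that labels identifier spans (names, dates, locations, and so on) learned from annotated notes, trading the hand-written rules for annotated training data.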
== Balance between data utility and data anonymity ==
Data utility and data anonymity exist on a continuum. Several studies have shown that sensitive information cannot be thoroughly anonymized to suppress disclosure risk while still retaining full data utility; anonymization, by definition, corrupts data and results in some data loss, a phenomenon dubbed the curse of anonymization[https://medinform.jmir.org/2021/10/e29871/]. One of the more common privacy models is k-anonymity[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9815524/], a computational technique that aims to prevent the re-identification of individuals in a dataset by ensuring that each record is indistinguishable from at least k-1 other records[https://www.worldscientific.com/doi/abs/10.1142/S021848850200165X]. Improvements on the k-anonymity model, as well as numerous other privacy models, have been proposed[https://medinform.jmir.org/2021/10/e29871/]. Commonly used anonymization techniques include the following[https://medinform.jmir.org/2021/10/e29871/] (a small worked example follows the list):
*Perturbation: modifying the original data in ways that do not significantly alter its statistical properties. Examples include microaggregation, data swapping, rank swapping, postrandomization, adding noise, and resampling
*Generalization: reducing the specificity of the data (e.g., dermatologist -> physician, Los Angeles -> California)
*Suppression: replacing the observed categorical values of one or more variables with a missing-value marker such as “NA” or “–”
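As a concrete illustration of how generalization interacts with the k-anonymity model described above, the following sketch uses hypothetical records and column choices: ages are generalized to decade bands and ZIP codes truncated, after which every combination of quasi-identifiers is checked to appear at least k times.

<pre>
from collections import Counter

# Hypothetical records: (age, ZIP, diagnosis). Age and ZIP are
# quasi-identifiers; diagnosis is the retained clinical attribute.
records = [
    (34, "97201", "asthma"),
    (37, "97209", "diabetes"),
    (36, "97205", "asthma"),
    (52, "97211", "asthma"),
    (55, "97213", "hypertension"),
    (58, "97217", "diabetes"),
]

def generalize(age, zipcode):
    """Generalization: coarsen age to a decade band, ZIP to its prefix."""
    low = age // 10 * 10
    return (f"{low}-{low + 9}", zipcode[:3] + "**")

def is_k_anonymous(rows, k):
    """k-anonymity holds if every quasi-identifier combination
    occurs in at least k records."""
    counts = Counter(row[:2] for row in rows)
    return all(c >= k for c in counts.values())

generalized = [generalize(a, z) + (dx,) for a, z, dx in records]
print(is_k_anonymous(generalized, k=3))  # True: two classes of 3 records
print(is_k_anonymous(records, k=3))      # False: raw rows are all unique
</pre>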
Multiple techniques can be combined to improve the anonymity-versus-utility balance, as in one example[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5858761/] in which explicit identifiers such as names were scrubbed, quasi-identifiers such as admission/discharge dates and hospital were clustered, and clinical details such as diseases and medications were left unchanged, achieving better performance than the Safe Harbor method.
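A minimal sketch of that layered treatment, with hypothetical field names and values (the hospital-to-cluster mapping, in particular, would come from a real grouping step rather than a fixed label):

<pre>
from datetime import date

def deidentify_record(rec):
    """Apply a different treatment to each attribute class."""
    return {
        # Explicit identifier: suppressed outright.
        "name": "[REDACTED]",
        # Quasi-identifiers: clustered to coarser values.
        "admission": rec["admission"].strftime("%Y-%m"),  # keep month only
        "hospital": "large urban teaching hospital",      # cluster label (hypothetical)
        # Clinical details: left unchanged to preserve utility.
        "diagnosis": rec["diagnosis"],
        "medications": rec["medications"],
    }

rec = {
    "name": "Jane Doe",
    "admission": date(2024, 3, 14),
    "hospital": "Example University Hospital",
    "diagnosis": "type 2 diabetes",
    "medications": ["metformin"],
}
print(deidentify_record(rec))
</pre>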
Submitted by Kyu Seo Kim
[[Category:BMI512-SPRING-24]]
