Mining Electronic Health Record Data

From Clinfowiki

Introduction

Data mining is a means of leveraging very large data sets (Big Data) to discover meaningful information. In general, it is used to find patterns that are difficult to see using traditional data analysis techniques. Data mining uses techniques from artificial intelligence and statistics, especially machine learning.

Traditional database reports and queries only report what is already in the database. Online analytical processing (OLAP) lets an analyst test hypotheses; however, the analyst must first formulate the hypothesis (3). The advantage of data mining techniques over OLAP is the ability to find patterns in the data without formulating an initial hypothesis, possibly revealing relationships that an analyst would never have considered.

Data mining typically relies on large databases or data warehouses, which often call for newer storage technologies such as Hadoop, NoSQL, and NewSQL.

In electronic health records, data mining has been used to determine the effectiveness of treatment, find patterns among similar patients, interpret genomic data, and optimize clinical decision support. (2,4,6,7)

Types of Data

Generally speaking, data mining requires a "flat file" or vector data format rather than relational tables or objects (4). This means that all the observed values for each case appear in a single record rather than being spread across normalized relational tables. Data can be numerical or categorical (either ordinal or nominal).
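As a minimal sketch of what flattening means in practice, the Python below collapses three normalized tables into one record per patient. The table names, column names, and values are hypothetical and do not come from any real EHR schema.

```python
from collections import defaultdict

# Hypothetical normalized tables; names and values are illustrative only.
patients = [
    {"patient_id": 1, "sex": "F", "age": 64},
    {"patient_id": 2, "sex": "M", "age": 51},
]
diagnoses = [
    {"patient_id": 1, "code": "E11"},
    {"patient_id": 1, "code": "I10"},
    {"patient_id": 2, "code": "I10"},
]
labs = [
    {"patient_id": 1, "test": "hba1c", "value": 8.1},
    {"patient_id": 2, "test": "hba1c", "value": 5.6},
]


def flatten(patients, diagnoses, labs):
    """Return one flat record (feature vector) per patient."""
    dx_codes = sorted({d["code"] for d in diagnoses})
    dx_by_patient = defaultdict(set)
    for d in diagnoses:
        dx_by_patient[d["patient_id"]].add(d["code"])

    lab_by_patient = defaultdict(dict)
    for lab in labs:
        # Keeps the last value seen per test; a real pipeline would need an
        # explicit rule (latest, mean, worst) for repeated measurements.
        lab_by_patient[lab["patient_id"]][lab["test"]] = lab["value"]

    rows = []
    for p in patients:
        row = dict(p)
        for code in dx_codes:
            # Encode each nominal diagnosis code as its own 0/1 column.
            row[f"dx_{code}"] = int(code in dx_by_patient[p["patient_id"]])
        for test, value in lab_by_patient[p["patient_id"]].items():
            row[f"lab_{test}"] = value
        rows.append(row)
    return rows


flat = flatten(patients, diagnoses, labs)
for row in flat:
    print(row)
```

Each printed row holds everything known about one patient, which is the shape most mining algorithms expect as input.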

The Meaningful Use program has led to the collection of a Common Clinical Data Set (CCDS) across most providers, and these data are generally available in EHRs. Common clinical information includes demographics, diagnoses, problem lists, family history, allergies, immunizations, medications, procedures, lab values and orders, vital signs, radiology and other reports, cost and billing, genetic information, and social data, among many others. Data may be free-form text (non-semantic data), as in continuity of care documents or notes, or structured and semantic, such as ICD diagnostic codes, lab data, drug information, and digital images with radiology data. Data can also be administrative, including scheduling or billing information.

Key data format standards exist for some of these data, but not all.

Patient identifiers: generally a unique patient ID, the medical record number (MRN), which may be linked to a state-wide or health-network-wide health information exchange.

Diagnoses: International Classification of Diseases (ICD), Systematized Nomenclature of Medicine (SNOMED)

Medications: National Drug Codes (NDCs), RxNorm, Systematized Nomenclature of Medicine (SNOMED)

Procedures: International Classification of Diseases, Clinical Modification (ICD-CM), Current Procedural Terminology (CPT), Healthcare Common Procedure Coding System (HCPCS)

Lab data: Logical Observation Identifiers Names and Codes (LOINC), Systematized Nomenclature of Medicine (SNOMED), Current Procedural Terminology (CPT)

Vital signs: Logical Observation Identifiers Names and Codes (LOINC)


Potential Applications

The ultimate purpose of data mining is to provide new information which can turn into new knowledge. This knowledge can be used to gain a competitive advantage, earn a greater profit, improve healthcare outcomes, provide better services, advance scientific knowledge, or increase efficiency of system processes.

Data mining can be used to assess risk. Berrouiguet et al. (2019) analyzed a database of health records from 2802 suicide attempters to identify risk factors. They used clustering methods to identify similar patients and regression trees to estimate the number of suicide attempts, and then used this information to propose clinical decision support.
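To illustrate the general idea of grouping similar patients, here is a minimal k-means sketch in plain Python. It is not the method used by Berrouiguet et al.; the two features (age and number of prior events) and all values are invented, and a real analysis would standardize features and choose the number of clusters with care.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means. Uses the first k points as initial centroids so the
    result is deterministic. Returns (centroids, labels)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Hypothetical patients as (age, prior_events); values are made up.
patients = [(25, 1), (27, 1), (24, 2), (60, 4), (62, 5), (58, 4)]
centroids, labels = kmeans(patients, k=2)
print(labels)
print(centroids)
```

Patients that share a label form one cluster; a follow-up model, such as a regression tree, could then be fitted within or across those clusters.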

Integrating genomics, that is, viewing genomics in the context of physiology, ecology, and evolution to better understand phenotype, has been a prominent application of data mining. Data mining in this context applies deep learning, cluster analysis, gene-set enrichment analysis, and random forests, among other approaches, to molecular data. It has also been used to discover novel functions by mining genomic sequence data.

Data mining can also be used to identify an event. For example, Sundermann et al. validated a data mining approach by identifying the correct route of transmission and assessing preventable cases in hospital outbreaks. They concluded that "Correct routes were identified for all outbreaks at the second patient, except for one outbreak involving >1 transmission route that was detected at the eighth patient. Up to 40 or 34 infections (78% or 66% of possible preventable infections, respectively) could have been prevented if data mining had been implemented in real time, assuming the initiation of an effective intervention within 7 or 14 days of identification of the transmission route, respectively." (Sundermann et al. 2019).
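The sketch below shows one simple way such an investigation could be framed: look for an exposure shared by several infected patients within a short time window. This is an illustrative assumption, not the algorithm of Sundermann et al.; the patient IDs, exposure names, dates, and the 14-day default window are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical infected cases with EHR-derived exposures (name, date).
cases = {
    "pt_a": [("bronchoscopy", date(2020, 1, 3)), ("unit_4W", date(2020, 1, 2))],
    "pt_b": [("bronchoscopy", date(2020, 1, 9)), ("unit_6E", date(2020, 1, 8))],
    "pt_c": [("unit_6E", date(2020, 2, 20))],
}


def shared_exposures(cases, window_days=14, min_cases=2):
    """Return exposures shared by at least min_cases distinct patients
    within any span of window_days. Both thresholds are arbitrary here."""
    by_exposure = defaultdict(list)
    for patient, exposures in cases.items():
        for name, when in exposures:
            by_exposure[name].append((when, patient))

    flagged = {}
    for name, events in by_exposure.items():
        events.sort()  # chronological order
        for i, (start, _) in enumerate(events):
            # Patients exposed from this event up to window_days later.
            nearby = {p for when, p in events[i:]
                      if (when - start).days <= window_days}
            if len(nearby) >= min_cases:
                flagged[name] = sorted(nearby)
                break
    return flagged


candidates = shared_exposures(cases)
print(candidates)
```

A flagged exposure is only a candidate transmission route; confirming it would still require epidemiological and, where available, microbiological review.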

It has been used to fill in missing information in the record by analyzing patterns. Groenhof et al. used data mining techniques to identify 1661 participants who did not have a smoking status recorded in the EHR. They compared the results to a questionnaire that all of these patients had received and found that "Diagnostic accuracy for current smoking was sensitivity 88%, specificity 92%, NPV 98%, and PPV 63%. From false positives, 85% reported they had quit smoking at the time of the UCC" (Groenhof et al. 2019).
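The four measures quoted above are all derived from a 2x2 table that compares the mined result with a reference standard (here, the questionnaire). The following sketch shows the arithmetic; the counts are made up for illustration and are not the study's data.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Compute standard accuracy measures from 2x2 confusion-table counts.

    tp: mined positive, reference positive
    fp: mined positive, reference negative
    fn: mined negative, reference positive
    tn: mined negative, reference negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true smokers that were found
        "specificity": tn / (tn + fp),  # share of non-smokers correctly cleared
        "ppv": tp / (tp + fp),          # share of mined positives that are right
        "npv": tn / (tn + fn),          # share of mined negatives that are right
    }


# Hypothetical counts, chosen only to show the calculation.
metrics = diagnostic_accuracy(tp=80, fp=40, fn=20, tn=860)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When the condition is uncommon in the sample, even a modest number of false positives pulls the PPV down while the NPV stays high, which matches the general pattern in the quoted results.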

Mining using CCDs versus EHR Tables

CCDs (Continuity of Care Documents) are vendor-supplied, highly structured documents that offer a compact and convenient way to exchange information. They are generally based on HL7 and ASTM International standards. They are useful for the explicit purposes they were designed for, such as medication lists or problem lists. EHR tables, by contrast, allow direct-access data mining, which opens up the full set of vitals, lab orders, notes, observations, diagnoses, and other data for analysis. Furthermore, a CCD may be mapped to only a single template of information, whereas EHR tables can capture all variations of the data. (HealthITAnalytics 2020)


Challenges

The greatest challenges in data mining revolve around the interaction between people and information systems. Ownership of information is an evolving discussion that is not limited to data mining. Information that is collected and analyzed is not usually owned directly by the informatics departments that analyze it (3). Depending on context, ownership claims can be made by the patient, the EHR system, the clinician, and the data miner. Subsystems that are integrated into the EHR can also have claims.

Privacy is a major concern in this context. As massive amounts of data become available for study, the potential for large-scale data exposure increases, and ethical concerns are ever present. Data generated from critically ill patients during clinical care is almost always unconsented. Data mining itself requires data preparation, which can compromise patient confidentiality and privacy (3). A common way of mitigating this problem is data aggregation. Researchers are often caught between wanting the flexibility to analyze information and the need for information security.
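As a rough sketch of aggregation, the code below releases only group counts and suppresses any group smaller than a minimum cell size. The field names, records, and threshold of 5 are assumptions for illustration, not regulatory values, and aggregation alone does not guarantee de-identification.

```python
from collections import Counter


def aggregate_counts(records, keys, min_cell_size=5):
    """Count records per combination of `keys`, suppressing small cells.

    Groups with fewer than min_cell_size records are reported as None so
    that rare, potentially identifying combinations are not released.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {
        group: (n if n >= min_cell_size else None)
        for group, n in counts.items()
    }


# Hypothetical patient-level records; only the aggregate would be shared.
records = (
    [{"age_band": "60-69", "diagnosis": "I10"}] * 6
    + [{"age_band": "20-29", "diagnosis": "E11"}] * 2
)

summary = aggregate_counts(records, keys=("age_band", "diagnosis"))
for group, n in summary.items():
    print(group, "suppressed" if n is None else n)
```

The trade-off described above is visible here: the summary is safer to share than the patient-level rows, but it no longer supports analyses that need individual records.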

Challenges in data mining can also be technical. Data collection formats can vary widely between vendors, states, and specialties (3). Health data standards vary, and for some forms of data they do not exist. Clinical context is often lost, which limits interpretation of the data values. Unstructured data such as progress note prose, still valued by clinicians, is difficult to store semantically.

Recent Advances in EHR Big Data Mining

Spatio-temporal Data Mining: as mobile devices with GPS and body position sensors become widely available, spatio-temporal data becomes more prevalent. These data can be used in public health applications, such as studying the effects of population movements during an epidemic (7).

mHealth: Patient-provided data in EHRs is gaining traction. Patients can input information including home glucose readings, blood pressure measurements, problem lists, medications, demographic and social information, and communications data.

References

1. Berrouiguet, Sofian, et al. “An Approach for Data Mining of Electronic Health Record Data for Suicide Risk Management: Database Analysis for Clinical Decision Support.” JMIR Mental Health, vol. 6, no. 5, 7 May 2019, p. e9766, 10.2196/mental.9766. Accessed 25 Mar. 2020.

2. Darst, Burcu, et al. “Data Mining and Machine Learning Approaches for the Integration of Genome-Wide Association and Methylation Data: Methodology and Main Conclusions from GAW20.” BMC Genetics, vol. 19, no. S1, Sept. 2018, 10.1186/s12863-018-0646-3. Accessed 24 Apr. 2020.

3. Denaxas, Spiros C., et al. “The Tip of the Iceberg: Challenges of Accessing Hospital Electronic Health Record Data for Biological Data Mining.” BioData Mining, vol. 9, no. 1, 22 Sept. 2016, 10.1186/s13040-016-0109-1. Accessed 7 Dec. 2019.

4. Groenhof, T. Katrien J., et al. “Data Mining Information from EHRs Produced High Yield and Accuracy for Current Smoking Status.” Journal of Clinical Epidemiology, Nov. 2019, 10.1016/j.jclinepi.2019.11.006. Accessed 19 Nov. 2019.

5. HealthITAnalytics. “Using the EHR to Dive into Data Mining, Clinical Analytics.” HealthITAnalytics, 17 Sept. 2014, healthitanalytics.com/news/using-the-ehr-to-dive-into-data-mining-clinical-analytics. Accessed 24 Apr. 2020.

6. Lobb, Briallen, and Andrew C Doxey. “Novel Function Discovery through Sequence and Structural Data Mining.” Current Opinion in Structural Biology, vol. 38, June 2016, pp. 53–61, 10.1016/j.sbi.2016.05.017. Accessed 12 Mar. 2020.

7. Sundermann, Alexander J., et al. “Automated Data Mining of the Electronic Health Record for Investigation of Healthcare-Associated Outbreaks.” Infection Control & Hospital Epidemiology, vol. 40, no. 3, 18 Feb. 2019, pp. 314–319, 10.1017/ice.2018.343. Accessed 24 Apr. 2020.


Submitted by Atin Jindal (Jindala), 19:12, 24 April 2020 (UTC)