Coronary artery disease risk assessment from unstructured electronic health records using text mining
This is a review of the article by Jonnagaddala et al., Coronary artery disease risk assessment from unstructured electronic health records using text mining.
Coronary artery disease (CAD) often leads to myocardial infarction, which may be fatal. Risk factors can be used to predict CAD, which may subsequently lead to prevention or early intervention. Patient data such as co-morbidities, medication history, social history and family history are required to determine the risk factors for a disease. However, risk factor data are usually embedded in unstructured clinical narratives if the data is not collected specifically for risk assessment purposes. Clinical text mining can be used to extract data related to risk factors from unstructured clinical notes. This study presents methods to extract Framingham risk factors from unstructured electronic health records using clinical text mining and to calculate 10-year coronary artery disease risk scores in a cohort of diabetic patients. We developed a rule-based system to extract risk factors: age, gender, total cholesterol, HDL-C, blood pressure, diabetes history and smoking history. The results showed that the output from the text mining system was reliable, but there was a significant amount of missing data to calculate the Framingham risk score. A systematic approach for understanding missing data was followed by implementation of imputation strategies. An analysis of the 10-year Framingham risk scores for coronary artery disease in this cohort has shown that the majority of the diabetic patients are at moderate risk of CAD.
Background and Purpose
The paper focuses on using text mining to extract the information needed to calculate a patient’s risk of coronary artery disease (CAD) from unstructured notes in electronic health records. Several models exist for predicting a patient’s CAD risk, but the most popular is the Framingham risk score (FRS) model. The factors FRS uses are age, gender, total cholesterol or low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), blood pressure (BP), diabetes history and smoking history, all of which can be found in the unstructured text of electronic health records. FRS calculators available online require the values for these factors to be entered manually in structured form. For a small number of patients this isn’t a problem, but for a large number of patients it becomes an overwhelming task. Since all of the necessary information is available in the unstructured text of an electronic health record, it makes sense to employ a system that uses text mining to extract the values and calculate a patient’s CAD risk using the FRS model.
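To make the idea of rule-based extraction concrete, here is a minimal Python sketch of how a few of these factors might be pulled from a free-text note with regular expressions. The patterns, the field names and the `extract_risk_factors` function are all assumptions made for illustration; they are not the rules used by Jonnagaddala et al., and a real clinical system would need far more robust handling of negation, units and phrasing.

```python
import re
from typing import Dict, Optional, Union

# Hypothetical patterns written for illustration only; NOT the paper's rules.
# Each captures the first 2-3 digit number shortly after the keyword.
NUMERIC_PATTERNS = {
    "total_cholesterol": re.compile(
        r"\btotal cholesterol\b\D{0,20}?(\d{2,3})", re.IGNORECASE),
    "hdl_c": re.compile(
        r"\bHDL(?:-C)?\b\D{0,20}?(\d{2,3})", re.IGNORECASE),
    # Systolic value is the number before the slash, e.g. "BP: 138/85".
    "systolic_bp": re.compile(
        r"\b(?:BP|blood pressure)\b\D{0,20}?(\d{2,3})\s*/\s*\d{2,3}",
        re.IGNORECASE),
}
NON_SMOKER = re.compile(
    r"\b(?:non-?smoker|never smoked|denies smoking)\b", re.IGNORECASE)
SMOKER = re.compile(r"\b(?:current smoker|smokes)\b", re.IGNORECASE)

FactorValue = Optional[Union[float, bool]]


def extract_risk_factors(note: str) -> Dict[str, FactorValue]:
    """Return the factors found in one note; None marks a missing value."""
    factors: Dict[str, FactorValue] = {}
    for name, pattern in NUMERIC_PATTERNS.items():
        match = pattern.search(note)
        factors[name] = float(match.group(1)) if match else None

    # Check negated mentions first so "denies smoking" is not read as smoking.
    if NON_SMOKER.search(note):
        factors["smoker"] = False
    elif SMOKER.search(note):
        factors["smoker"] = True
    else:
        factors["smoker"] = None
    return factors


if __name__ == "__main__":
    # Made-up example note, not taken from the study data.
    note = ("Pt is a 62 yo male. BP: 138/85. "
            "Total cholesterol 212 mg/dL, HDL 45. Denies smoking.")
    print(extract_risk_factors(note))
```

A note that mentions none of the factors yields `None` for every field, which is the kind of gap that the missing-data discussion in the review is concerned with.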
The system they developed was rule-based, and the factors it extracted were used to calculate a 10-year FRS CAD risk for each patient. The CAD risk scores generated by the system were consistent with the scores calculated manually for 20 patients using the official 10-year CAD FRS worksheet. A major issue the authors encountered was missing information, which they mitigated by applying imputation methods such as assigning the cohort mean value to factors without a value.
Two reasons were identified for why some factors had missing values: (a) the text mining system failed to recognize the risk factor data, and (b) the risk factor data was not recorded. Manual inspection showed that the latter accounted for the majority of the missing values.
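The cohort-mean imputation mentioned above can be sketched in a few lines of Python. This is only an illustration of the general technique for a single numeric factor; the function name `impute_cohort_mean` and the example values are assumptions, and the paper may have applied different strategies to different factors.

```python
from statistics import mean
from typing import List, Optional


def impute_cohort_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing (None) entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # With no observed values there is no cohort mean to fall back on.
        raise ValueError("cannot impute: no observed values in the cohort")
    cohort_mean = mean(observed)
    return [cohort_mean if v is None else v for v in values]


if __name__ == "__main__":
    # Made-up total cholesterol values (mg/dL); None marks a missing value.
    total_cholesterol = [200.0, None, 220.0]
    print(impute_cohort_mean(total_cholesterol))
```

Mean imputation keeps every patient in the cohort, at the cost of pulling imputed patients toward the average. That is worth keeping in mind when reading the finding that most patients fall in the moderate-risk band.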
- Jonnagaddala, J., Liaw, S.-T., Ray, P., Kumar, M., Chang, N.-W., & Dai, H.-J. (2015). Coronary artery disease risk assessment from unstructured electronic health records using text mining. Journal of Biomedical Informatics. http://doi.org/10.1016/j.jbi.2015.08.003