Secondary Use of EHR: Data Quality Issues and Informatics Opportunities

From Clinfowiki
Revision as of 19:18, 3 October 2015 by Tom4891 (Talk | contribs)



This paper examines the data quality issues that arose during a retrospective study of survival in pancreatic cancer patients. It describes the problems that surfaced when electronic health record (EHR) data were put to secondary use and discusses how emerging informatics technologies could help solve them. [1]

Data Source and Methods

The study used the clinical data warehouse of Columbia University Medical Center in New York City. A three-step procedure identified the pancreatic cancer cases in the warehouse. Step 1: Patients were selected by ICD-9-CM codes 157.0-157.9 (malignant neoplasm of pancreas) recorded between 01/01/1999 and 01/01/2009; other pertinent data, reports, and notes were also extracted and analyzed. Step 2: A SQL query was applied, and its output was then reviewed manually to filter out non-malignancies and non-primary lesions. Step 3: The remaining patients were divided into endocrine and exocrine neoplasm groups and further classified according to disease subtype standards.
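Step 1 of the procedure above can be sketched in a few lines of Python. This is only an illustration, not the query the authors ran: the record layout (`patient_id`, `icd9`, `date`), the function names, and the decision to treat both end dates as inclusive are assumptions, since the source does not describe the warehouse schema.

```python
from datetime import date

# Study window as stated in the text; inclusive endpoints are an assumption.
START, END = date(1999, 1, 1), date(2009, 1, 1)


def is_pancreatic_cancer_code(code: str) -> bool:
    """True for codes in the stated range 157.0 through 157.9."""
    code = code.strip()
    return len(code) == 5 and code.startswith("157.") and code[4] in "0123456789"


def select_candidates(records):
    """Return sorted distinct patient IDs with a qualifying code in the window."""
    ids = set()
    for rec in records:
        if is_pancreatic_cancer_code(rec["icd9"]) and START <= rec["date"] <= END:
            ids.add(rec["patient_id"])
    return sorted(ids)


# Hypothetical toy records, not data from the study.
records = [
    {"patient_id": "A", "icd9": "157.0", "date": date(2003, 5, 1)},
    {"patient_id": "B", "icd9": "157.9", "date": date(2010, 2, 1)},   # outside window
    {"patient_id": "C", "icd9": "250.00", "date": date(2004, 1, 1)},  # unrelated code
    {"patient_id": "A", "icd9": "157.1", "date": date(2004, 7, 9)},   # same patient again
]
candidates = select_candidates(records)
print(candidates)
```

A code-based filter like this only produces candidates. In the study, the manual review in Step 2 was still needed to remove non-malignancies and non-primary lesions.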


The ICD-9-CM codes 157.0-157.9 identified 3068 patients between 01/01/1999 and 01/01/2009. Of these, 1479 (48%) had no corresponding diagnosis or disease documentation in the pathology reports returned by the query. The remaining 1589 (52%) patients were further reduced to 522 (17% of the original 3068) because the key study variable that defines disease stage was incomplete. Three main problems were identified during the study: information incompleteness, inconsistency, and inaccuracy. Incompleteness (missing information) was significant in many study variables, and variables with more than 50% incompleteness were excluded from further analysis. Inconsistency and inaccuracy were also widely observed.
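The cohort counts and the 50% exclusion rule reported above can be checked with a short sketch. The counts come from the text. The helper names, the toy rows, and the choice to treat `None` and empty strings as missing are assumptions made for illustration.

```python
def pct(part: int, whole: int) -> int:
    """Percentage of whole, rounded to the nearest integer."""
    return round(100 * part / whole)


# Counts reported in the summary.
total = 3068
no_pathology = 1479
with_pathology = total - no_pathology
with_stage = 522


def missing_fraction(rows, variable):
    """Fraction of rows where the variable is absent, None, or empty."""
    missing = sum(1 for row in rows if row.get(variable) in (None, ""))
    return missing / len(rows)


def variables_to_keep(rows, variables, threshold=0.5):
    """Keep variables whose missing fraction does not exceed the threshold."""
    return [v for v in variables if missing_fraction(rows, v) <= threshold]


# Hypothetical toy rows: "stage" is missing in 3 of 4, "age" in 1 of 4.
rows = [
    {"age": 61, "stage": None},
    {"age": 70, "stage": "II"},
    {"age": None, "stage": ""},
    {"age": 55},
]
kept = variables_to_keep(rows, ["age", "stage"])

print(with_pathology, pct(no_pathology, total), pct(with_pathology, total), pct(with_stage, total))
print(kept)
```

All three percentages are taken against the original 3068 patients, which is why the 522 staged patients appear as 17% rather than as a share of the 1589 with pathology documentation.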


The problems encountered during the study (information incompleteness, inconsistency, and inaccuracy) are common challenges at many institutions and are not unique to the data warehouse used. Solving them in clinical data warehouses requires new technologies for storage and natural language processing. The authors suggest combining text mining tools with post-processing to improve data retrieval, while noting that the quality of the data collected also needs to improve.


As more institutions move toward clinical data warehousing, steps should be taken to develop new methods and technologies for clinical analytics. The authors also point out that personal health records (PHRs), clinical registries, and health information exchange will be key enabling technologies for improving EHR data quality.

My Comments

This study clearly shows the need for efficient data mining systems. Secondary use of clinical data from EHRs is an essential part of the healthcare system, and further research is needed to develop systems and technologies that can solve the problems described in the study. My only criticism is that multiple data warehouses weren’t used. I believe that would have garnered more clinical data and allowed a comparison of the problems encountered across warehouses. More accurate and pertinent suggestions for improving data quality could then have been made.


  1. Botsis, T., Hartvigsen, G., Chen, F., & Weng, C. (2010). Secondary Use of EHR: Data Quality Issues and Informatics Opportunities