Secondary Use of EHR: Data Quality Issues and Informatics Opportunities
This paper examines the data quality issues that arose during a retrospective study of the survival of pancreatic cancer patients. It examines the secondary use of electronic health record (EHR) data, describes how those data quality issues manifested, and discusses how emerging informatics technologies could help solve them.
Data Source and Methods
This study used the Columbia University Medical Center’s clinical data warehouse in New York City. A three-step procedure was used to identify pancreatic cancer cases in the data warehouse. Step 1: ICD-9-CM codes 157.0-157.9, which correspond to a diagnosis of malignant neoplasm of the pancreas, were used to identify patients between 01/01/1999 and 01/01/2009. Other pertinent data, reports, and notes were also extracted and analyzed. Step 2: A SQL query was applied, and its output was then manually reviewed to filter out non-malignancies and non-primary lesions. Step 3: The remaining patients were divided into endocrine and exocrine neoplasm groups and further classified according to disease subtype standards.
ICD-9-CM codes 157.0-157.9 identified 3068 patients between 01/01/1999 and 01/01/2009. Of these patients, 1479 (48%) did not have corresponding diagnoses or disease documentation in the pathology reports following a query. The remaining 1589 (52%) patients were further reduced to 522 (17%) because of incompleteness in the key study variable that defines the disease stage. Three main problems were identified during the study. Significant information incompleteness (missing information) was noted in many study variables, and variables with more than 50% incompleteness were excluded from further analysis. Information inconsistency and inaccuracy were also widely observed.
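Step 1 of the procedure can be sketched as a SQL query. This is only an illustration, not the authors' actual query: the `diagnosis` table, its column names, and the sample rows are hypothetical, and an in-memory SQLite database stands in for the clinical data warehouse.

```python
import sqlite3


def find_candidate_patients(conn):
    """Return distinct patient IDs with an ICD-9-CM 157.0-157.9 code in the study window."""
    rows = conn.execute(
        """
        SELECT DISTINCT patient_id
        FROM diagnosis
        WHERE icd9_code BETWEEN '157.0' AND '157.9'
          AND diagnosis_date BETWEEN '1999-01-01' AND '2009-01-01'
        ORDER BY patient_id
        """
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical schema and toy rows (not data from the study).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE diagnosis (patient_id INTEGER, icd9_code TEXT, diagnosis_date TEXT)"
)
conn.executemany(
    "INSERT INTO diagnosis VALUES (?, ?, ?)",
    [
        (1, "157.0", "2001-05-14"),
        (2, "157.9", "2008-11-02"),
        (3, "250.00", "2003-03-09"),  # code outside the 157.x range
        (4, "157.1", "1998-12-31"),   # date before the study window
        (1, "157.9", "2002-01-20"),   # second code for the same patient
    ],
)

candidates = find_candidate_patients(conn)
print(candidates)
```

Dates are stored as ISO strings here so that plain text comparison orders them correctly. A query like this only produces candidates; Steps 2 and 3 still depend on pathology documentation and manual review.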
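The cohort attrition reported above can be checked with a few lines of arithmetic. The counts come from the summary; the variable names and the helper `pct` are illustrative only.

```python
# Counts as reported in the summary above.
initial = 3068        # patients with an ICD-9-CM 157.0-157.9 code
no_pathology = 1479   # no corresponding documentation in pathology reports
final = 522           # patients left after excluding incomplete stage data

with_pathology = initial - no_pathology


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to the nearest integer."""
    return round(100 * part / whole)


print(with_pathology)                # patients with pathology documentation
print(pct(no_pathology, initial))    # share excluded for missing pathology
print(pct(with_pathology, initial))  # share with pathology documentation
print(pct(final, initial))           # share retained in the final cohort
```

All percentages are relative to the initial 3068 patients, which is why the final cohort is 17% of the starting population rather than a share of the 1589 patients with pathology documentation.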
The problems encountered during the study (information incompleteness, inconsistency, and inaccuracy) are common challenges in many institutions and are not unique to the data warehouse used. To solve these problems with clinical data warehouses, new storage technology and new natural language processing methods are needed. The authors suggest combining text mining tools with post-processing to improve data retrieval, while noting that the quality of the data collected also needs to improve.
As more institutions move toward clinical data warehousing, steps should be taken to develop new methods and technologies for clinical analytics. The authors also point out that personal health records (PHRs), clinical registries, and health information exchange will be key enabling technologies for improving EHR data quality.
This study clearly shows the need for efficient data mining systems. Secondary use of clinical data or EHR data, as stated earlier, is an essential part of the healthcare system, and further research is needed to develop advanced systems or technologies that can solve the problems described in the study. My only criticism of the study is that multiple data warehouses were not used. I believe that doing so would have yielded more clinical data and would have allowed a comparison of the problems encountered across the different data warehouses. More accurate or pertinent suggestions for improving data quality could then have been made.
Electronic health records are increasingly used for health services and clinical research. Currently, the potential of EHR data for medical research is not fully realized, and secondary use of EHR data is still at an early stage. The authors of this paper reported their experience with data quality issues in a survival analysis for pancreatic cancer. They identified the major quality issues and their manifestations, and also discussed opportunities for health information technology to alleviate those issues.
Data Source and Methods
The Columbia University Medical Center’s clinical data warehouse was used for this study. This data warehouse uses a controlled clinical vocabulary, the Medical Entities Dictionary, to integrate data from different hospital information systems. The authors used a three-step procedure to identify a cohort of pancreatic cancer cases from the data collected between 1999 and 2009: 1) use ICD-9-CM codes to identify pancreatic cancer cases, 2) use SQL to exclude patients who lacked adequate documentation for a diagnosis of pancreatic cancer, and 3) divide the remaining patients into endocrine and exocrine neoplasms. They manually abstracted and automatically extracted specific pathology characteristics and applied three common measurements of data quality: incompleteness, inconsistency, and inaccuracy.
Of the 3068 patients who had an ICD-9-CM diagnosis of pancreatic cancer, only 1589 had corresponding disease documentation in pathology reports. Incompleteness was the leading data quality issue. After study variables with more than 50% incompleteness were excluded, the degree of incompleteness was between 0% and 44% for endocrine pancreatic tumors. The degree of information incompleteness was higher in the later-stage ductal adenocarcinomas, with many variables more than 50% incomplete. Information inconsistency occurred either between different EHR data sources (different components of the EHR such as clinical notes, the drug registry, etc.) or within the same EHR data source. Information inaccuracy included poor granularity of the diagnosis terms or disease classification codes and inadequate or non-standardized documentation of disease status or treatment details.
Information incompleteness, inconsistency, and inaccuracy are common challenges for many other clinical data warehouses. New storage technology, new natural language processing methods, and extended SQL will help with the accessibility, availability, and computability of health data. The authors suggest combining text mining tools and special post-processing to facilitate information retrieval. Text mining tools should be based on a source- and domain-specific lexicon. User involvement is key to improving the quality of data. Information incompleteness caused by information fragmentation could be mitigated using health information exchange (HIE) methods. Information incompleteness due to poor documentation would benefit from proactive documentation support. The authors described how personal health records (PHRs) and clinical registries could offer potential solutions for data quality issues.
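The incompleteness measure and the 50% exclusion rule can be illustrated with a short sketch. It assumes that incompleteness means the fraction of records in which a variable is missing; the variable names and the toy records below are hypothetical and are not data from the study.

```python
def incompleteness(records, variable):
    """Fraction of records in which `variable` is missing (None or empty string)."""
    missing = sum(1 for record in records if record.get(variable) in (None, ""))
    return missing / len(records)


def usable_variables(records, variables, threshold=0.5):
    """Keep variables whose incompleteness does not exceed the threshold."""
    return [v for v in variables if incompleteness(records, v) <= threshold]


# Toy records with hypothetical variable names.
records = [
    {"tumor_size_cm": 2.1, "stage": "IIA", "margin_status": None},
    {"tumor_size_cm": None, "stage": "IB", "margin_status": None},
    {"tumor_size_cm": 3.4, "stage": None, "margin_status": "negative"},
    {"tumor_size_cm": 1.8, "stage": "III", "margin_status": None},
]
variables = ["tumor_size_cm", "stage", "margin_status"]

rates = {v: incompleteness(records, v) for v in variables}
kept = usable_variables(records, variables)
print(rates)
print(kept)
```

In this toy data, `margin_status` is missing in three of four records, so it exceeds the 50% threshold and is dropped, while the other two variables are kept.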
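The idea of lexicon-based text mining followed by post-processing can be sketched as below. This is a minimal illustration under stated assumptions, not the tooling the authors describe: the two-concept lexicon, the negation cues, and the 30-character look-back window are all hypothetical choices, and real clinical text would need a far richer lexicon and negation handling.

```python
import re

# Hypothetical mini-lexicon mapping surface forms to normalized concepts.
LEXICON = {
    "ductal adenocarcinoma": "ductal_adenocarcinoma",
    "neuroendocrine tumor": "neuroendocrine_tumor",
    "neuroendocrine tumour": "neuroendocrine_tumor",
}

# Hypothetical negation cues used in the post-processing step.
NEGATION_CUES = ("no evidence of", "negative for")


def extract_concepts(report):
    """Return normalized concepts mentioned in a report, skipping negated mentions."""
    text = report.lower()
    found = []
    for term, concept in LEXICON.items():
        for match in re.finditer(re.escape(term), text):
            # Post-processing: inspect a short window before the mention.
            window = text[max(0, match.start() - 30):match.start()]
            if any(cue in window for cue in NEGATION_CUES):
                continue
            if concept not in found:
                found.append(concept)
    return found


positive = extract_concepts(
    "Pancreas, biopsy: Ductal adenocarcinoma, moderately differentiated."
)
negated = extract_concepts("No evidence of neuroendocrine tumour.")
print(positive)
print(negated)
```

Mapping spelling variants to one concept is what makes the lexicon source- and domain-specific, and the negation filter shows why raw term matching alone is not enough for reliable retrieval.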
PHRs, clinical registries, and health information exchange will be key to improving EHR data quality.
This paper reported the authors' experience with reusing EHR data for a survival analysis of a cohort of pancreatic cancer patients and identified the major data quality issues: information incompleteness, inconsistency, and inaccuracy. These are common challenges when reusing EHR data. The authors suggested new data mining technologies, health information exchange, PHRs, and clinical registries as solutions or potential solutions. They also pointed out that user involvement is key to improving data quality, which I think is much needed. Proactive documentation support would improve the comprehensiveness of health data. Because the authors did not provide more details, I am curious how clinical registries would improve EHR data and how the data for clinical registries would be collected.
- The rise of big clinical databases
- Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research
- Coronary artery disease risk assessment from unstructured electronic health records using text mining
- Botsis, T., Hartvigsen, G., Chen, F., & Weng, C. (2010). Secondary Use of EHR: Data Quality Issues and Informatics Opportunities. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041534/