Accessing primary care Big Data: the development of a software algorithm to explore the rich content of consultation records

From Clinfowiki

This is a brief review of the article "Accessing primary care Big Data: the development of a software algorithm to explore the rich content of consultation records."[1]



Objective

To develop a natural language processing (NLP) software inference algorithm to classify the content of primary care consultations using electronic health record (EHR) Big Data and subsequently test the algorithm's ability to estimate the prevalence and burden of childhood respiratory illness in primary care.


Design

Algorithm development and validation study. To classify consultations, the algorithm interrogates three sources recorded on the day of the consultation: the clinical narrative entered as free text, the diagnostic (Read) codes created, and the medications prescribed.
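The idea of combining the three evidence sources can be illustrated with a minimal sketch. This is not the authors' algorithm: the keyword pattern, the medication list, the use of Read chapter "H" as a respiratory marker, and the rule that any single source is sufficient are all assumptions made here for illustration, as are the names `Consultation`, `evidence` and `is_respiratory`.

```python
import re
from dataclasses import dataclass, field

# Hypothetical cue lists for illustration only; not the rules used in the paper.
RESPIRATORY_TERMS = re.compile(
    r"\b(cough\w*|wheez\w*|bronchiolitis|asthma|croup|pneumonia)\b",
    re.IGNORECASE,
)
# Assumption: codes in Read chapter "H" denote respiratory system diseases.
RESPIRATORY_READ_PREFIXES = ("H",)
# Assumption: a tiny stand-in list of respiratory medications.
RESPIRATORY_MEDICATIONS = {"salbutamol", "fluticasone"}


@dataclass
class Consultation:
    """One consultation record: free text, codes and prescriptions for the day."""
    notes: str = ""
    read_codes: list = field(default_factory=list)
    medications: list = field(default_factory=list)


def evidence(consult):
    """Return the set of sources that suggest a respiratory consultation."""
    found = set()
    if RESPIRATORY_TERMS.search(consult.notes):
        found.add("free_text")
    if any(code.startswith(RESPIRATORY_READ_PREFIXES) for code in consult.read_codes):
        found.add("read_code")
    if any(med.lower() in RESPIRATORY_MEDICATIONS for med in consult.medications):
        found.add("medication")
    return found


def is_respiratory(consult):
    """Assumed decision rule: one source of evidence is enough."""
    return bool(evidence(consult))
```

Returning the set of contributing sources, rather than only a yes/no answer, makes it easy to see why a record was classified the way it was when comparing against clinician judgments.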


Setting and Participants

Thirty-six consenting primary care practices from a mixed urban and semirural region of New Zealand. Three independent sets of 1200 child consultation records were randomly extracted from a data set of all general practitioner consultations in participating practices between January 1, 2008, and December 31, 2013, for children under 18 years of age (n=754,242). Each consultation record within these sets was independently classified by two expert clinicians as respiratory or non-respiratory, and subclassified according to respiratory diagnostic categories to create three "gold standard" sets of classified records. These three gold standard record sets were used to train, test and validate the algorithm.
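The extraction of three random record sets can be sketched as follows. The review does not say how the sampling was implemented, so this sketch makes several assumptions: the three sets do not overlap, n=754,242 counts consultations, integer IDs stand in for the real records, and the fixed seed is there only for reproducibility. The function name `draw_disjoint_sets` is hypothetical.

```python
import random


def draw_disjoint_sets(record_ids, n_sets=3, set_size=1200, seed=0):
    """Draw n_sets non-overlapping random samples of set_size record IDs."""
    rng = random.Random(seed)
    # Sampling without replacement guarantees that no ID is picked twice,
    # so slicing the result yields sets that cannot overlap.
    picked = rng.sample(record_ids, n_sets * set_size)
    return [picked[i * set_size:(i + 1) * set_size] for i in range(n_sets)]


# Stand-in IDs; assumes n=754,242 is the number of consultations in the data set.
all_ids = range(754_242)
train_ids, test_ids, validation_ids = draw_disjoint_sets(all_ids)
```

Drawing all 3600 IDs in one call and then slicing is a simple way to keep the training, test and validation sets separate, so that validation results are not inflated by records the algorithm was tuned on.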

Outcome Measures

Sensitivity, specificity, positive predictive value and F-measure were calculated to assess the algorithm's ability to replicate the judgments of expert clinicians within the 1200-record gold standard validation set.
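For readers less familiar with these measures, the sketch below shows how each is derived from a confusion matrix in which the clinicians' classification is treated as the truth. It assumes the F-measure is the balanced F1 score (the harmonic mean of positive predictive value and sensitivity). The function name and the counts in the usage note are made up for illustration and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard measures from confusion-matrix counts.

    tp: algorithm and clinicians both say respiratory
    fp: algorithm says respiratory, clinicians say non-respiratory
    tn: both say non-respiratory
    fn: algorithm says non-respiratory, clinicians say respiratory
    """
    sensitivity = tp / (tp + fn)   # share of true respiratory records found
    specificity = tn / (tn + fp)   # share of non-respiratory records excluded
    ppv = tp / (tp + fp)           # share of algorithm positives that are correct
    # Assumption: F-measure means balanced F1.
    f_measure = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f_measure": f_measure,
    }
```

With made-up counts such as `classification_metrics(tp=80, fp=10, tn=90, fn=20)`, sensitivity is 0.80 and specificity is 0.90. The function does not guard against empty denominators, which would need handling for categories with no positive records.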


Results

The algorithm identified respiratory consultations in the 1200-record validation set with a sensitivity of 0.72 (95% CI 0.67 to 0.78) and a specificity of 0.95 (95% CI 0.93 to 0.98). The positive predictive value of the algorithm's respiratory classification was 0.93 (95% CI 0.89 to 0.97). The positive predictive value for specific respiratory diagnostic categories ranged from 0.68 (95% CI 0.40 to 1.00; other respiratory conditions) to 0.91 (95% CI 0.79 to 1.00; throat infections).
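The review does not quote the F-measure itself. As a rough check, and assuming the F-measure is the balanced F1 score, the point estimates above for sensitivity and positive predictive value imply the following value. This is arithmetic on the rounded figures quoted here, not a number taken from the paper.

```latex
F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{sensitivity}}{\mathrm{PPV} + \mathrm{sensitivity}}
    = \frac{2 \times 0.93 \times 0.72}{0.93 + 0.72}
    = \frac{1.3392}{1.65}
    \approx 0.81
```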


Conclusions

A software inference algorithm that uses primary care Big Data can accurately classify the content of clinical consultations. This algorithm will enable accurate estimation of the prevalence of childhood respiratory illness in primary care and resultant service utilization. The methodology can also be applied to other areas of clinical care.


Comments

The large data set and the large number of "gold standard" classifications support the encouraging results from this fairly straightforward NLP algorithm. Sensitivity and specificity were especially impressive given the wide range of writing styles in the provider notes analyzed. Although retrospective classification as described in this study is both important and relevant, the authors' suggestion that the algorithm be integrated "into future versions of EHR software so that appropriate classification codes are suggested to clinicians in real time, thereby improving the quality and completeness of diagnostic coding" is especially exciting, given the wide range of clinical decision support (CDS) options that concurrent classification would make possible.

Related Resources

Using natural language processing to identify problem usage of prescription opioids

Visualizing unstructured patient data for assessing diagnostic and therapeutic history


  1. Macrae J, Darlow B, McBain L, et al. Accessing primary care Big Data: the development of a software algorithm to explore the rich content of consultation records. BMJ Open. 2015;5(8):e008160.