Data and Databases

From Clinfowiki

In an Electronic Health Record (EHR), data are facts and information about patients, procedures, histories, results, and other processes. Data can come from a variety of sources: clinical documentation, other modules (such as laboratory results), or external sources such as information entered or reported by patients. They can appear as numbers, codes, short phrases, sentences, and paragraphs. These data make up the patient’s health record and assist in diagnosis and treatment.

Data Categories

Structured Data

These are data entered into the EHR in predefined fields and formats so that they can be readily searched, used, and analyzed by computer systems. Although they cost the entering provider time and effort, these data are essential for data processing in areas such as research and clinical decision support. Examples of structured data include billing data, laboratory results, vital signs, medication lists, and lists of diagnostic codes. (See Structured data entry [1])

Unstructured Data

These are data entered or stored in a manner that is not easily accessible or quantifiable by a computer system. Free text, images, and scanned documents are examples of unstructured data. Such data can potentially be extracted and converted to structured formats using technology such as Natural Language Processing (NLP), but this is a difficult process and susceptible to errors. (See NLP [2])
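As a rough illustration of why extraction from free text is error-prone, the following minimal sketch pulls a blood pressure reading out of a note with a regular expression. This is simple pattern matching, not full Natural Language Processing, and the note text, the pattern, and the function name extract_blood_pressure are all invented for this example.

```python
import re

# Hypothetical free-text note; the wording and values are invented for illustration.
note = "Pt seen today. BP 128/82, HR 76. Reports mild headache."

# Assumption: blood pressure is written as "BP <systolic>/<diastolic>".
BP_PATTERN = re.compile(r"\bBP\s*(\d{2,3})\s*/\s*(\d{2,3})\b")


def extract_blood_pressure(text):
    """Return a structured dict for the first BP reading found, or None."""
    match = BP_PATTERN.search(text)
    if match is None:
        return None
    return {"systolic": int(match.group(1)), "diastolic": int(match.group(2))}


# The abbreviated form is found and converted to structured fields.
structured = extract_blood_pressure(note)

# The same information written out in words is silently missed.
missed = extract_blood_pressure("Blood pressure was one twenty over eighty.")
```

The second call shows the weakness: any phrasing the pattern did not anticipate yields no structured data at all.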

Big Data

Big data combines structured and unstructured data; however, it is distinguished by being difficult to process because of its huge volume, complexity, and sometimes continuous data importation. (See Big Data [3])

Generally, structured data entry is the preferred method for capturing usable information. However, there has always been a tension between the need for standardized structured data and the needs of healthcare providers, who value flexibility and time-saving measures that structured data entry may not allow.[1]

Data Types

Variables in programming languages have different types depending on their content and constraints. Common data types include:

Numbers: whole numbers (integers) or numbers with decimals (floats)

Text: single characters or strings of characters

Dates and times: stored temporal data

Lists and collections: groups of values such as numbers or strings

Binary data: raw bytes, such as image data passed to specialized software.[2]
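The types listed above can be illustrated with a short Python sketch. The variable names and patient values are hypothetical, and other languages name and divide these types differently (Python, for instance, has no separate single-character type).

```python
from datetime import date

# Hypothetical patient values, invented for illustration only.
age_years = 54                                   # integer
temperature_c = 37.2                             # float (number with decimals)
last_name = "Rivera"                             # string (text)
encounter_date = date(2016, 1, 28)               # date
medications = ["lisinopril", "metformin"]        # list (a collection of strings)
image_header = bytes([0x89, 0x50, 0x4E, 0x47])   # binary data (first bytes of a PNG file)

# Collect the name of each variable's type.
type_names = [
    type(value).__name__
    for value in (age_years, temperature_c, last_name,
                  encounter_date, medications, image_header)
]
```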

Data Formats / Representations

When data are transferred or processed, they often come in one of three common formats: XML, JSON, or CSV.

XML (Extensible Markup Language) uses tags, which are text within angle brackets, to mark up pieces of the document. Nesting tags inside one another allows for a hierarchy of data. (See XML [4])
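A minimal sketch of reading a small XML document with Python's standard library follows. The element and attribute names (patient, labs, result) and the values are invented for this example and do not follow any healthcare standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the <result> tags are nested inside <labs>, inside <patient>.
xml_text = """
<patient id="123">
  <name>Jane Doe</name>
  <labs>
    <result test="glucose" unit="mg/dL">95</result>
    <result test="sodium" unit="mmol/L">140</result>
  </labs>
</patient>
""".strip()

root = ET.fromstring(xml_text)

patient_id = root.get("id")        # attribute on the root tag
name = root.findtext("name")       # text inside a child tag

# Walk every nested <result> tag and build a test -> value mapping.
results = {r.get("test"): float(r.text) for r in root.iter("result")}
```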

JSON (JavaScript Object Notation) is a format that represents data as name-value pairs, using colons and commas as separators and curly braces and square brackets for nesting instead of tags, which many people find easier to read. Comparable samples can be found here: [5]
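A minimal sketch of a small JSON document and how it can be read with Python's standard library follows. The field names and values are invented for this example.

```python
import json

# Hypothetical record: curly braces hold name-value pairs, square brackets hold lists.
json_text = """
{
  "id": "123",
  "name": "Jane Doe",
  "labs": [
    {"test": "glucose", "value": 95, "unit": "mg/dL"},
    {"test": "sodium", "value": 140, "unit": "mmol/L"}
  ]
}
"""

record = json.loads(json_text)         # text -> Python dicts and lists
glucose = record["labs"][0]["value"]   # follow the hierarchy by name and position

# Serializing and parsing again gives back the same structure.
round_trip = json.loads(json.dumps(record))
```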

CSV (comma-separated values) is a nonhierarchical structured format that represents data as in a spreadsheet, with columns and rows. Commas separate columns and line breaks separate rows.
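A minimal sketch of reading CSV text with Python's standard library follows. The column names and values are invented for this example. Note that CSV carries no type information, so every value is read as text and must be converted explicitly.

```python
import csv
import io

# Hypothetical data: the first line names the columns, each later line is one row.
csv_text = (
    "patient_id,test,value,unit\n"
    "123,glucose,95,mg/dL\n"
    "123,sodium,140,mmol/L\n"
)

# DictReader uses the header line to label each value in a row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Values arrive as strings; convert them where numbers are needed.
values = [float(row["value"]) for row in rows]
```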

FHIR (Fast Healthcare Interoperability Resources) uses idiomatic XML and JSON to serialize resources.[3] provides free public access to many datasets in above formats.

Storing Data: Databases

A database is a means of storing information in such a way that information can be retrieved from it.[4]

Relational databases are the gold standard of database storage and are also known as SQL databases. SQL stands for Structured Query Language. The SQL-92 revision of the standard, released in 1992, has served as a common baseline since then, and platforms such as Microsoft SQL Server and Oracle each implement it with their own modifications.

A relational database can be thought of simply as a set of tables with rows and columns. Each table is a collection of objects of the same type (the rows) described by a predefined set of columns; together these define a relation. Tables can also be dynamically joined using relationships between them (unlike a simple spreadsheet). The structure of the database tables is known as the database schema. A schema is typically normalized, which reduces duplicated entries. For instance, a table holding each patient’s date of birth can be joined with a table holding the date of each encounter without duplicating any of the data. A Database Management System (DBMS) handles the way data are stored, maintained, and retrieved. In the case of a relational database, a Relational Database Management System (RDBMS) performs these tasks.
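The date-of-birth example can be sketched with SQLite, a small RDBMS bundled with Python. The table names, column names, and patient values are hypothetical. The date of birth is stored once in the patient table and appears next to each encounter only when the tables are joined.

```python
import sqlite3

# In-memory database, discarded when the connection closes.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per patient, one row per encounter.
conn.executescript("""
CREATE TABLE patient (
    patient_id    INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    date_of_birth TEXT NOT NULL
);
CREATE TABLE encounter (
    encounter_id   INTEGER PRIMARY KEY,
    patient_id     INTEGER NOT NULL REFERENCES patient(patient_id),
    encounter_date TEXT NOT NULL
);
INSERT INTO patient VALUES (1, 'Jane Doe', '1962-03-14');
INSERT INTO encounter VALUES (10, 1, '2016-01-28');
INSERT INTO encounter VALUES (11, 1, '2016-06-02');
""")

# Join the two tables on the shared patient_id column.
rows = conn.execute("""
    SELECT p.name, p.date_of_birth, e.encounter_date
    FROM encounter AS e
    JOIN patient AS p ON p.patient_id = e.patient_id
    ORDER BY e.encounter_date
""").fetchall()

conn.close()
```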

Non-relational Databases

Massachusetts General Hospital Utility Multi-Programming System (MUMPS)

MUMPS was developed in 1966, before relational databases existed. It is still widely used in clinical informatics: many healthcare systems and EHR vendors use it, including Allscripts, MEDITECH, Epic, the VA system, and a large part of the DOD. MUMPS is both a programming language and a database. As a database, it stores data in matrices (multidimensional arrays) rather than tables. It is very efficient at complex data manipulation and generally cheaper to run. However, because of its age and the advent of relational databases, few programmers know it and support for it is limited.[5]
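The difference from table storage can be loosely illustrated in Python. This is not MUMPS code and does not reflect MUMPS syntax or any vendor's data layout; it is a minimal sketch assuming each value is addressed by a path of subscripts rather than by a table row and column. The names store, set_node, and children, and the patient values, are hypothetical.

```python
# Hypothetical store: each key is a tuple of subscripts, each value is the data at that node.
store = {}


def set_node(path, value):
    """Store a value under a path of subscripts, e.g. ("PATIENT", 123, "NAME")."""
    store[tuple(path)] = value


def children(*prefix):
    """Return the sorted next-level subscripts found under the given prefix."""
    depth = len(prefix)
    return sorted({key[depth] for key in store
                   if len(key) > depth and key[:depth] == prefix})


# Only the nodes that actually hold data are stored; nothing is reserved for empty cells.
set_node(("PATIENT", 123, "NAME"), "Doe,Jane")
set_node(("PATIENT", 123, "LAB", "2016-01-28", "GLUCOSE"), 95)
set_node(("PATIENT", 123, "LAB", "2016-06-02", "GLUCOSE"), 101)

top_level = children("PATIENT", 123)
lab_dates = children("PATIENT", 123, "LAB")
```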


MapReduce

MapReduce is a programming model originally developed by Google that allows optimized querying in massively parallel environments. Each query is split into smaller subtasks, so many computers can execute portions of it at the same time. With appropriate hardware, this parallel computing can also reduce communication cost. Hadoop is a popular open-source implementation of MapReduce.
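The split into subtasks can be sketched on a single machine. In the minimal example below, each inner list stands in for records held on a different computer; a real MapReduce system such as Hadoop would run the map and reduce steps on separate machines. The diagnosis codes and the function names map_phase, shuffle, and reduce_phase are chosen for this illustration only.

```python
from collections import defaultdict

# Hypothetical partitions: each list represents the records stored on one machine.
partitions = [
    ["E11.9", "I10", "E11.9"],
    ["I10", "J45.909"],
    ["E11.9", "I10"],
]


def map_phase(records):
    """Map step: emit a (key, 1) pair for every record in one partition."""
    return [(code, 1) for code in records]


def shuffle(mapped_outputs):
    """Shuffle step: group all emitted values by key across partitions."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


# Each partition is mapped independently, so these calls could run in parallel.
mapped = [map_phase(partition) for partition in partitions]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```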

Using Data in Healthcare and Beyond

Health information systems (HIS) are clinical systems that capture, store, manage, or transmit information related to the health of individuals. Structured data can be presented in easy-to-understand formats such as flow sheets. The data also enable powerful clinical decision support systems, such as health maintenance reminders. Databases allow chart data to be searched, and creating a document can be easier when certain elements auto-populate for the provider. (See [6])

Various initiatives to share information across health systems are collectively called Health Information Exchange (HIE). These include transferring a single patient’s records and, on a larger scale, aggregating patient data across hospital systems. (See [7])

Knowledge Discovery and Data Mining (KDDM)

This is an interdisciplinary area focusing on statistical methodologies for extracting useful knowledge that is not otherwise obvious from data.[6]

Searching a patient’s electronic chart can be a simple form of KDDM, but KDDM can also be used for more complex tasks such as predictive analytics, for example predicting hospital readmission risk. Additionally, KDDM can learn patterns and relationships in data without a specific goal; this is known as unsupervised learning. A popular example is the recommendation algorithm from Amazon, which provides personalized recommendations of products, such as music or playlists, to purchase based on a customer’s previous purchase history.[7]
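The idea of finding patterns without a predefined goal can be illustrated with a toy recommender that counts which items are bought together. This is not Amazon's algorithm, which is far more elaborate; it is a minimal sketch, and the purchase data and the function name recommend are invented for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: one set of items per customer.
purchases = [
    {"album_a", "album_b"},
    {"album_a", "album_b", "album_c"},
    {"album_b", "album_c"},
    {"album_a", "album_b"},
]

# Count how often each pair of items appears in the same purchase history.
co_counts = Counter()
for basket in purchases:
    for first, second in combinations(sorted(basket), 2):
        co_counts[(first, second)] += 1
        co_counts[(second, first)] += 1


def recommend(item, top_n=1):
    """Return the items most often bought together with the given item."""
    scores = Counter({other: count
                      for (source, other), count in co_counts.items()
                      if source == item})
    return [other for other, _ in scores.most_common(top_n)]


for_album_a = recommend("album_a")
for_album_b = recommend("album_b", top_n=2)
```

No labels or target outcomes are supplied; the pairings emerge from the purchase data alone.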


  1. Rosenbloom ST, Denny JC, Xu H, Lorenzi N, Stead WW, Johnson KB. Data from clinical notes: a perspective on the tension between structure and flexible documentation. J Am Med Inform Assoc. 2011;18(2):181-186. doi: 10.1136/jamia.2010.007237.
  2. Finnell JT, Dixon BE. Clinical informatics study guide: text and review. 1st ed. 2016 ed. Cham: Springer; 2015. doi: 10.1007/978-3-319-22753-5.
  3. Mandel JC, Kreda DA, Mandl KD, Kohane IS, Ramoni RB. SMART on FHIR: a standards-based, interoperable apps platform for electronic health records. J Am Med Inform Assoc. 2016;23(5):899-908. doi: 10.1093/jamia/ocv189.
  5. Vorhies W. MUMPS – the most important database you (probably) never heard of. Blog post. January 28, 2016.
  6. IBM. Knowledge discovery and data mining. Updated 2016.
  7. Brusilovsky P, Kobsa A, Nejdl W, eds. The adaptive web: methods and strategies of web personalization. Vol 4321. New York: Springer; 2007.