Data Lake

From Clinfowiki

A data lake is a central repository that stores structured and unstructured data from many sources and lets that data flow onward to its consumers. The concept is akin to a lake fed by multiple streams: data fills the reservoir and is stored as is, before it is allowed to flow out to various applications within an organization.

Functions of a Data Lake

Data Ingestion

The purpose of this step is to set up an ingestion pipeline for the different kinds of data that need to be stored. Within the clinical context, this may include the text of all clinical encounters from the EHR, imaging data from the PACS, and laboratory observations. There is also the potential to create an HL7 streaming data pipeline.

Tools

  • Apache Flume
  • Apache Kafka
  • Apache Storm
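
As a minimal sketch of what an ingestion step might do, the Python below lands a raw HL7 v2 message in the lake unchanged, in a folder partitioned by sending application and date, next to a small metadata file read from the MSH segment. The folder layout, the function names `parse_msh` and `ingest_hl7`, and the sample message are illustrative assumptions and do not come from any of the tools listed above. A production pipeline would more likely receive messages from a broker such as Kafka rather than from a string in memory.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def parse_msh(message: str) -> dict:
    """Read a few routing fields from the MSH segment of an HL7 v2 message."""
    segments = [s for s in message.replace("\n", "\r").split("\r") if s]
    msh = segments[0]
    if not msh.startswith("MSH"):
        raise ValueError("HL7 v2 message must start with an MSH segment")
    field_sep = msh[3]  # MSH-1 is the field separator itself, usually "|"
    fields = msh.split(field_sep)
    # After the split, fields[0] is "MSH" and fields[1] is MSH-2,
    # so MSH-n is found at fields[n - 1].
    return {
        "sending_application": fields[2],  # MSH-3
        "timestamp": fields[6],            # MSH-7
        "message_type": fields[8],         # MSH-9
    }


def ingest_hl7(message: str, lake_root: Path) -> Path:
    """Store the message as is, with a JSON metadata file beside it."""
    meta = parse_msh(message)
    day = meta["timestamp"][:8]  # YYYYMMDD prefix of MSH-7
    payload = message.encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    target_dir = lake_root / "raw" / "hl7" / meta["sending_application"] / day
    target_dir.mkdir(parents=True, exist_ok=True)
    raw_path = target_dir / f"{digest}.hl7"
    raw_path.write_bytes(payload)  # bytes, so segment separators are not altered
    (target_dir / f"{digest}.json").write_text(json.dumps(meta), encoding="utf-8")
    return raw_path


# Hypothetical sample message; segments are separated by carriage returns.
sample = "\r".join([
    r"MSH|^~\&|LAB|HOSP|EHR|HOSP|20201026120000||ORU^R01|MSG0001|P|2.5",
    "PID|1||12345^^^HOSP^MR||DOE^JANE",
    "OBX|1|NM|2345-7^Glucose^LN||98|mg/dL",
])

with tempfile.TemporaryDirectory() as tmp:
    stored = ingest_hl7(sample, Path(tmp))
    print(stored.relative_to(tmp))
```

The message body is written exactly as received, so later processing steps can always return to the original.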

Data Storage and Retention

The purpose of data lake storage is to provide a central repository for various data types, structured and unstructured. Image files and vital signs, for example, can be stored in the same location. Advantages of a cloud-based storage service, such as those offered by AWS or Azure, include cheaper long-term storage and the ability to scale easily as more space is required.

Tools

  • Apache Hive
  • Apache HBase
  • MapR-DB
  • Azure Data Lake Storage
  • AWS Data Lake Storage
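
One way retention can keep long-term storage cheap is to move objects to a lower-cost tier as they age. The sketch below is only an illustration of that idea: the tier names and the day thresholds in `TIERS` are hypothetical, and real values depend on the organization's retention policy and its provider's pricing.

```python
from datetime import date

# Hypothetical thresholds: (maximum age in days, tier name).
TIERS = [(90, "hot"), (365, "cool")]


def storage_tier(last_accessed: date, today: date) -> str:
    """Pick a storage tier from the number of days since last access."""
    age_days = (today - last_accessed).days
    for max_age, tier in TIERS:
        if age_days <= max_age:
            return tier
    return "archive"  # anything older than the last threshold


print(storage_tier(date(2020, 10, 1), date(2020, 10, 26)))
```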

Data Processing

A number of distributed data processing tools, largely based on the Hadoop and Spark frameworks, exist to run extract, load, transform (ELT) jobs that clean the data for downstream users. During this phase, a number of transformations can take place, including machine learning pipelines that analyze the various types of data.

Tools

  • MapReduce - a programming model and an associated implementation for processing big data
  • Apache Hive - a data warehouse system
  • Apache Spark - a unified analytics engine for large-scale data processing
  • Apache Storm - an open source distributed real-time computation system
  • Apache Drill - a schema-free SQL query engine for Hadoop, NoSQL and cloud storage
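
The cleaning step in this phase can be pictured with a small, single-machine sketch. The Python below normalizes raw body weight observations to kilograms and returns new rows without modifying the raw ones. The field names (`patient_id`, `value`, `unit`) and the function `clean_weights` are assumptions made for illustration; a real lake would run equivalent logic on a distributed engine such as Spark.

```python
from typing import Dict, Iterable, List

LB_TO_KG = 0.45359237  # exact by definition of the international pound


def clean_weights(raw_rows: Iterable[Dict]) -> List[Dict]:
    """Return normalized copies of raw weight rows; the input is not modified."""
    cleaned = []
    for row in raw_rows:
        try:
            value = float(row["value"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric value
        unit = str(row.get("unit", "")).strip().lower()
        if unit in ("lb", "lbs"):
            value *= LB_TO_KG
        elif unit != "kg":
            continue  # skip units this sketch does not know how to convert
        cleaned.append({
            "patient_id": row.get("patient_id"),
            "weight_kg": round(value, 2),
        })
    return cleaned


# Hypothetical raw rows as they might arrive from different source systems.
raw = [
    {"patient_id": "A", "value": "154", "unit": "lb"},
    {"patient_id": "B", "value": "70.0", "unit": "kg"},
    {"patient_id": "C", "value": "n/a", "unit": "kg"},
    {"patient_id": "D", "value": "11", "unit": "stone"},
]
print(clean_weights(raw))
```

Rows that cannot be cleaned are skipped here for brevity; a fuller pipeline would probably route them to a review queue instead of dropping them silently.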

Data Access

These are the streams that flow out from the data lake to the various end users. Similar to the data mart output from a data warehouse, this access layer delivers the final product for any dashboard or analytic work that may be necessary. Access works on the processed data, so the raw data is left intact. Depending on the department, various tools and workflows can be created to allow the free flow of data.

Tools

  • Qlik
  • Tableau
  • Spotfire
  • REST APIs
  • Apache Kafka
  • Database Query Engines
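
A department-specific outflow can be sketched as a view that exposes only the processed columns a department is meant to see. The department names, column names, and the `department_view` function below are hypothetical and serve only to illustrate the idea; in practice this role is usually filled by the query engine or API layer.

```python
# Hypothetical mapping of department to the processed columns it may see.
DEPARTMENT_COLUMNS = {
    "cardiology": ("patient_token", "ejection_fraction"),
    "quality": ("patient_token", "length_of_stay_days"),
}


def department_view(processed_rows, department):
    """Return copies of the processed rows restricted to one department's columns."""
    columns = DEPARTMENT_COLUMNS.get(department)
    if columns is None:
        raise PermissionError(f"no view defined for department {department!r}")
    return [{c: row.get(c) for c in columns} for row in processed_rows]


processed = [
    {"patient_token": "t1", "ejection_fraction": 55, "length_of_stay_days": 3},
    {"patient_token": "t2", "ejection_fraction": 40, "length_of_stay_days": 7},
]
print(department_view(processed, "cardiology"))
```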

Considerations

Whether to implement a data lake largely depends on the organization's needs and on the use cases for the data that is available and must be stored. Clinical data from a variety of sources is a good example of the multiple data types that would need to be stored and eventually analyzed. Different departments may also want different tools for analyzing their specific cohort of patients.


Whether to store this data in a local data lake or on a cloud platform largely depends on the existing infrastructure, the capabilities of the IT staff, and the cost to the organization. Managed platform-as-a-service (PaaS) offerings such as Databricks allow an organization to get a system up and running quickly. Most of these PaaS offerings are based on the open source tools noted above, but the cost of maintaining these distributed clusters can quickly become difficult to manage.


Another consideration within the healthcare sector is HIPAA and data privacy and security. Access control, encryption, and data obfuscation are typically built into PaaS offerings, whereas a locally developed implementation must maintain its own security.
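
To make data obfuscation concrete, the sketch below replaces a patient identifier with a keyed hash (HMAC-SHA256) and drops directly identifying fields before a row leaves the raw zone. The field names, the `drop_fields` default, and the key handling are assumptions for illustration. This is not a statement of what HIPAA requires, and a locally built system would still need key management, access control, and encryption around it.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Map a patient ID to a stable token that cannot be recomputed without the key."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def obfuscate_row(row: dict, secret_key: bytes, drop_fields=("name", "address")) -> dict:
    """Return a copy of the row with direct identifiers removed and the ID tokenized."""
    safe = {k: v for k, v in row.items() if k not in drop_fields}
    safe["patient_id"] = pseudonymize(str(row["patient_id"]), secret_key)
    return safe


# Hypothetical key; a real key would come from a secrets manager, not source code.
key = b"example-key-do-not-use"
row = {"patient_id": "12345", "name": "Jane Doe", "address": "1 Main St", "glucose": 98}
print(obfuscate_row(row, key))
```

Because the same ID and key always produce the same token, records for one patient can still be joined across sources after obfuscation.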

Difference from Data Warehouse

A data warehouse is a database for relational data, where the data is extracted, cleaned, and transformed before being stored in a predefined schema. The stored data is optimized for fast SQL queries.

A data lake stores the raw data from both relational and non-relational sources, without having to fit it within the constraints of a single database schema. Depending on the analytics required by various areas of the organization, the extract, transform, and load steps are performed within the data lake and the results are distributed to each client according to its needs[1]. In the clinical setting, this allows free-text progress notes, laboratory observations, and imaging data to be stored in the same central location and to be used and analyzed together.
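
The contrast is sometimes described as schema on write versus schema on read. The sketch below illustrates it under simplifying assumptions: an in-memory SQLite table stands in for the warehouse, a list of JSON strings stands in for the lake, and the two sample records are invented for the example.

```python
import json
import sqlite3

# Hypothetical records: one lab result and one free-text note.
records = [
    {"patient_id": "A", "test": "glucose", "value": 98},
    {"patient_id": "B", "note": "Patient reports mild headache."},
]

# Warehouse style (schema on write): only rows that fit the table are stored.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE labs (patient_id TEXT NOT NULL, test TEXT NOT NULL, value REAL NOT NULL)"
)
rejected = []
for rec in records:
    try:
        warehouse.execute(
            "INSERT INTO labs VALUES (:patient_id, :test, :value)",
            {
                "patient_id": rec.get("patient_id"),
                "test": rec.get("test"),
                "value": rec.get("value"),
            },
        )
    except sqlite3.IntegrityError:
        rejected.append(rec)  # the note has no test or value, so it does not fit

# Lake style (schema on read): every record is kept in raw form.
lake = [json.dumps(rec) for rec in records]


def read_labs(raw_lines):
    """Apply a lab schema only at read time, ignoring records that lack it."""
    for line in raw_lines:
        rec = json.loads(line)
        if "test" in rec and "value" in rec:
            yield rec["patient_id"], rec["test"], float(rec["value"])


def read_notes(raw_lines):
    """Apply a note schema to the same raw records."""
    for line in raw_lines:
        rec = json.loads(line)
        if "note" in rec:
            yield rec["patient_id"], rec["note"]


print(list(read_labs(lake)))
print(list(read_notes(lake)))
```

The warehouse rejects the note at load time, while the lake keeps both records and lets each consumer decide how to interpret them.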

Data Swamp

Without governance, organization, and adequate metadata for each data source, the raw data becomes almost impossible to use. A lake whose storage has become this unruly has been called a data swamp[2]. Specific access control policies and specific metadata for each source help ensure that the lake stays clean.
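
One simple guard against a swamp is to refuse to register a source until its metadata is complete. The sketch below shows that check with an in-memory catalog. The list in `REQUIRED_FIELDS`, the catalog structure, and the function names are hypothetical, and a real deployment would use a dedicated catalog service.

```python
# Hypothetical minimum metadata every source must supply.
REQUIRED_FIELDS = ("source", "owner", "description", "format", "ingested_at")


def missing_metadata(entry: dict) -> list:
    """List the required fields that are absent or blank in a metadata entry."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def register(catalog: dict, dataset_name: str, entry: dict) -> None:
    """Add a dataset to the catalog only if its metadata is complete."""
    missing = missing_metadata(entry)
    if missing:
        raise ValueError(f"{dataset_name}: missing metadata {missing}")
    catalog[dataset_name] = dict(entry)


catalog = {}
register(catalog, "lab_observations", {
    "source": "LAB",
    "owner": "pathology-informatics",
    "description": "HL7 ORU lab results",
    "format": "hl7v2",
    "ingested_at": "2020-10-26",
})
print(sorted(catalog))
```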

Clinical Implementations

A public health application in Mississippi was created to aggregate and store disparate sources of health data in order to track and examine the health workforce across the state. Its goals were to:

  1. Build a centralized data repository that was scalable
  2. Create a data management solution for this repository
  3. Derive value by facilitating access to visualization and analysis of this data [3]

This was a largely successful implementation of a data lake with various data sources. Although it did not involve patient-level data, various other informatics departments have implemented data lakes for their hospital systems.

References

  1. Holmes DE. Big data [Internet]. Amazon. Oxford University Press; 2017 [cited 2020 Oct 26]. Available from: https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
  2. Gartner Says Beware of the Data Lake Fallacy [Internet]. Gartner. [cited 2020 Oct 27]. Available from: https://www.gartner.com/en/newsroom/press-releases/2014-07-28-gartner-says-beware-of-the-data-lake-fallacy
  3. Krause DD. Data Lakes and Data Visualization: An Innovative Approach to Address the Challenges of Access to Health Care in Mississippi. Online J Public Health Inform. 2015;7(3):e225. Published 2015 Dec 30. doi:10.5210/ojphi.v7i3.6047

Submitted by Tom Nahass