Data Lake
Revision as of 18:05, 27 October 2020
A data lake is a central repository that stores structured and unstructured data from many sources and allows that data to flow on to other systems. The concept is akin to a lake fed by multiple streams: data from each source fills the reservoir and is stored as is, before it flows out to various applications within an organization.
Functions of a Data Lake
Data Ingestion
The purpose of this step is to set up an ingestion pipeline for each type of data that needs to be stored. Within the clinical context, this may include the text of all clinical encounters from the EHR, imaging data from the PACS, and laboratory observations. There is also the potential to create a streaming data pipeline for HL7 messages.
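As a rough illustration of the ingestion step, the sketch below splits a pipe-delimited HL7 v2 message into segments and fields and keeps the raw message next to the parsed form. It is a minimal sketch, not a full HL7 parser: the function names `parse_hl7_v2` and `ingest`, the in-memory `sink` list, and the sample message are all hypothetical and chosen only for illustration.

```python
def parse_hl7_v2(message):
    """Split a pipe-delimited HL7 v2 message into segments and fields.

    Simplified on purpose: component (^) and repetition (~) separators
    are left untouched inside each field.
    """
    segments = []
    # HL7 v2 separates segments with carriage returns; accept newlines too.
    for line in message.replace("\n", "\r").split("\r"):
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        segments.append({"segment": fields[0], "fields": fields[1:]})
    return segments


def ingest(messages, sink):
    """Append each message to the sink, raw text first, parsed view second."""
    for raw in messages:
        sink.append({"raw": raw, "segments": parse_hl7_v2(raw)})
    return sink


# Hypothetical sample message, invented for this example.
SAMPLE = (
    "MSH|^~\\&|LAB|HOSP|LAKE|HOSP|202010271805||ORU^R01|MSG0001|P|2.5\r"
    "PID|1||12345\r"
    "OBX|1|NM|HR^Heart rate||72|bpm"
)

if __name__ == "__main__":
    lake = ingest([SAMPLE], [])
    for seg in lake[0]["segments"]:
        print(seg["segment"], seg["fields"])
```

Keeping the raw message alongside the parsed segments follows the idea of storing data as is, so that later consumers can re-parse it differently if needed.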
Streaming Data Tools
Data Storage and Retention
The purpose of data lake storage is to provide a central repository for various data types, both structured and unstructured. For example, image files and vital signs can be stored in the same location. Advantages of cloud-based storage, such as that offered by AWS or Azure, include cheaper long-term storage and the ability to scale easily as more space is required.
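One way to picture a single location holding different data types is a common root with one folder per source and date. The sketch below uses the local filesystem as a stand-in for cloud object storage; the layout, the source names, and the helpers `lake_path` and `store_raw` are assumptions made for illustration, not a prescribed design.

```python
from datetime import date
from pathlib import Path


def lake_path(root, source, day, filename):
    """Build a date-partitioned path such as root/imaging/year=2020/month=10/day=27/x.dcm."""
    return (
        Path(root)
        / source
        / f"year={day.year}"
        / f"month={day.month:02d}"
        / f"day={day.day:02d}"
        / filename
    )


def store_raw(root, source, day, filename, payload):
    """Write the payload bytes unchanged and return the path they were written to."""
    target = lake_path(root, source, day, filename)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as root:
        day = date(2020, 10, 27)
        # An image file and a vital-signs record stored under the same root.
        print(store_raw(root, "imaging", day, "scan001.dcm", b"\x00\x01binary"))
        print(store_raw(root, "vitals", day, "hr.json", b'{"heart_rate": 72}'))
```

Partitioning by source and date is only one common convention; the point is that binary images and structured observations sit side by side without a shared schema.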
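Tools
- Apache Hive: https://hive.apache.org/
- Apache HBase: https://hbase.apache.org/
- MapR-DB: https://docs.datafabric.hpe.com/51/MapROverview/c_maprdb_new.html
- Azure Data Lake Storage: https://azure.microsoft.com/en-us/services/storage/data-lake-storage/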
Data Processing
- Tools
Data Access
- Tools
Difference from Data Warehouse
A data warehouse is a database for relational data, in which the data is extracted, cleaned, and transformed before being stored in a pre-defined schema. The stored data is optimized for fast SQL queries.
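The warehouse pattern can be sketched with Python's built-in `sqlite3` module: the schema is fixed up front, and rows are cleaned and converted before they are inserted. The table name `lab_result`, its columns, and the `load_warehouse` helper are hypothetical and serve only as an example.

```python
import sqlite3


def load_warehouse(rows):
    """Clean and transform rows, then load them into a pre-defined schema."""
    conn = sqlite3.connect(":memory:")
    # The schema is declared before any data arrives.
    conn.execute(
        "CREATE TABLE lab_result ("
        "patient_id TEXT NOT NULL, "
        "test TEXT NOT NULL, "
        "value REAL NOT NULL)"
    )
    # Transform step: trim whitespace, normalize case, convert to float.
    cleaned = [
        (patient_id.strip(), test.strip().lower(), float(value))
        for patient_id, test, value in rows
    ]
    conn.executemany("INSERT INTO lab_result VALUES (?, ?, ?)", cleaned)
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = load_warehouse([(" p1 ", "Glucose", "5.4"), ("p2", "glucose", "7.1")])
    query = "SELECT AVG(value) FROM lab_result WHERE test = 'glucose'"
    print(conn.execute(query).fetchone()[0])
```

Because every row already fits the schema, the SQL query at the end needs no further parsing or cleaning.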
A data lake stores raw data from both relational and non-relational sources, without having to fit it within the constraints of a single database schema. Depending on the analytics required by various areas of the organization, the extract, transform, and load steps are performed within the data lake and the results are distributed to each client according to its needs[1]. In the clinical setting, this allows free-text progress notes, laboratory observations, and imaging data to be stored in the same central location and to be used and analyzed together.
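The contrast with a warehouse can be sketched as follows: records are kept in their raw form, and each consumer applies its own structure only when reading. The record layout and the functions `read_labs` and `read_notes` are hypothetical examples, assuming the raw data happens to be stored as one JSON document per line.

```python
import json

# Raw records of different shapes stored together, unchanged (invented sample data).
RAW_RECORDS = [
    '{"patient_id": "p1", "type": "lab", "test": "glucose", "value": 5.4, "unit": "mmol/L"}',
    '{"patient_id": "p1", "type": "note", "text": "Patient reports feeling well."}',
    '{"patient_id": "p2", "type": "lab", "test": "glucose", "value": 7.1, "unit": "mmol/L"}',
]


def read_labs(raw_lines):
    """Consumer 1: project lab records into (patient_id, test, value) rows."""
    rows = []
    for line in raw_lines:
        rec = json.loads(line)
        if rec.get("type") == "lab":
            rows.append((rec["patient_id"], rec["test"], float(rec["value"])))
    return rows


def read_notes(raw_lines):
    """Consumer 2: collect free-text notes keyed by patient."""
    notes = {}
    for line in raw_lines:
        rec = json.loads(line)
        if rec.get("type") == "note":
            notes[rec["patient_id"]] = rec["text"]
    return notes


if __name__ == "__main__":
    print(read_labs(RAW_RECORDS))
    print(read_notes(RAW_RECORDS))
```

Both consumers read the same stored lines, so adding a third kind of analysis would not require changing how the data was written.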
Data Swamp
A data swamp is a data lake that has become unruly.
References
- Holmes DE. Big data [Internet]. Amazon. Oxford University Press; 2017 [cited 2020 Oct 26]. Available from: https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Submitted by Tom Nahass