Data Warehouse Vs Data Lake vs Data Lakehouse

What is a Data warehouse?

A data warehouse is a type of data management system that is designed to support business intelligence (BI) activities. It often contains large amounts of historical data which is usually derived from a wide range of sources such as transactional systems, relational databases, and other sources, typically on a regular cadence.
Organizations build their data warehouse by using technologies like SQL Server, Oracle, Teradata, Redshift, PostgreSQL, etc.
Both the file format and the processing engine are typically proprietary technologies.

What is Data Lake?

A data lake is a distributed storage solution that runs on commodity hardware and easily scales out horizontally.
The data lake architecture, unlike that of the data warehouse, decouples the distributed storage system from the distributed computing system. This allows each system to scale out as needed by the workload.
Organizations build their data lakes by independently choosing the following:
- Storage system e.g., Azure Data Lake Storage, AWS S3, or Google Cloud Storage.
- File format e.g., Parquet, ORC, JSON, CSV etc.
- Processing engine e.g., Spark, Hive, Hadoop, HDInsight, Amazon EMR, etc.

What is Data Lakehouse?

A data lake house is a data solution concept that combines elements of the data warehouse with those of the data lake. Data Lake + ACID + Data Governance +Merge
Organizations build their data lake house by independently choosing the following:
- Storage system e.g., Azure Data Lake Storage, AWS S3, or Google Cloud Storage.
- File format e.g., Delta Lake, Iceberg, etc.
- Processing engine e.g., Azure Synapse, Databricks, Spark, Flink, etc.

Comparison

Feature	Data Warehouse	Data Lake	Data Lakehouse
Support for Scale-out	No	Yes	Yes
Support for diverse data format	No	Yes	Yes
Support for diverse workloads	No	Yes	Yes
Schema evolution	No	Yes	Yes
Support For ML	No	Yes	Yes
Support streaming processing	No	Yes	Yes
ACID Compliant	Yes	No	Yes
Support Data Governance	Yes	No	Yes
Support Merge (Update+ Delete) support.	Yes	No	yes
Support Indexing	Yes	No	Yes
Is Storage and Processing engine de-coupled	No	Yes	Yes
File Format	Proprietary	Parquet, ORC, JSON, CSV etc.	Delta Lake, Iceberg etc.
Processing engine	SQL Server, Oracle, Teradata, Redshift, etc.	Spark, Hive, Flink, Hadoop, HDInsight, Amazon EMR etc.	Azure Synapse, Databricks, Spark, Flink etc.

Databricks: The Data Lakehouse Platform for Dummies

7 comments

Vishal Attray says:

July 12, 2022 at 7:06 pm

Very nicely written….Short and crisp

1. parmsingla says:
  
  July 13, 2022 at 4:08 am
  
  Thanks, Vishal!!
  
Manoj Pandey says:

July 13, 2022 at 7:16 am

Nice writeup, and to the point details on DWH, DataLake, and DLH concepts with examples.

1. parmsingla says:
  
  July 15, 2022 at 9:24 am
  
  Thank you Manoj! I’m learning to be consistent from you. 🙂
  
Alok says:

July 14, 2022 at 9:12 pm

Very well explained.

1. parmsingla says:
  
  July 15, 2022 at 9:23 am
  
  Thanks Alok!
  
parmsingla says:

July 17, 2022 at 2:29 pm

Question:
Where do the NoSQL databases fit in? Eg. DynamoDB has all the characteristics like decoupled query and storage, document db etc. Is it a component of lakehouse and data lakes architecture or just one of them?
My take:
NoSQL databases like DynamoDB, and Cassandra are mainly used in OLTP workloads like shopping carts, customer profiles, gaming platforms, store user activities, etc. where users can live with eventual consistency. Normally high availability is expected from them.
When not to use DynamoDB:
· When multi-item or cross-table transactions are required
· When complex queries and joins are required
· When real-time analytics on historic data is required
Even the amazon blog says DynamoDB is not suitable for OLAP workload. Also, I haven’t heard of anyone using it for OLAP, but I’d be delighted to learn if someone is.
So, I don’t think DynamoDB fit in any of the categories described in the blog.

Data Warehouse Vs Data Lake vs Data Lakehouse

Leave a Reply Cancel reply