Data Warehouse Vs Data Lake vs Data Lakehouse
What is a Data warehouse?
- A data warehouse is a type of data management system that is designed to support business intelligence (BI) activities. It often contains large amounts of historical data which is usually derived from a wide range of sources such as transactional systems, relational databases, and other sources, typically on a regular cadence.
- Organizations build their data warehouse by using technologies like SQL Server, Oracle, Teradata, Redshift, PostgreSQL, etc.
- Both the file format and the processing engine are typically proprietary technologies.
What is Data Lake?
- A data lake is a distributed storage solution that runs on commodity hardware and easily scales out horizontally.
- The data lake architecture, unlike that of the data warehouse, decouples the distributed storage system from the distributed computing system. This allows each system to scale out as needed by the workload.
- Organizations build their data lakes by independently choosing the following:
- Storage system e.g., Azure Data Lake Storage, AWS S3, or Google Cloud Storage.
- File format e.g., Parquet, ORC, JSON, CSV etc.
- Processing engine e.g., Spark, Hive, Hadoop, HDInsight, Amazon EMR, etc.
What is Data Lakehouse?
- A data lake house is a data solution concept that combines elements of the data warehouse with those of the data lake. Data Lake + ACID + Data Governance +Merge
- Organizations build their data lake house by independently choosing the following:
- Storage system e.g., Azure Data Lake Storage, AWS S3, or Google Cloud Storage.
- File format e.g., Delta Lake, Iceberg, etc.
- Processing engine e.g., Azure Synapse, Databricks, Spark, Flink, etc.
Comparison
Feature | Data Warehouse | Data Lake | Data Lakehouse |
Support for Scale-out | No | Yes | Yes |
Support for diverse data format | No | Yes | Yes |
Support for diverse workloads | No | Yes | Yes |
Schema evolution | No | Yes | Yes |
Support For ML | No | Yes | Yes |
Support streaming processing | No | Yes | Yes |
ACID Compliant | Yes | No | Yes |
Support Data Governance | Yes | No | Yes |
Support Merge (Update+ Delete) support. | Yes | No | yes |
Support Indexing | Yes | No | Yes |
Is Storage and Processing engine de-coupled | No | Yes | Yes |
File Format | Proprietary | Parquet, ORC, JSON, CSV etc. | Delta Lake, Iceberg etc. |
Processing engine | SQL Server, Oracle, Teradata, Redshift, etc. | Spark, Hive, Flink, Hadoop, HDInsight, Amazon EMR etc. | Azure Synapse, Databricks, Spark, Flink etc. |
Databricks: The Data Lakehouse Platform for Dummies
Very nicely written….Short and crisp
Thanks, Vishal!!
Nice writeup, and to the point details on DWH, DataLake, and DLH concepts with examples.
Thank you Manoj! I’m learning to be consistent from you. 🙂
Very well explained.
Thanks Alok!
Question:
Where do the NoSQL databases fit in? Eg. DynamoDB has all the characteristics like decoupled query and storage, document db etc. Is it a component of lakehouse and data lakes architecture or just one of them?
My take:
NoSQL databases like DynamoDB, and Cassandra are mainly used in OLTP workloads like shopping carts, customer profiles, gaming platforms, store user activities, etc. where users can live with eventual consistency. Normally high availability is expected from them.
When not to use DynamoDB:
· When multi-item or cross-table transactions are required
· When complex queries and joins are required
· When real-time analytics on historic data is required
Even the amazon blog says DynamoDB is not suitable for OLAP workload. Also, I haven’t heard of anyone using it for OLAP, but I’d be delighted to learn if someone is.
So, I don’t think DynamoDB fit in any of the categories described in the blog.