Data Warehouse Vs Data Lake vs Data Lakehouse

Data Warehouse Vs Data Lake vs Data Lakehouse

What is a Data warehouse?

  • A data warehouse is a type of data management system that is designed to support business intelligence (BI) activities. It often contains large amounts of historical data which is usually derived from a wide range of sources such as transactional systems, relational databases, and other sources, typically on a regular cadence.
  • Organizations build their data warehouse by using technologies like SQL Server, Oracle, Teradata, Redshift, PostgreSQL, etc.
  • Both the file format and the processing engine are typically proprietary technologies.

What is Data Lake?

  • A data lake is a distributed storage solution that runs on commodity hardware and easily scales out horizontally.
  • The data lake architecture, unlike that of the data warehouse, decouples the distributed storage system from the distributed computing system. This allows each system to scale out as needed by the workload.
  • Organizations build their data lakes by independently choosing the following:
    • Storage system e.g., Azure Data Lake Storage, AWS S3, or Google Cloud Storage.
    • File format e.g., Parquet, ORC, JSON, CSV etc.
    • Processing engine e.g., Spark, Hive, Hadoop, HDInsight, Amazon EMR, etc.

What is Data Lakehouse?

  • A data lake house is a data solution concept that combines elements of the data warehouse with those of the data lake. Data Lake + ACID + Data Governance +Merge
  • Organizations build their data lake house by independently choosing the following:
    • Storage system e.g., Azure Data Lake Storage, AWS S3, or Google Cloud Storage.
    • File format e.g., Delta Lake, Iceberg, etc.
    • Processing engine e.g., Azure Synapse, Databricks, Spark, Flink, etc.

Comparison

Feature Data Warehouse Data Lake Data Lakehouse
Support for Scale-out No Yes Yes
Support for diverse data format No Yes Yes
Support for diverse workloads No Yes Yes
Schema evolution No Yes Yes
Support For ML No Yes Yes
Support streaming processing No Yes Yes
ACID Compliant Yes No Yes
Support Data Governance Yes No Yes
Support Merge (Update+ Delete) support. Yes No yes
Support Indexing Yes No Yes
Is Storage and Processing engine de-coupled No Yes Yes
File Format Proprietary Parquet, ORC, JSON, CSV etc. Delta Lake, Iceberg etc.
Processing engine SQL Server, Oracle, Teradata, Redshift, etc. Spark, Hive, Flink, Hadoop, HDInsight, Amazon EMR etc. Azure Synapse, Databricks, Spark, Flink etc.

Databricks: The Data Lakehouse Platform for Dummies

 

7 comments

  1. Question:
    Where do the NoSQL databases fit in? Eg. DynamoDB has all the characteristics like decoupled query and storage, document db etc. Is it a component of lakehouse and data lakes architecture or just one of them?
    My take:
    NoSQL databases like DynamoDB, and Cassandra are mainly used in OLTP workloads like shopping carts, customer profiles, gaming platforms, store user activities, etc. where users can live with eventual consistency. Normally high availability is expected from them.
    When not to use DynamoDB:
    · When multi-item or cross-table transactions are required
    · When complex queries and joins are required
    · When real-time analytics on historic data is required
    Even the amazon blog says DynamoDB is not suitable for OLAP workload. Also, I haven’t heard of anyone using it for OLAP, but I’d be delighted to learn if someone is.
    So, I don’t think DynamoDB fit in any of the categories described in the blog.

Leave a Reply

Your email address will not be published. Required fields are marked *