Azure 70-775: Big Data Engineering with HDInsight

1) Administer and Provision HDInsight Clusters

Deploy HDInsight clusters
- Create a cluster in a private virtual network, create a cluster that has a custom metastore, create a domain-joined cluster, select an appropriate cluster type based on workload considerations, customize a cluster by using script actions, provision a cluster by using Portal, provision a cluster by using Azure CLI tools, provision a cluster by using Azure Resource Manager (ARM) templates and PowerShell, manage managed disks, configure vNet peering
- 1. Connecting HDInsight Clusters with Virtual Networks
    
    Some key terms: Forced Tunneling, Software and Hardware VPN, Recursive Resolver
    
    Using the Azure Portal to Create Customized HDInsight Clusters
    
    Configure Domain-joined HDInsight clusters
    
    The diagram is very informative and conveys the idea of the tutorial.
    
    Customize Linux-based HDInsight clusters using Script Action
    
    The PowerShell script needed an extra -ResourceManager parameter to work; modified script
    Some key terms: ZooKeeper, Persisted and Ad-Hoc Scripts
    
    https://community.hortonworks.com/articles/797/hdinsight-deployment-best-practices.html
Deploy and secure multi-user HDInsight clusters
- Provision users who have different roles; manage users, groups, and permissions through Apache Ambari, PowerShell, and Apache Ranger; configure Kerberos; configure service accounts; implement SSH tunneling; restrict access to data
- 1. Configure Domain-Joined HDInsight Clusters
    
    To proceed, you must be an owner of your subscription, and the subscription must be allowed to create premium clusters
    
    Manage Domain-joined HDInsight clusters
    
    Configure Hive policies in Domain-joined HDInsight
    
    Configure Domain-joined HDInsight clusters (Preview) using Azure PowerShell
    
    Video – Configure AAD, and create HDInsight cluster
    
    Powershell Script for Domain-Joining HDInsight clusters
    
    Video – Ignite 2016: Secure your Enterprise Hadoop environments on Azure
Ingest data for batch and interactive processing
- Ingest data from cloud or on-premises data; store data in Azure Data Lake; store data in Azure Blob Storage; perform routine small writes on a continuous basis using Azure CLI tools; ingest data in Apache Hive and Apache Spark by using Apache Sqoop, Azure Data Factory (ADF), AzCopy, and AdlCopy; ingest data from an on-premises Hadoop cluster
- 1. Collecting and loading data into HDInsight
    
    A very informative link and definitely worth reading. I have only skimmed it, but I think it is a very good resource for future use as well.
    Key terms: Mahout ML lib for collaborative filtering, Oozie to form multi-step workflows
    
    Upload Data Using Command Line Tools
    
    Powershell Script Used for the tutorial
    Further work: mount Blob storage as a local drive, and try the Hadoop CLI
    Some new things encountered: Significance of 0 byte files and / character in blob storage
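The note about the / character can be illustrated with a minimal sketch. It assumes (as is usual for Blob storage) that the namespace is flat, that "/" in a blob name only simulates folders, and that a 0-byte blob whose name ends in "/" acts as a folder placeholder. The blob names and the helper `list_virtual_dir` are hypothetical, made up for illustration.

```python
# Sketch: list the "children" of a virtual folder from a flat list of blob names.
def list_virtual_dir(blob_names, prefix):
    """Return the immediate children of a virtual directory given flat blob names."""
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    children = set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if not rest:
            continue  # the placeholder blob for the directory itself
        head, sep, _ = rest.partition("/")
        children.add(head + sep)  # a trailing "/" marks a sub-directory
    return sorted(children)

# Hypothetical blob names; the first one stands for a 0-byte placeholder blob.
blobs = [
    "example/data/",
    "example/data/sample.log",
    "example/data/2017/jan.csv",
    "example/jars/udf.jar",
]
print(list_virtual_dir(blobs, "example/data"))
```

Nothing here talks to Azure; it only shows why tools can present folders even though the store holds plain names.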
    
    Using Sqoop with HDInsight through SSH
    
    Hit the ‘Adaptive Server Connection Failed’ FreeTDS error. To bypass it, create the SQL database on the fly while creating the cluster. From CMD, the error appears when you create a new database.
    
    Use Oozie with Hadoop to Define and Run a Workflow
    
    The file upload script doesn’t work; I used Storage Explorer instead to do the job.
    This tutorial is now obsolete, as HDInsight has moved from Windows to Linux.
    Follow this tutorial instead
    
    Extra Tutorial on analysis using Hive and HDI and HDFS
Configure HDInsight clusters
- Manage metastore upgrades; view and edit Ambari configuration groups; view and change service configurations through Ambari; access logs written to Azure Table storage; enable heap dumps for Hadoop services; manage HDInsight configuration, use HDInsight .NET SDK, and PowerShell; perform cluster-level debugging; stop and start services through Ambari; manage Ambari alerts and metrics
- 1. Customize HDInsight Clusters with Bootstrap Configurations
    
    Monitoring HDInsight with Ambari API
    
    Uses the REST API to communicate with Ambari for cluster information
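As a rough sketch of what such a call looks like, the snippet below builds (but does not send) a GET request against the Ambari REST API of a cluster. The cluster name, user, and password are placeholders, and `ambari_request` is a hypothetical helper; the URL shape assumes the usual `https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters/CLUSTERNAME` endpoint with HTTP basic authentication.

```python
import base64
import urllib.request

def ambari_request(cluster_name, user, password, resource=""):
    """Build (but do not send) a GET request against the Ambari REST API."""
    url = f"https://{cluster_name}.azurehdinsight.net/api/v1/clusters/{cluster_name}"
    if resource:
        url += "/" + resource.lstrip("/")
    # HTTP basic authentication header from the cluster login credentials
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", "Basic " + token)
    return req

# Placeholder values, not a real cluster
req = ambari_request("mycluster", "admin", "placeholder-password", "services/HIVE")
print(req.full_url)
# To actually query a cluster, pass req to urllib.request.urlopen and parse the JSON body.
```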
    
    Manage HDInsight Using Azure CLI
Manage and debug HDInsight jobs
- Describe YARN architecture and operation; examine YARN jobs through ResourceManager UI and review running applications; use YARN CLI to kill jobs; find logs for different types of jobs; debug Hadoop and Spark jobs; use Azure Operations Management Suite (OMS) to monitor and manage alerts, and perform predictive actions
- 1. Manage Hadoop Clusters in HDInsight with the Ambari Portal
    
    Manage HDInsight Clusters with the Ambari Web UI
    
    Important, implements SSH Tunneling
    Ambari and Hive Views
    
    Video: Manage and Troubleshoot Infrastructure with Operations Management Suite
    
    GitHub Scripts to Monitor HBase Clusters with OMS

2) Implement Big Data Batch Processing Solutions

Implement batch solutions with Hive and Apache Pig
- Define external Hive tables; load data into a Hive table; use partitioning and bucketing to improve Hive performance; use semi-structured files such as XML and JSON with Hive; join tables with Hive using shuffle joins and broadcast joins; invoke Hive UDFs with Java and Python; design scripts with Pig; identify query bottlenecks using the Hive query graph; identify the appropriate storage format, such as Apache Parquet, ORC, Text, and JSON
  - Hive tables
  - Use partitioning and bucketing to improve Hive performance
    
    More on partitioning
    Bucketing and Partitioning examples Bucketing vs Partitioning More on ORC
    More on Vectorisation
  - Use semi-structured files such as XML and JSON with Hive
    
    XML Files not done
    JSON SerDe not done
  - Join tables with Hive using shuffle joins and broadcast joins
    
    Slides on various Joins
    Broadcast Join syntax
  - Invoke Hive UDFs with Java and Python
    Java UDF format:
        package com.microsoft.examples;

        import org.apache.hadoop.hive.ql.exec.UDF;

        public class ExampleUDF extends UDF {
            // Accept a string input
            public String evaluate(String input) {
                // If the value is null, return a null
                if (input == null) return null;
                // Lowercase the input string and return it
                return input.toLowerCase();
            }
        }
    Hive Query:
        ADD JAR wasbs:///example/jars/ExampleUDF-1.0-SNAPSHOT.jar;
        CREATE TEMPORARY FUNCTION tolower as 'com.microsoft.examples.ExampleUDF';
        SELECT tolower(deviceplatform) FROM hivesampletable LIMIT 10;
    
    Hive QL Intro
    
    New terms: LLAP to speed up HQL in Hive 2.0, Apache Tez
    
    Hive Queries with Powershell
    
    Also has the list of HDInsight commands for PowerShell; important for the exam.
    Use Get-Credential to get the username and password from a PowerShell script, and Here-Strings for complex HQL queries.
    PS Script used for the tutorial
    
    Hive Queries with Interactive Hive View
    
    Visualisation tools offer a good insight
    
    Connecting Excel ODBC to Hive
    
    Install both 32 and 64 bit drivers, to avoid “Application Driver Mismatch” Error.
    
    JSON SerDe with Hive
    
    mvn package doesn’t get tars in target; there are some problems.
    Try it on Linux later
    
    JSON UDFs
    
    Not clear on import command
    New Terms: Lateral View in Hive
    
    Use Pig with HDInsight
    
    Script used for the tutorial and Pig Batch Query used
    
    Use Python with Hive and Pig in HDInsight
    
    Hive Query used and Python Script used
    Pig Script used and Python Script used
    New concepts: Jython; Pig runs on a native JVM; running hdfs dfs -put file /file actually uploads the file to wasbs:///
    The PowerShell and C Python parts of the tutorial are left; they seem trivial and can be covered during revision
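For revision, here is a minimal sketch of the kind of Python script Hive can stream rows through (the TRANSFORM ... USING pattern). It assumes Hive passes each row as one tab-separated line on standard input and reads tab-separated lines back; the three input columns (clientid, devicemake, devicemodel) and the function names are assumptions for illustration, not taken from the tutorial scripts.

```python
import hashlib
import io
import sys

def transform_line(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    clientid, devicemake, devicemodel = line.rstrip("\n").split("\t")
    phone_label = devicemake + " " + devicemodel
    # Emit a hash of the label as an example of a computed column
    digest = hashlib.md5(phone_label.encode("utf-8")).hexdigest()
    return "\t".join([clientid, phone_label, digest])

def main(stream=sys.stdin, out=sys.stdout):
    """Read rows from stream and write transformed rows to out."""
    for line in stream:
        if line.strip():
            out.write(transform_line(line) + "\n")

# Local demo with in-memory streams; under Hive the script would call main()
# with the defaults so that it reads stdin and writes stdout.
demo_in = io.StringIO("42\tContoso\tModelX\n")
demo_out = io.StringIO()
main(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Keeping the row logic in `transform_line` makes it easy to test the script locally before uploading it to the cluster.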
    
    Use Java UDFs for Pig and Hive (1) and (2)
    
    HDInsight Tools for Visual Studio
    
    Also attempt Hive with VS
    
    Extra Links
    
    Sensor Data Analysis using Hive
    
    Try hands on partitioning here !
    
    Flight Delay Data using Hive
    
    Try hands on bucketing here !
    
    Nice presentation on Hive Optimisation
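The partitioning and bucketing notes above can be pictured with a small sketch. It is a simplified model, not Hive's actual implementation: it assumes each partition value becomes a `column=value` directory, and that the bucket of an integer key is the key modulo the number of buckets. The table name, columns, and rows are made up.

```python
from collections import defaultdict

def partition_path(table, row, partition_cols):
    """Directory a row lands in when the table is partitioned by partition_cols."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([table] + parts)

def bucket_for(value, num_buckets):
    """Bucket index for an integer key (assumes the hash of an int is the int itself)."""
    return value % num_buckets

# Hypothetical rows of a flights table
rows = [
    {"flight_id": 10, "year": 2016},
    {"flight_id": 11, "year": 2016},
    {"flight_id": 14, "year": 2017},
    {"flight_id": 18, "year": 2017},
]

# Group flight ids by (partition directory, bucket index)
layout = defaultdict(list)
for row in rows:
    key = (partition_path("flights", row, ["year"]), bucket_for(row["flight_id"], 4))
    layout[key].append(row["flight_id"])

for key in sorted(layout):
    print(key, layout[key])
```

The point of the model: a filter on `year` only has to read one directory, while a join or sample on `flight_id` can work bucket by bucket.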
Design batch ETL solutions for big data with Spark
- Share resources between Spark applications using YARN queues and preemption, select Spark executor and driver settings for optimal performance, use partitioning and bucketing to improve Spark performance, connect to external Spark data sources, incorporate custom Python and Scala code in a Spark DataSets program, identify query bottlenecks using the Spark SQL query graph
- 1. Useful Links
    
    Spark SQL with Azure HDInsight
    
    Integrating Hive and BI Tools with Spark
Operationalize Hadoop and Spark
- Create and customize a cluster by using ADF; attach storage to a cluster and run an ADF activity; choose between bring-your-own and on-demand clusters; use Apache Oozie with HDInsight; choose between Oozie and ADF; share metastore and storage accounts between a Hive cluster and a Spark cluster to enable the same table across the cluster types; select an appropriate storage type for a data pipeline, such as Blob storage, Azure Data Lake, and local Hadoop Distributed File System (HDFS)
- 1. Useful Links
    
    Hive Activity in Azure Data Factory
    
    Pig Activity in Azure Data Factory
    
    MapReduce Activity for Azure Data Factory
    
    HDInsight Activities in Azure Data Factory
    
    Spark Activities for Azure Data Factory

3) Implement Big Data Interactive Processing Solutions

Implement interactive queries for big data with Spark SQL
- Execute queries using Spark SQL, cache Spark DataFrames for iterative queries, save Spark DataFrames as Parquet files, connect BI tools to Spark clusters, optimize join types such as broadcast versus merge joins, manage Spark Thrift server and change the YARN resources allocation, identify use cases for different storage types for interactive queries
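To make the broadcast versus merge join objective concrete, here is a plain-Python sketch of the idea behind a broadcast (map-side) hash join: the small table is turned into an in-memory lookup that every worker could hold, so the large table never has to be shuffled or sorted. This models the concept only and is not Spark code; the table contents and the function name are hypothetical.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join: build a hash table from the small side, then stream the large side."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    joined = []
    for row in large_rows:
        # Rows without a match in the small table are dropped (inner join)
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical data: a large fact table and a small dimension table
sales = [
    {"store_id": 1, "amount": 10},
    {"store_id": 2, "amount": 5},
    {"store_id": 1, "amount": 7},
    {"store_id": 3, "amount": 1},
]
stores = [
    {"store_id": 1, "city": "Pune"},
    {"store_id": 2, "city": "Oslo"},
]

for row in broadcast_hash_join(sales, stores, "store_id"):
    print(row)
```

A merge join, by contrast, sorts both sides on the key and walks them together, which tends to suit two large inputs better than holding one side in memory.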
Perform exploratory data analysis by using Spark SQL
- Use Jupyter and Apache Zeppelin for visualization and developing tidy Spark DataFrames for modeling, use Spark SQL’s two-table joins to merge DataFrames and cache results, save tidied Spark DataFrames to performant format for reading and analysis (Apache Parquet), manage interactive Livy sessions and their resources
- 1. Use Livy to Submit Spark Jobs Remotely
    
    Use Zeppelin Notebooks with HDInsight Spark Clusters
    
    Scala/Java package coordinates take the form GroupId:ArtifactId:Version
    
    Use Jupyter Notebooks with HDInsight Spark Clusters
    
    Adding External Packages to Jupyter Notebooks in HDInsight Spark Clusters
    
    The %%configure magic configures the underlying Livy session to use the package you provided
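A small sketch of what the notebook cell typically contains: the helper below renders a %%configure cell body that sets `spark.jars.packages` from one or more GroupId:ArtifactId:Version coordinates. The helper `configure_magic` is hypothetical, and the package coordinate is only an example value.

```python
import json

def configure_magic(*maven_coordinates):
    """Render a %%configure cell that asks the Livy session to load the given packages."""
    for coord in maven_coordinates:
        if len(coord.split(":")) != 3:
            raise ValueError(f"expected GroupId:ArtifactId:Version, got {coord!r}")
    body = {"conf": {"spark.jars.packages": ",".join(maven_coordinates)}}
    return "%%configure\n" + json.dumps(body)

# Example coordinate, used here purely as a placeholder
print(configure_magic("com.databricks:spark-csv_2.10:1.4.0"))
```

The printed text is what would be pasted into the first cell of the notebook, before any Spark code runs.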
Implement interactive queries for big data with Interactive Hive
- Enable Hive LLAP through Hive settings, manage and configure memory allocation for Hive LLAP jobs, connect BI tools to Interactive Hive clusters
- 1. Enable Hive LLAP through Hive settings
    
    Manage and configure memory allocation for Hive LLAP jobs
    
    Through Ambari->Hive->Configs->Interactive Query
    
    Connect BI tools to Interactive Hive clusters
    
    Perform interactive querying and visualization
    
    Use Ambari Views
    
    Use HiveQL
    
    Parse CSV files with Hive
    
    CREATE TABLE TAB_NAME (COL1 COL_TYPE1, COL2 COL_TYPE2) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
    LOAD DATA INPATH 'wasbs://yourcsvfile.csv' INTO TABLE TAB_NAME;
    
    Use ORC versus Text for caching
    
    CREATE TABLE IF NOT EXISTS TAB_NAME (COL1 COL_TYPE1, COL2 COL_TYPE2) STORED AS ORC;
    
    Use internal and external tables in Hive
    
    Use Zeppelin to visualize data
Perform exploratory data analysis by using Hive
- Perform interactive querying and visualization, use Ambari Views, use HiveQL, parse CSV files with Hive, use ORC versus Text for caching, use internal and external tables in Hive, use Zeppelin to visualize data
- Useful Links
  - Use Interactive Hive in HDInsight
Perform interactive processing by using Apache Phoenix on HBase
- Use Phoenix in HDInsight; use Phoenix Grammar for queries; configure transactions, user-defined functions, and secondary indexes; identify and optimize Phoenix performance; select between Hive, Spark, and Phoenix on HBase for interactive processing; identify when to share metastore between a Hive cluster and a Spark cluster
- 1. Useful Links
    
    Use Phoenix with HBase Clusters
    
    The correct command to run Phoenix queries is /usr/hdp//phoenix/bin/sqlline.py zookeeper_host:2181:/hbase-unsecure, not the one mentioned in the tutorials.
    
    Phoenix Grammar
    
    Bulk Import in Phoenix

4) Implement Big Data Real-Time Processing Solutions

Create Spark streaming applications using DStream API
- Define DStreams and compare them to Resilient Distributed Dataset (RDDs), start and stop streaming applications, transform DStream (flatMap, reduceByKey, UpdateStateByKey), persist long-term data stores in HBase and SQL, persist Long Term Data Azure Data Lake and Azure Blob Storage, stream data from Apache Kafka or Event Hub, visualize streaming data in a PowerBI real-time dashboard
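The difference between the stateless and stateful DStream transformations listed above can be sketched in plain Python by treating a stream as a list of micro-batches. This is a conceptual model only, not the Spark API: `reduce_by_key` stands in for flatMap followed by reduceByKey on one batch, and `update_state_by_key` mimics how updateStateByKey carries running totals across batches. The sample batches are made up.

```python
from collections import Counter

def reduce_by_key(batch):
    """Per-batch word counts: flatMap lines into words, then sum counts per word."""
    words = [word for line in batch for word in line.split()]
    return dict(Counter(words))

def update_state_by_key(state, batch_counts):
    """Fold one batch's counts into the running totals kept across batches."""
    new_state = dict(state)
    for word, count in batch_counts.items():
        new_state[word] = new_state.get(word, 0) + count
    return new_state

# Each inner list is one hypothetical micro-batch of text lines
batches = [["a b a"], ["b c"], ["a"]]

state = {}
for batch in batches:
    counts = reduce_by_key(batch)               # stateless: sees only this batch
    state = update_state_by_key(state, counts)  # stateful: remembers earlier batches
    print(counts, state)
```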
Create Spark structured streaming applications
- Use DataFrames and DataSets APIs to create streaming DataFrames and Datasets; create Window Operations on Event Time; define Window Transformations for Stateful and Stateless Operations; stream Window Functions, Reduce by Key, and Window to Summarize Streaming Data; persist Long Term Data HBase and SQL; persist Long Term Data Azure Data Lake and Azure Blob Storage; stream data from Kafka or Event Hub; visualize streaming data in a PowerBI real-time dashboard
  - structured-streaming-programming-guide
  - hdinsight-apache-kafka-spark-structured-streaming
Develop big data real-time processing solutions with Apache Storm
- Create Storm clusters for real-time jobs, persist Long Term Data HBase and SQL, persist Long Term Data Azure Data Lake and Azure Blob Storage, stream data from Kafka or Event Hub, configure event windows in Storm, visualize streaming data in a PowerBI real-time dashboard, define Storm topologies and describe Storm Computation Graph Architecture, create Storm streams and conduct streaming joins, run Storm topologies in local mode for testing, configure Storm applications (Workers, Debug mode), conduct Stream groupings to broadcast tuples across components, debug and monitor Storm jobs
Build solutions that use Kafka
- Create Spark and Storm clusters in the virtual network, manage partitions, configure MirrorMaker, start and stop services through Ambari, manage topics
Build solutions that use HBase
- Identify HBase use cases in HDInsight, use HBase Shell to create updates and drop HBase tables, monitor an HBase cluster, optimize the performance of an HBase cluster, identify use cases for using Phoenix for analytics of real-time data, implement replication in HBase
- 1. Getting Started with HBase on HDInsight
    
    Adding HBase to Azure Virtual Network
    
    Configuring HBase Replication
    
    Real Time Processing with HBase
    
    To revise, do “Using Storm and HBase for sensor data” instead of the Twitter one; it will clear up the concepts of both Storm and HBase.
    
    More on HBase

Useful links:

Syllabus and enrolment

Spark Interview Questions
Spark Streaming
https://github.com/Azure/learnAnalytics-public/blob/master/HDInsight/Community-Guide-775.md
https://learnanalytics.microsoft.com/home/certifications
https://azure.microsoft.com/en-us/documentation/learning-paths/hdinsight-self-guided-hadoop-training/
https://blogs.msdn.microsoft.com/cindygross/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure/
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage#hdinsight-storage-architecture
http://www.cs.virginia.edu/~hs6ms/publishedPaper/Conference/2016/Scale-up-out-Cloud2016.pdf
