Data World


Exam 70-775: Big Data Engineering with HDInsight

Posted by Pramod Singla on March 22, 2017

Administer and Provision HDInsight Clusters

  • Deploy HDInsight clusters

    • Create a cluster in a private virtual network, create a cluster that has a custom metastore, create a domain-joined cluster, select an appropriate cluster type based on workload considerations, customize a cluster by using script actions, provision a cluster by using the Azure portal, provision a cluster by using Azure CLI tools, provision a cluster by using Azure Resource Manager (ARM) templates and PowerShell, manage managed disks, configure VNet peering

  • Deploy and secure multi-user HDInsight clusters

    • Provision users who have different roles; manage users, groups, and permissions through Apache Ambari, PowerShell, and Apache Ranger; configure Kerberos; configure service accounts; implement SSH tunneling; restrict access to data

  • Ingest data for batch and interactive processing

    • Ingest data from cloud or on-premises data; store data in Azure Data Lake; store data in Azure Blob Storage; perform routine small writes on a continuous basis using Azure CLI tools; ingest data in Apache Hive and Apache Spark by using Apache Sqoop, Azure Data Factory (ADF), AzCopy, and AdlCopy; ingest data from an on-premises Hadoop cluster

  • Configure HDInsight clusters

    • Manage metastore upgrades; view and edit Ambari configuration groups; view and change service configurations through Ambari; access logs written to Azure Table storage; enable heap dumps for Hadoop services; manage HDInsight configuration by using the HDInsight .NET SDK and PowerShell; perform cluster-level debugging; stop and start services through Ambari; manage Ambari alerts and metrics

  • Manage and debug HDInsight jobs

    • Describe YARN architecture and operation; examine YARN jobs through ResourceManager UI and review running applications; use YARN CLI to kill jobs; find logs for different types of jobs; debug Hadoop and Spark jobs; use Azure Operations Management Suite (OMS) to monitor and manage alerts, and perform predictive actions
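
As an illustration of the "examine YARN jobs" and "use YARN CLI to kill jobs" objectives above, here is a minimal sketch in Python. It assumes you have already fetched the ResourceManager REST listing (`GET /ws/v1/cluster/apps`) and parsed it as JSON; the sample application IDs, names, and users are made up, and the helper names `running_apps` and `kill_commands` are hypothetical. The sketch only builds `yarn application -kill` command lines and does not execute them.

```python
import json

# Hypothetical sample shaped like the ResourceManager REST response for
# GET /ws/v1/cluster/apps; the IDs, names, and users are invented.
SAMPLE_RESPONSE = json.loads("""
{"apps": {"app": [
  {"id": "application_1490000000000_0001", "name": "hive-etl",
   "user": "alice", "state": "RUNNING"},
  {"id": "application_1490000000000_0002", "name": "spark-ml",
   "user": "bob", "state": "FINISHED"},
  {"id": "application_1490000000000_0003", "name": "spark-stream",
   "user": "alice", "state": "RUNNING"}
]}}
""")


def running_apps(response, user=None):
    """Return the IDs of RUNNING applications, optionally for one user."""
    # "apps" can be null when the cluster has no applications to report.
    apps = (response.get("apps") or {}).get("app") or []
    return [
        app["id"]
        for app in apps
        if app["state"] == "RUNNING" and (user is None or app["user"] == user)
    ]


def kill_commands(app_ids):
    """Build (but do not run) one `yarn application -kill` argv per ID."""
    return [["yarn", "application", "-kill", app_id] for app_id in app_ids]


if __name__ == "__main__":
    for cmd in kill_commands(running_apps(SAMPLE_RESPONSE, user="alice")):
        print(" ".join(cmd))
```

Keeping command construction separate from execution makes it easy to review what would be killed before running anything on a cluster head node.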

Implement Big Data Batch Processing Solutions

  • Implement batch solutions with Hive and Apache Pig

    • Define external Hive tables; load data into a Hive table; use partitioning and bucketing to improve Hive performance; use semi-structured files such as XML and JSON with Hive; join tables with Hive using shuffle joins and broadcast joins; invoke Hive UDFs with Java and Python; design scripts with Pig; identify query bottlenecks using the Hive query graph; identify the appropriate storage format, such as Apache Parquet, ORC, Text, and JSON

  • Design batch ETL solutions for big data with Spark

    • Share resources between Spark applications using YARN queues and preemption, select Spark executor and driver settings for optimal performance, use partitioning and bucketing to improve Spark performance, connect to external Spark data sources, incorporate custom Python and Scala code in a Spark Datasets program, identify query bottlenecks using the Spark SQL query graph

  • Operationalize Hadoop and Spark

    • Create and customize a cluster by using ADF; attach storage to a cluster and run an ADF activity; choose between bring-your-own and on-demand clusters; use Apache Oozie with HDInsight; choose between Oozie and ADF; share metastore and storage accounts between a Hive cluster and a Spark cluster to enable the same table across the cluster types; select an appropriate storage type for a data pipeline, such as Blob storage, Azure Data Lake, and local Hadoop Distributed File System (HDFS)
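
The partitioning and bucketing objectives listed above for Hive and Spark can be pictured with a small pure-Python sketch. It is not Hive or Spark code: the `col=value` directory naming follows the familiar Hive-style partition layout, but the bucket hash here is CRC32 as a stand-in (Hive and Spark each use their own hash functions), and the table root, column names, and bucket count are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def partition_dir(table_root, row, partition_cols):
    """Build a Hive-style partition directory such as root/year=2017/month=3."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([table_root.rstrip("/")] + parts)


def bucket_id(key, num_buckets=NUM_BUCKETS):
    """Assign a key to a bucket: hash(key) mod num_buckets.

    CRC32 is a stand-in hash chosen because it is deterministic across runs;
    it is not the hash function Hive or Spark actually use.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def layout(rows, table_root, partition_cols, bucket_col):
    """Group rows by (partition directory, bucket number)."""
    files = defaultdict(list)
    for row in rows:
        target = (partition_dir(table_root, row, partition_cols),
                  bucket_id(row[bucket_col]))
        files[target].append(row)
    return dict(files)


if __name__ == "__main__":
    rows = [
        {"year": 2017, "month": 3, "customer": "c1"},
        {"year": 2017, "month": 3, "customer": "c1"},
        {"year": 2017, "month": 4, "customer": "c2"},
    ]
    for (directory, bucket), members in layout(
            rows, "/data/sales", ["year", "month"], "customer").items():
        print(directory, "bucket", bucket, "rows", len(members))
```

Partition columns prune whole directories when a query filters on them, while bucketing keeps equal keys together inside each partition, which is what lets joins and aggregations on the bucket column avoid some shuffling.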

Implement Big Data Interactive Processing Solutions

  • Implement interactive queries for big data with Spark SQL

  • Perform exploratory data analysis by using Spark SQL

    • Use Jupyter and Apache Zeppelin for visualization and developing tidy Spark DataFrames for modeling, use Spark SQL’s two-table joins to merge DataFrames and cache results, save tidied Spark DataFrames to a performant format for reading and analysis (Apache Parquet), manage interactive Livy sessions and their resources

  • Implement interactive queries for big data with Interactive Hive

    • Enable Hive LLAP through Hive settings, manage and configure memory allocation for Hive LLAP jobs, connect BI tools to Interactive Hive clusters

  • Perform exploratory data analysis by using Hive

    • Perform interactive querying and visualization, use Ambari Views, use HiveQL, parse CSV files with Hive, use ORC versus Text for caching, use internal and external tables in Hive, use Zeppelin to visualize data

  • Perform interactive processing by using Apache Phoenix on HBase

    • Use Phoenix in HDInsight; use Phoenix Grammar for queries; configure transactions, user-defined functions, and secondary indexes; identify and optimize Phoenix performance; select between Hive, Spark, and Phoenix on HBase for interactive processing; identify when to share metastore between a Hive cluster and a Spark cluster
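
For the "manage interactive Livy sessions and their resources" objective above, the following sketch builds the HTTP requests for the Livy REST endpoints `POST /sessions`, `POST /sessions/{id}/statements`, and `DELETE /sessions/{id}`. It only constructs `urllib.request.Request` objects and sends nothing. The base URL is a placeholder, authentication is left out, the resource values are arbitrary, and the function names are hypothetical.

```python
import json
from urllib.request import Request

# Placeholder endpoint; substitute your own cluster name. Authentication
# (for example HTTP basic auth) is deliberately omitted from this sketch.
BASE_URL = "https://CLUSTERNAME.azurehdinsight.net/livy"

# Livy may reject POST/DELETE without X-Requested-By when CSRF protection is on.
HEADERS = {"Content-Type": "application/json", "X-Requested-By": "admin"}


def _post(path, payload):
    """Build a JSON POST request against the Livy endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return Request(BASE_URL + path, data=body, headers=HEADERS, method="POST")


def create_session(kind="pyspark", executor_memory="2g", num_executors=2):
    """Request a new interactive session with explicit resource settings."""
    return _post("/sessions", {
        "kind": kind,
        "executorMemory": executor_memory,
        "numExecutors": num_executors,
    })


def run_statement(session_id, code):
    """Submit a code snippet to an existing session."""
    return _post(f"/sessions/{session_id}/statements", {"code": code})


def delete_session(session_id):
    """Release the session and the cluster resources it holds."""
    return Request(f"{BASE_URL}/sessions/{session_id}",
                   headers={"X-Requested-By": "admin"}, method="DELETE")


if __name__ == "__main__":
    for req in (create_session(), run_statement(0, "spark.range(5).count()"),
                delete_session(0)):
        print(req.get_method(), req.full_url)
```

Deleting idle sessions matters because each interactive session holds its driver and executors until it is closed or times out.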

Implement Big Data Real-Time Processing Solutions

  • Create Spark streaming applications using DStream API

  • Create Spark structured streaming applications

    • Use DataFrames and Datasets APIs to create streaming DataFrames and Datasets; create window operations on event time; define window transformations for stateful and stateless operations; use window functions, reduce by key, and window to summarize streaming data; persist long-term data in HBase and SQL; persist long-term data in Azure Data Lake and Azure Blob Storage; stream data from Kafka or Event Hubs; visualize streaming data in a Power BI real-time dashboard

  • Develop big data real-time processing solutions with Apache Storm

  • Build solutions that use Kafka

    • Create Spark and Storm clusters in the virtual network, manage partitions, configure MirrorMaker, start and stop services through Ambari, manage topics

  • Build solutions that use HBase

    • Identify HBase use cases in HDInsight; use HBase Shell to create, update, and drop HBase tables; monitor an HBase cluster; optimize the performance of an HBase cluster; identify use cases for using Phoenix for analytics of real-time data; implement replication in HBase
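
The window-on-event-time objective listed above under structured streaming can be illustrated without a cluster. This is a pure-Python sketch of tumbling-window counting keyed by event time, not Spark code; the function names and the sample events are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(event_time, width):
    """Return the start of the tumbling window that contains event_time."""
    # timedelta // timedelta yields the whole number of windows since the epoch.
    index = (event_time - EPOCH) // width
    return EPOCH + index * width


def count_by_window(events, width):
    """Count events per (window start, key), using each event's own timestamp."""
    counts = Counter()
    for event_time, key in events:
        counts[(window_start(event_time, width), key)] += 1
    return dict(counts)


if __name__ == "__main__":
    def at(minute, second):
        return datetime(2017, 3, 22, 12, minute, second, tzinfo=timezone.utc)

    # Events arrive out of order; grouping still follows event time.
    events = [(at(1, 10), "sensor-a"), (at(0, 5), "sensor-a"),
              (at(0, 55), "sensor-a")]
    for (start, key), n in sorted(
            count_by_window(events, timedelta(minutes=1)).items()):
        print(start.isoformat(), key, n)
```

Because the window is derived from the timestamp carried by each event rather than from arrival time, late or reordered events still land in the window they belong to.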


Posted in Big Data, Certifications, Hadoop, Spark