Select appropriate data ingestion technology based on specific constraints; design partitioning scheme and select mechanism for partitioning; ingest and process data from a Twitter stream; connect to stream processing entities; estimate throughput, latency needs, and job footprint; design reference data streams
Design and implement Azure Stream Analytics
Configure thresholds, use the Azure Machine Learning UDF, create alerts based on conditions, use a machine learning model for scoring, train a model for continuous learning, use common stream processing scenarios
Implement and manage the streaming pipeline
Stream data to a live dashboard, archive data as a storage artifact for batch processing, enable consistency between stream processing and batch processing logic
Query real-time data by using the Azure Stream Analytics query language
Use built-in functions, use data types, identify query language elements, control query windowing by using Time Management, guarantee event delivery
Create an Azure Data Lake Store (ADLS) account, copy data to ADLS, secure data within ADLS by using access control, leverage end-user or service-to-service authentication appropriately, tune the performance of ADLS, access diagnostic logs
Manage Azure Data Lake Analytics
Create an Azure Data Lake Analytics (ADLA) account, manage users, manage data sources, manage, monitor, and troubleshoot jobs, access diagnostic logs, optimize jobs by using the vertex view, identify historical job information
Extract and transform data by using U-SQL
Schematize data on read at scale; generate outputter files; use the U-SQL data types, use C# and U-SQL expression language; identify major differences between T-SQL and U-SQL; perform JOINS, PIVOT, UNPIVOT, CROSS APPLY, and Windowing functions in U-SQL; share data and code through U-SQL catalog; define benefits and use of structured data in U-SQL; manage and secure the Catalog
Extend U-SQL programmability
Use user-defined functions, aggregators, and operators, scale out user-defined operators, call Python, R, and Cognitive capabilities, use U-SQL user-defined types, perform federated queries, share data and code across ADLA and ADLS
Integrate Azure Data Lake Analytics with other services
Integrate with Azure Data Factory, Azure HDInsight, Azure Data Catalog, and Azure Event Hubs, ingest data from Azure SQL Data Warehouse
Choose the optimal type of distribution column to optimize workflows, select a table geometry, limit data skew and process skew through the appropriate selection of distributed columns, design columnstore indexes, identify when to scale compute nodes, calculate the number of distributions for a given workload
Query data in Azure SQL Data Warehouse
Implement query labels, aggregate functions, create and manage statistics in distributed tables, monitor user queries to identify performance issues, change a user resource class
Integrate Azure SQL Data Warehouse with other services
Ingest data into Azure SQL Data Warehouse by using AZCopy, Polybase, Bulk Copy Program (BCP), Azure Data Factory, SQL Server Integration Services (SSIS), Create-Table-As-Select (CTAS), and Create-External-Table-As-Select (CETAS); export data from Azure SQL Data Warehouse; provide connection information to access Azure SQL Data Warehouse from Azure Machine Learning; leverage Polybase to access a different distributed store; migrate data to Azure SQL Data Warehouse; select the appropriate ingestion method based on business needs
Implement availability for the slice, create dataset policies, configure the appropriate linked service based on the activity and the dataset
Move, transform, and analyze data by using Azure Data Factory activities
Copy data between on-premises and the cloud, create different activity types, extend the data factory by using custom processing steps, move data to and from Azure SQL Data Warehouse
Orchestrate data processing by using Azure Data Factory pipelines
Identify data dependencies and chain multiple activities, model schedules based on data dependencies, provision and run data pipelines, design a data flow
Monitor and manage Azure Data Factory
Identify failures and root causes, create alerts for specified conditions, perform a redeploy, use the Microsoft Azure Portal monitoring tool
Provision Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics
Provision Azure SQL Data Warehouse, Azure Data Lake, and Azure Data Factory, implement Azure Stream Analytics
Implement authentication, authorization, and auditing
Integrate services with Azure Active Directory (Azure AD), use the local security model in Azure SQL Data Warehouse, configure firewalls, implement auditing, integrate services with Azure Data Factory
Manage data recovery for Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics
Back up and recover services, plan and implement geo-redundancy for Azure Storage, migrate from an on-premises data warehouse to Azure SQL Data Warehouse
Monitor Azure SQL Data Warehouse, Azure Data Lake, and Azure Stream Analytics
Manage concurrency, manage elastic scale for Azure SQL Data Warehouse, monitor workloads by using Dynamic Management Views (DMVs) for Azure SQL Data Warehouse, troubleshoot Azure Data Lake performance by using the Vertex Execution View
Design and implement storage solutions for big data implementations
Optimize storage to meet performance needs, select appropriate storage types based on business requirements, use AZCopy, Storage Explorer and Redgate Azure Explorer to migrate data, design cloud solutions that integrate with on-premises data
Create a cluster in a private virtual network, create a cluster that has a custom metastore, create a domain-joined cluster, select an appropriate cluster type based on workload considerations, customize a cluster by using script actions, provision a cluster by using Portal, provision a cluster by using Azure CLI tools, provision a cluster by using Azure Resource Manager (ARM) templates and PowerShell, manage managed disks, configure vNet peering
Provision users who have different roles; manage users, groups, and permissions through Apache Ambari, PowerShell, and Apache Ranger; configure Kerberos; configure service accounts; implement SSH tunneling; restrict access to data
Ingest data from cloud or on-premises data sources; store data in Azure Data Lake; store data in Azure Blob Storage; perform routine small writes on a continuous basis using Azure CLI tools; ingest data in Apache Hive and Apache Spark by using Apache Sqoop, Application Development Framework (ADF), AzCopy, and AdlCopy; ingest data from an on-premises Hadoop cluster
A very informative link; definitely read the material. I have only skimmed it, but I think it is a very good resource, even for future use.
Key terms: Mahout, an ML library for collaborative filtering; Oozie, for building multi-step workflows
‘Adaptive Server Connection Failed’ is a FreeTDS error. From the command line, it appears when you create a new database; bypass it by creating the SQL database on the fly while creating the cluster.
The file-upload script doesn’t work; I used Storage Explorer instead to do the job.
Now obsolete, as HDInsight has moved from Windows to Linux.
Follow this tutorial instead
Manage metastore upgrades; view and edit Ambari configuration groups; view and change service configurations through Ambari; access logs written to Azure Table storage; enable heap dumps for Hadoop services; manage HDInsight configuration, use HDInsight .NET SDK, and PowerShell; perform cluster-level debugging; stop and start services through Ambari; manage Ambari alerts and metrics
Describe YARN architecture and operation; examine YARN jobs through ResourceManager UI and review running applications; use YARN CLI to kill jobs; find logs for different types of jobs; debug Hadoop and Spark jobs; use Azure Operations Management Suite (OMS) to monitor and manage alerts, and perform predictive actions
Implement batch solutions with Hive and Apache Pig
Define external Hive tables; load data into a Hive table; use partitioning and bucketing to improve Hive performance; use semi-structured files such as XML and JSON with Hive; join tables with Hive using shuffle joins and broadcast joins; invoke Hive UDFs with Java and Python; design scripts with Pig; identify query bottlenecks using the Hive query graph; identify the appropriate storage format, such as Apache Parquet, ORC, Text, and JSON
```java
import org.apache.hadoop.hive.ql.exec.UDF;

public class ExampleUDF extends UDF {
    // Accept a string input
    public String evaluate(String input) {
        // If the value is null, return a null
        if (input == null)
            return null;
        // Lowercase the input string and return it
        return input.toLowerCase();
    }
}
```
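The objectives also mention invoking Hive UDFs from Python. A minimal sketch of the streaming approach via Hive’s TRANSFORM clause; the script name and single-column schema are my own illustration, not from the tutorial:

```python
#!/usr/bin/env python
# lowercase.py: a Hive streaming "UDF". Hive's TRANSFORM clause pipes rows
# to stdin as tab-separated text and reads transformed rows from stdout.
import sys

for line in sys.stdin:
    # Strip the trailing newline and split the tab-separated columns.
    fields = line.rstrip("\n").split("\t")
    # Lowercase every column and write the row back out.
    print("\t".join(f.lower() for f in fields))
```

From HQL this would be wired up with something like ADD FILE wasbs:///scripts/lowercase.py; followed by SELECT TRANSFORM (col) USING 'python lowercase.py' AS (col_lower string) FROM sometable; (paths and names illustrative).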
Also has the list of HDInsight commands for PowerShell (important for the exam). Get-Credential gets the username and password in a PowerShell script; Here-Strings are used for complex HQL queries. PS script used for the tutorial.
Hive query used and Python script used; Pig script used and Python script used.
New concepts: Jython; Pig runs on a native JVM; doing hdfs dfs -put file /file actually uploads it to wasbs:///. The PowerShell and C Python parts of the tutorial are left; they seem trivial and should be covered while revising.
Design batch ETL solutions for big data with Spark
Share resources between Spark applications using YARN queues and preemption, select Spark executor and driver settings for optimal performance, use partitioning and bucketing to improve Spark performance, connect to external Spark data sources, incorporate custom Python and Scala code in a Spark DataSets program, identify query bottlenecks using the Spark SQL query graph
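A minimal PySpark sketch of the partitioning and bucketing point above; the paths, table, and column names are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-etl").getOrCreate()

# Hypothetical input; adjust the path and schema to the real dataset.
events = spark.read.parquet("wasbs:///data/events")

# Partition on a low-cardinality column so queries filtering on it can
# prune whole directories; bucket on the join key so repeated joins on
# that key avoid a shuffle. bucketBy requires saveAsTable, not a raw path.
(events.write
    .partitionBy("event_date")
    .bucketBy(32, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("events_bucketed"))
```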
Create and customize a cluster by using ADF; attach storage to a cluster and run an ADF activity; choose between bring-your-own and on-demand clusters; use Apache Oozie with HDInsight; choose between Oozie and ADF; share metastore and storage accounts between a Hive cluster and a Spark cluster to enable the same table across the cluster types; select an appropriate storage type for a data pipeline, such as Blob storage, Azure Data Lake, and local Hadoop Distributed File System (HDFS)
Implement interactive queries for big data with Spark SQL
Execute queries using Spark SQL, cache Spark DataFrames for iterative queries, save Spark DataFrames as Parquet files, connect BI tools to Spark clusters, optimize join types such as broadcast versus merge joins, manage Spark Thrift server and change the YARN resources allocation, identify use cases for different storage types for interactive queries
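A short PySpark sketch of the caching, Parquet, and broadcast-join items; the paths and join key are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("interactive-sql").getOrCreate()

sales = spark.read.parquet("wasbs:///data/sales")        # large fact table
products = spark.read.parquet("wasbs:///data/products")  # small dimension

# Cache a DataFrame that several interactive queries will reuse.
sales.cache()

# Broadcast join: ship the small dimension table to every executor
# instead of shuffling the large fact table across the cluster.
enriched = sales.join(broadcast(products), "product_id")

# Save the result as Parquet for fast subsequent reads.
enriched.write.mode("overwrite").parquet("wasbs:///data/sales_enriched")
```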
Perform exploratory data analysis by using Spark SQL
Use Jupyter and Apache Zeppelin for visualization and developing tidy Spark DataFrames for modeling, use Spark SQL’s two-table joins to merge DataFrames and cache results, save tidied Spark DataFrames to performant format for reading and analysis (Apache Parquet), manage interactive Livy sessions and their resources
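A sketch of tidying a raw DataFrame for modeling in a notebook session; the source path and columns are made-up examples:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("eda").getOrCreate()

# Hypothetical raw logs; the tidying steps are the point here.
raw = spark.read.json("wasbs:///data/raw_logs")

# Tidy: keep only the modeling columns, fix their types, drop bad rows.
tidy = (raw
    .select(col("user_id").cast("long"),
            col("ts").cast("timestamp"),
            col("duration").cast("double"))
    .dropna())

tidy.cache()  # keep it hot for iterative notebook queries

# Save to a performant columnar format for later reading and analysis.
tidy.write.mode("overwrite").parquet("wasbs:///data/logs_tidy")
```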
Perform interactive querying and visualization, use Ambari Views, use HiveQL, parse CSV files with Hive, use ORC versus Text for caching, use internal and external tables in Hive, use Zeppelin to visualize data
Perform interactive processing by using Apache Phoenix on HBase
Use Phoenix in HDInsight; use Phoenix Grammar for queries; configure transactions, user-defined functions, and secondary indexes; identify and optimize Phoenix performance; select between Hive, Spark, and Phoenix on HBase for interactive processing; identify when to share metastore between a Hive cluster and a Spark cluster
The correct command to run Phoenix queries is /usr/hdp//phoenix/bin/sqlline.py zookeeper_host:2181:/hbase-unsecure, not the one mentioned in the tutorials.
Create Spark streaming applications using DStream API
Define DStreams and compare them to Resilient Distributed Datasets (RDDs), start and stop streaming applications, transform DStreams (flatMap, reduceByKey, updateStateByKey), persist long-term data to HBase and SQL, persist long-term data to Azure Data Lake and Azure Blob Storage, stream data from Apache Kafka or Event Hub, visualize streaming data in a Power BI real-time dashboard
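A minimal DStream word count showing flatMap, reduceByKey, and updateStateByKey together; the socket source is a stand-in, since Kafka and Event Hubs use their own receiver utilities:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-wordcount")
ssc = StreamingContext(sc, 10)                  # 10-second micro-batches
ssc.checkpoint("wasbs:///checkpoints/dstream")  # needed by updateStateByKey

# Stand-in source; Kafka/Event Hubs would use their own utilities.
lines = ssc.socketTextStream("localhost", 9999)

def update_total(new_values, running):
    # Fold this batch's counts into the running total Spark keeps for us.
    return sum(new_values) + (running or 0)

counts = (lines.flatMap(lambda line: line.split())  # words per line
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)     # per-batch counts
               .updateStateByKey(update_total))     # stateful running totals

counts.pprint()
ssc.start()
ssc.awaitTermination()
```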
Use the DataFrames and Datasets APIs to create streaming DataFrames and Datasets; create window operations on event time; define window transformations for stateful and stateless operations; use window functions, reduceByKey, and windows to summarize streaming data; persist long-term data to HBase and SQL; persist long-term data to Azure Data Lake and Azure Blob Storage; stream data from Kafka or Event Hub; visualize streaming data in a Power BI real-time dashboard
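And the Structured Streaming counterpart of an event-time window; the Kafka broker, topic, and watermark settings are placeholder assumptions (the Kafka source also needs the spark-sql-kafka package on the cluster):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("structured-streaming").getOrCreate()

# Placeholder Kafka source; broker address and topic are assumptions.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS word", "timestamp"))

# Event-time window: count words per 5-minute window sliding every minute,
# discarding state for events arriving more than 10 minutes late.
windowed = (events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes", "1 minute"), col("word"))
    .count())

query = (windowed.writeStream
    .outputMode("update")
    .format("console")
    .start())
query.awaitTermination()
```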
Develop big data real-time processing solutions with Apache Storm
Create Storm clusters for real-time jobs, persist long-term data to HBase and SQL, persist long-term data to Azure Data Lake and Azure Blob Storage, stream data from Kafka or Event Hub, configure event windows in Storm, visualize streaming data in a Power BI real-time dashboard, define Storm topologies and describe the Storm computation graph architecture, create Storm streams and conduct streaming joins, run Storm topologies in local mode for testing, configure Storm applications (workers, debug mode), conduct stream groupings to broadcast tuples across components, debug and monitor Storm jobs
Create Spark and Storm clusters in the virtual network, manage partitions, configure MirrorMaker, start and stop services through Ambari, manage topics
Identify HBase use cases in HDInsight, use HBase Shell to create, update, and drop HBase tables, monitor an HBase cluster, optimize the performance of an HBase cluster, identify use cases for using Phoenix for analytics of real-time data, implement replication in HBase
ML model: training a model involves providing an ML algorithm (that is, the learning algorithm) with training data to learn from. The term ML model refers to the model artifact that is created by the training process.
Data leakage: the dictionary meaning of leakage is “deliberate disclosure of confidential information”. In ML, data leakage means information leaking into your training process that should not be available, which can lead to over-fitting. For example:
Including the label (or a proxy for it) as a feature in model training
Including test data in the training data
Including information from data samples outside the scope of the model’s intended use
Features: Features are the variables found in the given problem set that can strongly/sufficiently help us build an accurate predictive model.
Data label vs feature: a feature is input; a label is output.
Cross validation: a mechanism for estimating how well a model will generalize to new data by testing the model against one or more non-overlapping data subsets withheld from the training set.
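A quick scikit-learn illustration of the idea (the library and dataset choice are mine, not from the notes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on four folds, score on the withheld
# fold, and rotate so every fold is held out exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```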
Over-fitting vs under-fitting vs ideal fit of a model: an over-fit model captures noise in the training data, an under-fit model misses the underlying pattern, and an ideally fitted model generalizes well to unseen data.
Variance vs bias: error(X) = noise(X) + bias(X) + variance(X).
bias(X): learning the wrong things; systematically off from the true values; under-fitting. variance(X): learning random things that vary with the training sample; over-fitting.
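For squared error the decomposition can be written out explicitly (standard result, stated here for reference):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\sigma^2}_{\text{noise}} + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}$$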
False positive vs false negative: a false positive is an outcome where the model incorrectly predicts the positive class; a false negative is an outcome where the model incorrectly predicts the negative class.
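A tiny check of these definitions with scikit-learn (the labels are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]   # actual classes
y_pred = [1, 1, 0, 1, 0, 0]   # model predictions

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp, fn)  # 1 false positive (predicted 1, actual 0); 1 false negative
```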
Model parameter vs model hyper-parameter: a model parameter is a configuration variable that is internal to the model and whose value can be estimated from the data (for example, the weights of a linear model), whereas a model hyper-parameter is a configuration that is external to the model and whose value is usually set by the data scientist (for example, the learning rate).