Archive for the ‘Big Data’ Category
Machine Learning: Introduction To ML in Azure Databricks
Posted by Pramod Singla on September 23, 2018
Posted in Cloud, Databricks, Machine Learning, Spark | Tagged: Azure, Big data, Databricks, H2O.ai, Machine Learning, MLflow, Sparkling water | Leave a Comment »
Exam 70-776: Perform Big Data Engineering on Microsoft Cloud Services
Posted by Pramod Singla on January 4, 2018
Design and Implement Complex Event Processing By Using Azure Stream Analytics (15-20%)
- Ingest data for real-time processing
- What is Stream Analytics?
- Get started with Azure Stream Analytics to process data from IoT devices
- Choosing a streaming analytics platform: comparing Apache Storm and Azure Stream Analytics
- Data connection: Learn about data stream inputs from events to Stream Analytics
- Event Processing Ordering Design Choices for Azure Stream Analytics
- Stream Analytics outputs: Options for storage, analysis
- Real-time Twitter sentiment analysis in Azure Stream Analytics
- Scale Azure Stream Analytics jobs to increase stream data processing throughput
- Optimize your job to use Streaming Units efficiently
- Using reference data or lookup tables in a Stream Analytics input stream
- Design and implement Azure Stream Analytics
- Machine Learning integration in Stream Analytics
- Streaming Events to AzureML Through Azure Stream Analytics
- Set up alerts for Azure Stream Analytics jobs
- Processing Configurable Threshold Based Rules in Azure Stream Analytics
- Create metric alerts in Azure Monitor for Azure services – Azure portal
- Performing sentiment analysis by using Azure Stream Analytics and Azure Machine Learning
- Machine Learning-based anomaly detection in Azure Stream Analytics
- Implement and manage the streaming pipeline
- Query real-time data by using the Azure Stream Analytics query language
- Stream Analytics Query Language Reference
- Query examples for common Stream Analytics usage patterns
- Deep Dive: Azure Stream Analytics Query Language
- Built-in Functions (Azure Stream Analytics)
- Data Types (Azure Stream Analytics)
- TIMESTAMP BY (Azure Stream Analytics)
- Event Delivery Guarantees (Azure Stream Analytics)
- How to achieve exactly-once delivery for SQL output
Design and Implement Analytics by Using Azure Data Lake (25-30%)
- Ingest data into Azure Data Lake Store
- Introduction to Azure Data Lake Store
- Get started with Azure Data Lake Store using the Azure portal
- Get started with Azure Data Lake Store using Azure PowerShell
- Get started with Azure Data Lake Store using .NET SDK
- Copy data to and from Data Lake Store by using Data Factory
- Copy data from Azure Storage Blobs to Data Lake Store
- Use Distcp to copy data between Azure Storage Blobs and Data Lake Store
- Copy data between Data Lake Store and Azure SQL database using Sqoop
- Use the Azure Import/Export service for offline copy of data to Data Lake Store
- Azure Data Lake – Security Essentials
- Securing Azure Data Lake Store
- The Intelligent Data Lake
- Tuning Azure Data Lake Store for performance
- Accessing diagnostic logs for Azure Data Lake Store
- Manage Azure Data Lake Analytics
- Overview of Microsoft Azure Data Lake Analytics
- Get started with Azure Data Lake Analytics using Azure portal
- Manage Azure Data Lake Analytics by using the Azure portal
- Manage Azure Data Lake Analytics using Azure Command-line Interface (CLI)
- Manage Azure Data Lake Analytics using Azure PowerShell
- Manage Azure Data Lake Analytics using Python
- Azure Data Lake Analytics Quota Limits
- Azure Data Lake Developer Tools
- Azure Data Lakes
- Troubleshoot Azure Data Lake Analytics jobs using Azure Portal
- Use Job Browser and Job View for Azure Data lake Analytics jobs
- Use the Vertex Execution View in Data Lake Tools for Visual Studio
- Extract and transform data by using U-SQL
- U-SQL programmability guide
- Stairway to U-SQL
- Data Types and Literals (U-SQL)
- Develop U-SQL user-defined operators (UDOs)
- Extending U-SQL Expressions with User-Code
- Introducing U-SQL – A Language that makes Big Data Processing Easy
- Stairway to U-SQL Level 8: Joining Tables in U-SQL
- Stairway to U-SQL Level 10: Table-Valued Functions and UDTs
- Stairway to U-SQL Level 16: The Azure Data Lake Catalog
- Get started with the U-SQL Catalog
- U-SQL Assemblies
- Extend U-SQL programmability
- Writing and Using Custom Code in U-SQL – User-Defined Functions
- Tutorial: Get started with extending U-SQL with Python
- Tutorial: Get started with extending U-SQL with R
- Tutorial: Get started with the Cognitive capabilities of U-SQL
- U-SQL Federated Distributed Queries (SQLBits 2016)
- Setup Azure Data Lake Analytics federated U-SQL queries to Azure SQL Database
- Integrate Azure Data Lake Analytics with other services
- Transform data by running U-SQL scripts on Azure Data Lake Analytics
- Azure Data Lake & Azure HDInsight Blog
- Creating big data pipelines using Azure Data Lake and Azure Data Factory
- Azure Data Lake now integrated with Azure Data Catalog
- Azure Data Lake and Azure Data Warehouse: Applying Modern Practices to Your App
- Directly store streaming data into Azure Data Lake with Azure Event Hubs Capture Provider
- Query Event Hubs Archive with Azure Data Lake Analytics and U-SQL
Design and Implement Azure SQL Data Warehouse Solutions (15-20%)
- Design tables in Azure SQL Data Warehouse
- Azure DWH part 1: Common questions about Azure SQL Data Warehouse
- Distributing tables in SQL Data Warehouse
- Overview of tables in SQL Data Warehouse
- Distributed data and distributed tables for Massively Parallel Processing (MPP)
- Clustered Columnstore Tables are the New Default in Azure SQL Data Warehouse
- Columnstore indexes – overview
- Query data in Azure SQL Data Warehouse
- Use labels to instrument queries in SQL Data Warehouse
- Aggregate Functions (Transact-SQL)
- Managing statistics on tables in SQL Data Warehouse
- Monitor user queries in Azure SQL Data Warehouse
- Monitor your workload using DMVs
- Maximizing rowgroup quality for columnstore
- Concurrency and workload management in SQL Data Warehouse
- Integrate Azure SQL Data Warehouse with other services
- Azure DWH part 15: PolyBase and Azure Data Lake
- Azure DWH part 14: PolyBase to access to Non relational data
- Azure DWH part 4: How to import data to Azure DWH using SSIS
- Load data into Azure SQL Data Warehouse
- Azure Machine Learning – Your first experiment
- Create Table As Select (CTAS) in SQL Data Warehouse
- CREATE EXTERNAL TABLE AS SELECT (Transact-SQL)
- Copy data to and from Azure SQL Data Warehouse using Azure Data Factory
- Migrate Your Data
Design and Implement Cloud-Based Integration by using Azure Data Factory (15-20%)
- Implement datasets and linked services.
- Move, transform, and analyze data by using Azure Data Factory activities
- Tutorial: Copy data from Blob Storage to SQL Database using Data Factory
- Copy data to and from an on-premises file system by using Azure Data Factory
- Pipelines and Activities in Azure Data Factory
- Process large-scale datasets using Data Factory and Batch
- Use Azure Data Factory with SQL Data Warehouse
- Orchestrate data processing by using Azure Data Factory pipelines
- Monitor and manage Azure Data Factory
- Monitor and manage Azure Data Factory pipelines by using the Azure portal and PowerShell
- Monitor and manage Azure Data Factory pipelines by using the Monitoring and Management app
- Troubleshoot Data Factory issues
- Troubleshoot issues with using Data Management Gateway
- Azure Data Factory – Frequently Asked Questions
Manage and Maintain Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics (20-25%)
- Provision Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics
- Implement authentication, authorization, and auditing
- Use Azure Active Directory Authentication for authentication with SQL Database or SQL Data Warehouse
- Azure SQL Database server-level and database-level firewall rules
- Working with Azure Active Directory and Azure SQL Database
- What is Azure Active Directory?
- Secure a database in SQL Data Warehouse
- Auditing in Azure SQL Data Warehouse
- Debug Stream Analytics jobs using service and operation logs
- Troubleshooting guide for Azure Stream Analytics
- Azure Logging and Auditing
- Security in Azure Data Lake Store
- Manage data recovery for Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics
- Monitor Azure SQL Data Warehouse, Azure Data Lake, and Azure Stream Analytics
- Design and implement storage solutions for big data implementations.
Useful Links
https://www.microsoft.com/en-us/learning/exam-70-776.aspx
Posted in Big Data, Certifications, Machine Learning | Tagged: ADLA, Azure data lake, azure data lake analytics, Azure Machine Learning, Big data, Big Data Certification, data factory | 1 Comment »
Exam 70-776: Perform Big Data Engineering on Microsoft Cloud Services
Posted by Pramod Singla on October 6, 2017
Design and Implement Complex Event Processing By Using Azure Stream Analytics (15-20%)
- Ingest data for real-time processing
- Select appropriate data ingestion technology based on specific constraints; design partitioning scheme and select mechanism for partitioning; ingest and process data from a Twitter stream; connect to stream processing entities; estimate throughput, latency needs, and job footprint; design reference data streams
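Most of the real-time ingestion paths above start with Azure Event Hubs (or IoT Hub) feeding Stream Analytics or Spark. A minimal sketch of sending telemetry with the azure-eventhub Python SDK (v5-style API); the connection string, hub name, and payload fields are placeholder assumptions, not part of the exam reference:
# A minimal sketch: push sample telemetry into Azure Event Hubs for downstream
# Stream Analytics / Spark processing. Connection string and hub name are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."
EVENTHUB_NAME = "telemetry"

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONNECTION_STR, eventhub_name=EVENTHUB_NAME)

# Batch up a few sample readings and send them in one call.
batch = producer.create_batch()
for reading in [{"deviceId": "dev-01", "temperature": 22.5},
                {"deviceId": "dev-02", "temperature": 27.1}]:
    batch.add(EventData(json.dumps(reading)))

producer.send_batch(batch)
producer.close()
A Stream Analytics job (or a Spark/Storm consumer) can then read from the same hub and apply the windowing and reference-data patterns covered in the rest of this section.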
- Design and implement Azure Stream Analytics
- Configure thresholds, use the Azure Machine Learning UDF, create alerts based on conditions, use a machine learning model for scoring, train a model for continuous learning, use common stream processing scenarios
- Implement and manage the streaming pipeline
- Stream data to a live dashboard, archive data as a storage artifact for batch processing, enable consistency between stream processing and batch processing logic
- Query real-time data by using the Azure Stream Analytics query language
- Use built-in functions, use data types, identify query language elements, control query windowing by using Time Management, guarantee event delivery
Design and Implement Analytics by Using Azure Data Lake (25-30%)
- Ingest data into Azure Data Lake Store
- Create an Azure Data Lake Store (ADLS) account, copy data to ADLS, secure data within ADLS by using access control, leverage end-user or service-to-service authentication appropriately, tune the performance of ADLS, access diagnostic logs
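For the ADLS ingestion and service-to-service authentication items above, a minimal Python sketch using the azure-datalake-store package (ADLS Gen1); the tenant/client IDs, secret, store name, and paths are placeholder assumptions:
# A minimal sketch: service-to-service auth and a multi-threaded upload to ADLS Gen1.
from azure.datalake.store import core, lib, multithread

# Service-to-service authentication with an Azure AD application (client credentials).
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<app-client-id>",
                 client_secret="<app-client-secret>")

adls = core.AzureDLFileSystem(token, store_name="<adls-account-name>")

# Multi-threaded upload of a local file into the store.
multithread.ADLUploader(adls,
                        lpath="data/sales_2018.csv",
                        rpath="/raw/sales/sales_2018.csv",
                        nthreads=16,
                        overwrite=True)

print(adls.ls("/raw/sales"))  # list the target folder to confirm the upload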
- Manage Azure Data Lake Analytics
- Create an Azure Data Lake Analytics (ADLA) account, manage users, manage data sources, manage, monitor, and troubleshoot jobs, access diagnostic logs, optimize jobs by using the vertex view, identify historical job information
- Extract and transform data by using U-SQL
- Schematize data on read at scale; generate outputter files; use the U-SQL data types, use C# and U-SQL expression language; identify major differences between T-SQL and U-SQL; perform JOINS, PIVOT, UNPIVOT, CROSS APPLY, and Windowing functions in U-SQL; share data and code through U-SQL catalog; define benefits and use of structured data in U-SQL; manage and secure the Catalog
- Extend U-SQL programmability
- Use user-defined functions, aggregators, and operators, scale out user-defined operators, call Python, R, and Cognitive capabilities, use U-SQL user-defined types, perform federated queries, share data and code across ADLA and ADLS
- Integrate Azure Data Lake Analytics with other services
- Integrate with Azure Data Factory, Azure HDInsight, Azure Data Catalog, and Azure Event Hubs, ingest data from Azure SQL Data Warehouse
Design and Implement Azure SQL Data Warehouse Solutions (15-20%)
- Design tables in Azure SQL Data Warehouse
- Choose the optimal type of distribution column to optimize workflows, select a table geometry, limit data skew and process skew through the appropriate selection of distributed columns, design columnstore indexes, identify when to scale compute nodes, calculate the number of distributions for a given workload
- Query data in Azure SQL Data Warehouse
- Implement query labels, aggregate functions, create and manage statistics in distributed tables, monitor user queries to identify performance issues, change a user resource class
- Integrate Azure SQL Data Warehouse with other services
- Ingest data into Azure SQL Data Warehouse by using AZCopy, Polybase, Bulk Copy Program (BCP), Azure Data Factory, SQL Server Integration Services (SSIS), Create-Table-As-Select (CTAS), and Create-External-Table-As-Select (CETAS); export data from Azure SQL Data Warehouse; provide connection information to access Azure SQL Data Warehouse from Azure Machine Learning; leverage Polybase to access a different distributed store; migrate data to Azure SQL Data Warehouse; select the appropriate ingestion method based on business needs
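As a worked example of the CTAS ingestion pattern above, here is a minimal sketch that connects to Azure SQL Data Warehouse with pyodbc and runs a CTAS statement; the server, credentials, and table/column names are placeholder assumptions:
# A minimal sketch: run a CTAS against Azure SQL Data Warehouse from Python via pyodbc.
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server-name>.database.windows.net;"
    "DATABASE=<dw-database>;"
    "UID=<sql-user>;PWD=<password>"
)

# CTAS creates a new, fully distributed table from the result of a SELECT.
ctas = """
CREATE TABLE dbo.FactSales_New
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT CustomerKey, ProductKey, SalesAmount
FROM dbo.StageSales;
"""

conn = pyodbc.connect(conn_str, autocommit=True)
cur = conn.cursor()
cur.execute(ctas)      # blocks until the CTAS completes
print("CTAS completed")
cur.close()
conn.close()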
Design and Implement Cloud-Based Integration by using Azure Data Factory (15-20%)
- Implement datasets and linked services
- Implement availability for the slice, create dataset policies, configure the appropriate linked service based on the activity and the dataset
- Move, transform, and analyze data by using Azure Data Factory activities
- Copy data between on-premises and the cloud, create different activity types, extend the data factory by using custom processing steps, move data to and from Azure SQL Data Warehouse
- Orchestrate data processing by using Azure Data Factory pipelines
- Identify data dependencies and chain multiple activities, model schedules based on data dependencies, provision and run data pipelines, design a data flow
- Monitor and manage Azure Data Factory
- Identify failures and root causes, create alerts for specified conditions, perform a redeploy, use the Microsoft Azure Portal monitoring tool
- Provision Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics
- Provision Azure SQL Data Warehouse, Azure Data Lake, and Azure Data Factory, implement Azure Stream Analytics
- Implement authentication, authorization, and auditing
- Integrate services with Azure Active Directory (Azure AD), use the local security model in Azure SQL Data Warehouse, configure firewalls, implement auditing, integrate services with Azure Data Factory
- Manage data recovery for Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics
- Backup and recover services, plan and implement geo-redundancy for Azure Storage, migrate from an on-premises data warehouse to Azure SQL Data Warehouse
- Monitor Azure SQL Data Warehouse, Azure Data Lake, and Azure Stream Analytics
- Manage concurrency, manage elastic scale for Azure SQL Data Warehouse, monitor workloads by using Dynamic Management Views (DMVs) for Azure SQL Data Warehouse, troubleshoot Azure Data Lake performance by using the Vertex Execution View
- Design and implement storage solutions for big data implementations
- Optimize storage to meet performance needs, select appropriate storage types based on business requirements, use AZCopy, Storage Explorer and Redgate Azure Explorer to migrate data, design cloud solutions that integrate with on-premises data
Useful Links:
https://www.microsoft.com/en-us/learning/exam-70-776.aspx
Posted in Azure, Big Data, Certifications, Hadoop | Tagged: Azure data lake, Azure Hdinsight, Big data, Exam 70-776, Perform Big Data Engineering on Microsoft Cloud Services | Leave a Comment »
Exam 70-775: Big Data Engineering with HDInsight
Posted by Pramod Singla on March 22, 2017
1) Administer and Provision HDInsight Clusters
- Deploy HDInsight clusters
- Create a cluster in a private virtual network, create a cluster that has a custom metastore, create a domain-joined cluster, select an appropriate cluster type based on workload considerations, customize a cluster by using script actions, provision a cluster by using Portal, provision a cluster by using Azure CLI tools, provision a cluster by using Azure Resource Manager (ARM) templates and PowerShell, manage managed disks, configure vNet peering
- Connecting HDInsight Clusters with Virtual Networks
Some key terms: Forced Tunneling, Software and Hardware VPN, Recursive Resolver
- Using the Azure Portal to Create Customized HDInsight Clusters
- Configure Domain-joined HDInsight clusters
The diagram is very informative and conveys the idea of the tutorial.
- Customize Linux-based HDInsight clusters using Script Action
The PowerShell script needed an extra -ResourceManager parameter to work; modified script.
Some key terms: ZooKeeper, Persisted and Ad-Hoc Scripts. See also: https://community.hortonworks.com/articles/797/hdinsight-deployment-best-practices.html
- Deploy and secure multi-user HDInsight clusters
- Provision users who have different roles; manage users, groups, and permissions through Apache Ambari, PowerShell, and Apache Ranger; configure Kerberos; configure service accounts; implement SSH tunneling; restrict access to data
- Configure Domain-Joined HDInsight Clusters
You must be an owner of your subscription, and the subscription must be able to create premium clusters, to proceed.
- Manage Domain-joined HDInsight clusters
- Configure Hive policies in Domain-joined HDInsight
- Configure Domain-joined HDInsight clusters (Preview) using Azure PowerShell
- Video – Configure AAD, and create HDInsight cluster
- Powershell Script for Domain-Joining HDInsight clusters
- Video – Ignite 2016: Secure your Enterprise Hadoop environments on Azure
- Ingest data for batch and interactive processing
- Ingest data from cloud or on-premises data; store data in Azure Data Lake; store data in Azure Blob Storage; perform routine small writes on a continuous basis using Azure CLI tools; ingest data in Apache Hive and Apache Spark by using Apache Sqoop, Application Development Framework (ADF), AzCopy, and AdlCopy; ingest data from an on-premises Hadoop cluster
- Collecting and loading data into HDInsight
A very informative link and definitely worth reading. I have only skimmed it, but it is a very good resource, even for future use.
Key terms: Mahout ML lib for collaborative filtering, Oozie to form multi-step workflows
- Upload Data Using Command Line Tools
PowerShell script used for the tutorial
Further work: mount Blob storage as a local drive, and try the Hadoop CLI.
Some new things encountered: significance of 0-byte files and the / character in blob storage
- Using Sqoop with HDInsight through SSH
‘Adaptive Server Connection Failed’ FreeTDS error. Create a SQL database on the fly while creating the cluster to bypass it; from the CMD, the error appears when you create a new database.
- Use Oozie with Hadoop to Define and Run a Workflow
The file-upload script doesn’t work; used Storage Explorer instead to do the job.
Now obsolete, as HDInsight has moved from Windows to Linux. Follow this tutorial instead.
- Extra Tutorial on analysis using Hive and HDI and HDFS
- Configure HDInsight clusters
- Manage metastore upgrades; view and edit Ambari configuration groups; view and change service configurations through Ambari; access logs written to Azure Table storage; enable heap dumps for Hadoop services; manage HDInsight configuration, use HDInsight .NET SDK, and PowerShell; perform cluster-level debugging; stop and start services through Ambari; manage Ambari alerts and metrics
- Customize HDInsight Clusters with Bootstrap Configurations
- Monitoring HDInsight with Ambari API
REST API to communicate with Ambari and pull cluster information (see the sketch after this list).
- Manage HDInsight Using Azure CLI
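A minimal sketch of the Ambari REST calls mentioned above, using Python requests against the HDInsight Ambari endpoint; the cluster name and credentials are placeholders, and the response fields follow the Ambari v1 API:
# A minimal sketch: query cluster and service state through Ambari's REST API.
import requests

CLUSTER = "CLUSTERNAME"
BASE = f"https://{CLUSTER}.azurehdinsight.net/api/v1/clusters/{CLUSTER}"
AUTH = ("admin", "CLUSTER_PASSWORD")   # cluster HTTP (Ambari) login

# Overall cluster info.
info = requests.get(BASE, auth=AUTH).json()
print(info["Clusters"]["cluster_name"], info["Clusters"]["version"])

# List services and their current state (HDFS, YARN, HIVE, ...).
services = requests.get(f"{BASE}/services", auth=AUTH).json()
for item in services["items"]:
    name = item["ServiceInfo"]["service_name"]
    state = requests.get(f"{BASE}/services/{name}?fields=ServiceInfo/state",
                         auth=AUTH).json()["ServiceInfo"]["state"]
    print(name, state)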
- Manage and debug HDInsight jobs
- Describe YARN architecture and operation; examine YARN jobs through ResourceManager UI and review running applications; use YARN CLI to kill jobs; find logs for different types of jobs; debug Hadoop and Spark jobs; use Azure Operations Management Suite (OMS) to monitor and manage alerts, and perform predictive actions
- Manage Hadoop Clusters in HDInsight with the Ambari Portal
- Manage HDInsight Clusters with the Ambari Web UI
Important: implements SSH tunneling; Ambari and Hive Views
- Video: Manage and Troubleshoot Infrastructure with Operations Management Suite
- GitHub Scripts to Monitor HBase Clusters with OMS
2) Implement Big Data Batch Processing Solutions
- Implement batch solutions with Hive and Apache Pig
- Define external Hive tables; load data into a Hive table; use partitioning and bucketing to improve Hive performance; use semi-structured files such as XML and JSON with Hive; join tables with Hive using shuffle joins and broadcast joins; invoke Hive UDFs with Java and Python; design scripts with Pig; identify query bottlenecks using the Hive query graph; identify the appropriate storage format, such as Apache Parquet, ORC, Text, and JSON
- Hive tables
- Use partitioning and bucketing to improve Hive performance
Further reading: more on partitioning, bucketing and partitioning examples, bucketing vs partitioning, more on ORC, more on vectorisation
- Use semi-structured files such as XML and JSON with Hive
XML files: not done. JSON SerDe: not done.
- Join tables with Hive using shuffle joins and broadcast joins
- Invoke Hive UDFs with Java and Python
Java UDF format:
public class ExampleUDF extends UDF {
    // Accept a string input
    public String evaluate(String input) {
        // If the value is null, return a null
        if (input == null) return null;
        // Lowercase the input string and return it
        return input.toLowerCase();
    }
}
Hive query:
ADD JAR wasbs:///example/jars/ExampleUDF-1.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION tolower as 'com.microsoft.examples.ExampleUDF';
SELECT tolower(deviceplatform) FROM hivesampletable LIMIT 10;
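For the Python half of this item, Hive UDF-style logic is usually wired in through Hive streaming (TRANSFORM) rather than a compiled class. A minimal sketch mirroring the lower-casing Java UDF above; the script name, storage path, and column names are illustrative assumptions:
#!/usr/bin/env python
# streaming_udf.py: a Hive "UDF" via Hive streaming (TRANSFORM).
# Hive pipes each input row to stdin and reads result rows back from stdout.
import sys

for line in sys.stdin:
    # Assume a single string column per row (e.g. deviceplatform from hivesampletable).
    value = line.strip()
    if not value:
        print("")            # pass empty rows through unchanged
    else:
        print(value.lower()) # same lower-casing logic as the Java UDF above

# Example Hive usage (hypothetical path and column names):
#   ADD FILE wasbs:///example/scripts/streaming_udf.py;
#   SELECT TRANSFORM (deviceplatform)
#     USING 'python streaming_udf.py' AS (deviceplatform_lower string)
#   FROM hivesampletable LIMIT 10;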
- Hive QL Intro
New terms: LLAP to speed up HQL in Hive 2.0, Apache Tez
- Hive Queries with PowerShell
Also has the list of HDI commands for PowerShell; important for the exam.
Get-Credential to read the username and password in a PowerShell script; Here-Strings for complex HQL queries.
PS script used for the tutorial
- Hive Queries with Interactive Hive View
Visualisation tools offer a good insight
- Connecting Excel ODBC to Hive
Install both 32 and 64 bit drivers, to avoid “Application Driver Mismatch” Error.
- JSON SerDe with Hive
mvn package doesn’t produce the tars in target; there are some problems. Try it on Linux later.
- JSON UDFs
Not clear on the import command.
New terms: Lateral View in Hive
- Use Pig with HDInsight
- Use Python with Hive and Pig in HDInsight
Hive query used and Python script used
Pig script used and Python script used
New concepts: Jython, Pig runs on a native JVM. Doing hdfs dfs -put file /file actually uploads it to wasbs:///
PowerShell and CPython parts of the tutorial are left; they seem trivial and can be covered while revising.
- Use Java UDFs for Pig and Hive (1) and (2)
- HDInsight Tools for Visual Studio
Also attempt Hive with VS
Extra Links
- Sensor Data Analysis using Hive
Try hands-on partitioning here!
- Flight Delay Data using Hive
Try hands-on bucketing here!
- Nice presentation on Hive Optimisation
- Design batch ETL solutions for big data with Spark
- Share resources between Spark applications using YARN queues and preemption, select Spark executor and driver settings for optimal performance, use partitioning and bucketing to improve Spark performance, connect to external Spark data sources, incorporate custom Python and Scala code in a Spark DataSets program, identify query bottlenecks using the Spark SQL query graph
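A minimal PySpark sketch of the partitioning and bucketing item above, writing a cleaned dataset as a partitioned, bucketed Hive table; the input path, table, and column names are illustrative assumptions (loosely based on the flight-delay data used earlier):
# A minimal sketch: batch ETL in PySpark with partitioning and bucketing.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-etl-sketch").enableHiveSupport().getOrCreate()

# Read raw CSV from cluster storage (WASB/ADLS path is hypothetical).
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("wasbs:///example/data/flights.csv"))

# Basic cleanup and a typed column.
cleaned = (raw
           .dropna(subset=["Carrier", "DepDelay"])
           .withColumn("DepDelay", F.col("DepDelay").cast("double")))

# Partition on low-cardinality columns and bucket on a join key,
# persisting as a Parquet-backed Hive table for later queries.
(cleaned.write
 .mode("overwrite")
 .partitionBy("Year", "Month")
 .bucketBy(32, "Carrier")
 .sortBy("Carrier")
 .format("parquet")
 .saveAsTable("flights_clean"))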
- Operationalize Hadoop and Spark
- Create and customize a cluster by using ADF; attach storage to a cluster and run an ADF activity; choose between bring-your-own and on-demand clusters; use Apache Oozie with HDInsight; choose between Oozie and ADF; share metastore and storage accounts between a Hive cluster and a Spark cluster to enable the same table across the cluster types; select an appropriate storage type for a data pipeline, such as Blob storage, Azure Data Lake, and local Hadoop Distributed File System (HDFS)
3) Implement Big Data Interactive Processing Solutions
- Implement interactive queries for big data with Spark SQL
- Execute queries using Spark SQL, cache Spark DataFrames for iterative queries, save Spark DataFrames as Parquet files, connect BI tools to Spark clusters, optimize join types such as broadcast versus merge joins, manage Spark Thrift server and change the YARN resources allocation, identify use cases for different storage types for interactive queries
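A minimal PySpark sketch of the first three points above (run a Spark SQL query, cache for iterative use, save as Parquet); the table and output path are assumptions carried over from the batch ETL sketch earlier:
# A minimal sketch: interactive Spark SQL with caching and Parquet output.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").enableHiveSupport().getOrCreate()

# Query a Hive table (e.g. the hypothetical flights_clean table from the ETL sketch).
delays = spark.sql("""
    SELECT Carrier, AVG(DepDelay) AS avg_delay
    FROM flights_clean
    GROUP BY Carrier
""")

# Cache for repeated interactive queries.
delays.cache()
delays.show(10)

# Persist the result as Parquet for downstream tools.
delays.write.mode("overwrite").parquet("wasbs:///example/output/carrier_delays.parquet")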
- Perform exploratory data analysis by using Spark SQL
- Use Jupyter and Apache Zeppelin for visualization and developing tidy Spark DataFrames for modeling, use Spark SQL’s two-table joins to merge DataFrames and cache results, save tidied Spark DataFrames to performant format for reading and analysis (Apache Parquet), manage interactive Livy sessions and their resources
- Use Livy to Submit Spark Jobs Remotely
- Use Zeppelin Notebooks with HDInsight Spark Clusters
Scala/Java class structure: GroupId:ArtifactId:Version
- Use Jupyter Notebooks with HDInsight Spark Clusters
- Adding External Packages to Jupyter Notebooks in HDInsight Spark Clusters
The %%configure magic configures the underlying Livy session to use the package you provide.
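Tying together the Livy submission and external-packages items above: a minimal sketch that POSTs a Spark batch to the cluster's Livy endpoint with Python requests. The cluster name, credentials, file paths, and package coordinates are placeholder assumptions; the %%configure magic sends a similar JSON body to its own Livy session.
# A minimal sketch: submit a Spark batch to HDInsight's Livy REST endpoint.
import json
import requests

LIVY_URL = "https://CLUSTERNAME.azurehdinsight.net/livy/batches"
AUTH = ("admin", "CLUSTER_PASSWORD")   # cluster HTTP login
HEADERS = {"Content-Type": "application/json"}

payload = {
    "file": "wasbs:///example/jars/spark-etl-assembly.jar",  # or a .py file for PySpark
    "className": "com.example.BatchEtl",
    "args": ["wasbs:///example/data/", "wasbs:///example/output/"],
    # Extra Spark conf; %%configure in Jupyter feeds a similar body to its Livy session.
    "conf": {"spark.jars.packages": "com.databricks:spark-csv_2.10:1.4.0"},
}

resp = requests.post(LIVY_URL, auth=AUTH, headers=HEADERS, data=json.dumps(payload))
resp.raise_for_status()
batch = resp.json()
print("Submitted batch id:", batch["id"], "state:", batch["state"])

# Poll the batch until it finishes.
status = requests.get(f"{LIVY_URL}/{batch['id']}", auth=AUTH, headers=HEADERS).json()
print("Current state:", status["state"])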
- Implement interactive queries for big data with Interactive Hive
- Enable Hive LLAP through Hive settings, manage and configure memory allocation for Hive LLAP jobs, connect BI tools to Interactive Hive clusters
- Enable Hive LLAP through Hive settings
- Manage and configure memory allocation for Hive LLAP jobs
Through Ambari->Hive->Configs->Interactive Query
- Connect BI tools to Interactive Hive clusters
- Perform interactive querying and visualization
- Use Ambari Views
- Use HiveQL
- Parse CSV files with Hive
CREATE TABLE TAB_NAME (COL1 COL_TYPE1, COL2 COL_TYPE2) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH 'wasbs:///yourcsvfile.csv' INTO TABLE TAB_NAME;
- Use ORC versus Text for caching
CREATE TABLE IF NOT EXISTS TAB_NAME (COL1 COL_TYPE1, COL2 COL_TYPE2) STORED AS ORC;
- Use internal and external tables in Hive
- Use Zeppelin to visualize data
- Perform exploratory data analysis by using Hive
- Perform interactive querying and visualization, use Ambari Views, use HiveQL, parse CSV files with Hive, use ORC versus Text for caching, use internal and external tables in Hive, use Zeppelin to visualize data
- Useful Links
- Perform interactive processing by using Apache Phoenix on HBase
- Use Phoenix in HDInsight; use Phoenix Grammar for queries; configure transactions, user-defined functions, and secondary indexes; identify and optimize Phoenix performance; select between Hive, Spark, and Phoenix on HBase for interactive processing; identify when to share metastore between a Hive cluster and a Spark cluster
- Useful Links
- Use Phoenix with HBase Clusters
The correct command to run Phoenix queries is
/usr/hdp//phoenix/bin/sqlline.py zookeeper_host:2181:/hbase-unsecure
and not the one mentioned in the tutorials.
- Phoenix Grammar
- Bulk Import in Phoenix
4) Implement Big Data Real-Time Processing Solutions
- Create Spark streaming applications using DStream API
- Define DStreams and compare them to Resilient Distributed Datasets (RDDs), start and stop streaming applications, transform DStreams (flatMap, reduceByKey, updateStateByKey), persist long-term data in HBase and SQL, persist long-term data in Azure Data Lake and Azure Blob Storage, stream data from Apache Kafka or Event Hub, visualize streaming data in a PowerBI real-time dashboard
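A minimal PySpark DStream sketch for the item above: word counts over a socket source with flatMap, reduceByKey, and a stateful updateStateByKey; the host, port, and checkpoint path are illustrative assumptions:
# A minimal sketch: stateful word counts with the DStream API.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-sketch")
ssc = StreamingContext(sc, batchDuration=5)             # 5-second micro-batches
ssc.checkpoint("wasbs:///example/checkpoints/dstream")  # required for updateStateByKey

lines = ssc.socketTextStream("localhost", 9999)

def update_count(new_values, running_count):
    # Keep a running total per word across batches (stateful transformation).
    return sum(new_values) + (running_count or 0)

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .updateStateByKey(update_count))

counts.pprint()

ssc.start()             # start the streaming computation
ssc.awaitTermination()  # run until stopped (ssc.stop() stops it)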
- Create Spark structured streaming applications
- Use the DataFrames and DataSets APIs to create streaming DataFrames and Datasets; create window operations on event time; define window transformations for stateful and stateless operations; stream window functions, reduce by key, and window to summarize streaming data; persist long-term data in HBase and SQL; persist long-term data in Azure Data Lake and Azure Blob Storage; stream data from Kafka or Event Hub; visualize streaming data in a PowerBI real-time dashboard
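A minimal PySpark Structured Streaming sketch for the item above: windowed counts over event time with a watermark; the socket source and window/watermark durations are illustrative assumptions:
# A minimal sketch: windowed aggregation with Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Read a stream of lines; attach a processing timestamp to stand in for event time.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load()
         .withColumn("event_time", F.current_timestamp()))

words = lines.select(F.explode(F.split("value", " ")).alias("word"), "event_time")

# Tumbling 1-minute windows with a 2-minute watermark for late data.
windowed_counts = (words
                   .withWatermark("event_time", "2 minutes")
                   .groupBy(F.window("event_time", "1 minute"), "word")
                   .count())

query = (windowed_counts.writeStream
         .outputMode("update")
         .format("console")
         .option("truncate", "false")
         .start())

query.awaitTermination()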
- Develop big data real-time processing solutions with Apache Storm
- Create Storm clusters for real-time jobs, persist long-term data in HBase and SQL, persist long-term data in Azure Data Lake and Azure Blob Storage, stream data from Kafka or Event Hub, configure event windows in Storm, visualize streaming data in a PowerBI real-time dashboard, define Storm topologies and describe Storm Computation Graph Architecture, create Storm streams and conduct streaming joins, run Storm topologies in local mode for testing, configure Storm applications (Workers, Debug mode), conduct stream groupings to broadcast tuples across components, debug and monitor Storm jobs
- Build solutions that use Kafka
- Create Spark and Storm clusters in the virtual network, manage partitions, configure MirrorMaker, start and stop services through Ambari, manage topics
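A minimal sketch of consuming a Kafka topic from Spark Structured Streaming, which is a common way to exercise the Kafka items above; the broker addresses and topic name are placeholder assumptions, and the spark-sql-kafka package must be available on the cluster:
# A minimal sketch: read a Kafka topic as a streaming DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "wn0-kafka:9092,wn1-kafka:9092")
          .option("subscribe", "sensor-events")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the value to a string for downstream parsing.
messages = events.select(F.col("value").cast("string").alias("json_payload"))

query = (messages.writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()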
- Build solutions that use HBase
- Identify HBase use cases in HDInsight, use HBase Shell to create, update, and drop HBase tables, monitor an HBase cluster, optimize the performance of an HBase cluster, identify use cases for using Phoenix for analytics of real-time data, implement replication in HBase
- Getting Started with HBase on HDInsight
- Adding HBase to Azure Virtual Network
- Configuring HBase Replication
- Real Time Processing with HBase
For revision, do the ‘Using Storm and HBase for sensor data’ tutorial instead of the Twitter one; it will clear up the concepts of both Storm and HBase.
- More on HBase
Useful links:
- Spark Interview Questions
- Spark Streaming
- https://github.com/Azure/learnAnalytics-public/blob/master/HDInsight/Community-Guide-775.md
- https://learnanalytics.microsoft.com/home/certifications
- https://azure.microsoft.com/en-us/documentation/learning-paths/hdinsight-self-guided-hadoop-training/
- https://blogs.msdn.microsoft.com/cindygross/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure/
- https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage#hdinsight-storage-architecture
- http://www.cs.virginia.edu/~hs6ms/publishedPaper/Conference/2016/Scale-up-out-Cloud2016.pdf
Posted in Big Data, Certifications, Hadoop, Spark | Tagged: Big data, Big Data Certification, Exam 70-775, Spark Exam | Leave a Comment »
Exam 70-475 (Designing and Implementing Big Data Analytics Solutions)
Posted by Pramod Singla on December 1, 2016
Skills Measured – Some Links
Design big data batch processing and interactive solutions (20-25%)
- Ingest data for batch and interactive processing, design and provision compute clusters, design for data security, design for batch processing, design interactive queries for big data
- https://azure.microsoft.com/en-us/documentation/services/data-lake-store/
- https://azure.microsoft.com/en-us/documentation/services/sql-data-warehouse/
- https://azure.microsoft.com/en-us/documentation/articles/batch-api-basics/
- https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
Design big data real-time processing solutions (25-30%)
- Ingest data for real-time processing, design and provision compute resources, design for lambda architecture, design for real-time processing
- https://azure.microsoft.com/en-us/documentation/services/stream-analytics/
- https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/#overview and follow the learning guide: https://azure.microsoft.com/en-us/documentation/learning-paths/hdinsight-self-guided-hadoop-training/ – remember you need to appreciate Spark and Storm
Design machine-learning solutions (25-30%)
- Create and manage experiments, determine when to pre-process or train inside Machine Learning Studio, select input/output types, apply custom processing steps with R and Python, publish web services
- This is the one resource that will probably tip you over to passing this section. Fudge, I should have read it with more gusto, especially since the author is a fellow CSA and it is an easy read: http://blogs.msdn.com/b/microsoft_press/archive/2015/04/15/free-ebook-microsoft-azure-essentials-azure-machine-learning.aspx
- https://azure.microsoft.com/en-us/documentation/articles/machine-learning-what-is-ml-studio/
- https://azure.microsoft.com/en-us/documentation/articles/machine-learning-publish-a-machine-learning-web-service/
- https://azure.microsoft.com/en-us/documentation/articles/machine-learning-retrain-models-programmatically/
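For the publish-web-services link above, a minimal sketch of calling a published Azure ML Studio (classic) request-response web service from Python; the endpoint URL, API key, and input schema are placeholders, and the exact body shape depends on how the experiment's input port is defined:
# A minimal sketch: score rows against a published Azure ML Studio web service.
import json
import requests

URL = "https://ussouthcentral.services.azureml.net/workspaces/<workspace>/services/<service>/execute?api-version=2.0&details=true"
API_KEY = "<web-service-api-key>"

body = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["feature1", "feature2"],
            "Values": [["1.0", "2.0"], ["3.0", "4.0"]],
        }
    },
    "GlobalParameters": {},
}

headers = {"Content-Type": "application/json",
           "Authorization": "Bearer " + API_KEY}

resp = requests.post(URL, headers=headers, data=json.dumps(body))
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))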
Operationalize end-to-end cloud analytics solutions (25-30%)
- Create a data factory, orchestrate data processing activities in a data-driven workflow, monitor and manage the data factory, move, transform, and analyze data
- https://azure.microsoft.com/en-us/documentation/learning-paths/data-factory/ ← I love these learning paths; this is a good one
- https://azure.microsoft.com/en-us/documentation/articles/data-factory-monitor-manage-pipelines/
- https://azure.microsoft.com/en-us/documentation/articles/data-factory-data-movement-activities/
- https://azure.microsoft.com/en-us/documentation/articles/data-factory-data-processing-using-batch/
Useful Videos: https://mva.microsoft.com/en-US/training-courses/design-and-implement-big-data-advanced-analytics-solutions-16531?l=9OzFyzWdC_5006218965
Other links: https://www.microsoft.com/en-us/learning/exam-70-475.aspx
Posted in Azure, Big Data, Certifications, Hadoop, Microsoft Exam 70-475 | Tagged: 70-475, Azure data lake, Azure Hdinsight, Big data, Big data exam, Exam 70-475 | 3 Comments »