Apache Spark

CloudLabs

Projects

Assignment

24x7 Support

Lifetime Access


Course Overview

Our Apache Spark certification training prepares you to master Spark SQL for querying structured data and Spark Streaming for real-time processing of streaming data from a variety of sources. You also gain the skills required to work with large datasets stored in a distributed file system and to execute Spark applications on a Hadoop cluster at your organization.

By the end of the course, you will have a deep understanding of the Spark architecture and of what makes it faster than MapReduce. With easy-to-follow, step-by-step instructions, trainees learn how to create and operate on DataFrames built from all of their organization's data sources. In CloudLabs, our virtual lab environment, you gain hands-on experience querying tables and views in Spark SQL. You will also learn how to write sophisticated parallel applications that enable faster decisions across a wide variety of use cases.
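To give a sense of the kind of hands-on DataFrame work described above, here is a minimal sketch in Scala of reading a structured data source and running an aggregation query. It is not taken from the course labs; the file path and the column names ("country", "revenue") are illustrative assumptions only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameSketch")
      .getOrCreate()

    // Create a DataFrame from a structured data source (hypothetical CSV path)
    val customers = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/customers.csv")

    // Group, aggregate, and inspect the result
    customers
      .groupBy("country")
      .agg(sum("revenue").as("total_revenue"))
      .orderBy(desc("total_revenue"))
      .show()

    spark.stop()
  }
}
```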

At the end of the training, participants will be able to:

  1. Understand the architecture of Spark and explain its business use cases
  2. Distribute, store, and process data using RDDs in a Hadoop cluster
  3. Use Spark SQL for querying databases (a brief sketch follows this list)
  4. Write, configure, and deploy Spark applications on a cluster
  5. Use the Spark shell for interactive data analysis
  6. Process and query structured data using Spark SQL
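As an illustration of the interactive-analysis and Spark SQL objectives above, the following short sketch shows how a table might be registered and queried from the spark-shell, where the `spark` session is predefined. The Parquet path, the view name `web_logs`, and its columns are hypothetical examples, not part of the course material.

```scala
// DataFrame from a (hypothetical) Parquet data source
val logs = spark.read.parquet("hdfs:///data/web_logs")

// Register a temporary view so it can be queried with SQL
logs.createOrReplaceTempView("web_logs")

// Query the view with Spark SQL and inspect the result interactively
spark.sql(
  """SELECT status, COUNT(*) AS hits
    |FROM web_logs
    |GROUP BY status
    |ORDER BY hits DESC""".stripMargin
).show()
```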

Pre-requisite

  1. Knowledge of the Apache Hadoop ecosystem, SQL, the Linux command line, and Scala is required.

Duration

2 days

Course Outline

  1. What is Apache Spark?
  2. Starting the Spark Shell
  3. Using the Spark Shell
  4. Getting Started with Datasets and DataFrames
  5. DataFrame Operations
  6. Working with DataFrames and Schemas
  7. Creating DataFrames from Data Sources
  8. Saving DataFrames to Data Sources
  9. DataFrame Schemas
  10. Eager and Lazy Execution
  11. Analyzing Data with DataFrame Queries
  12. Querying DataFrames Using Column Expressions
  13. Grouping and Aggregation Queries
  14. Joining DataFrames
  15. RDD Overview
  16. RDD Data Sources
  17. Creating and Saving RDDs
  18. RDD Operations
  19. Writing and Passing Transformation Functions
  20. Transformation Execution
  21. Converting Between RDDs and DataFrames
  22. Querying Tables in Spark Using SQL
  23. Querying Files and Views
  24. The Catalog API
  25. Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
  26. Apache Spark Applications
  27. Writing a Spark Application (see the sketch after this outline)
  28. Building and Running an Application
  29. Application Deployment Mode
  30. The Spark Application Web UI
  31. Configuring Application Properties
  32. Review: Apache Spark on a Cluster
  33. RDD Partitions
  34. Example: Partitioning in Queries
  35. Stages and Tasks
  36. Job Execution Planning
  37. Example: Catalyst Execution Plan
  38. Example: RDD Execution Plan
  39. Data Processing
  40. Common Apache Spark Use Cases
  41. Iterative Algorithms in Apache Spark
  42. Machine Learning
  43. Example: k-means
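To give a flavour of the application-oriented topics in the outline, below is a small, self-contained sketch of a standalone Spark application that creates an RDD, passes transformation functions, converts the result to a DataFrame, and saves it. The word-count task and the HDFS paths are assumptions for illustration, not an exercise from the course.

```scala
import org.apache.spark.sql.SparkSession

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCountApp").getOrCreate()
    import spark.implicits._

    // RDD created from a text-file data source (hypothetical path)
    val lines = spark.sparkContext.textFile("hdfs:///data/shakespeare.txt")

    // Transformation functions passed to flatMap/map/reduceByKey; executed lazily
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word.toLowerCase, 1))
      .reduceByKey(_ + _)

    // Convert the RDD to a DataFrame and save it to a data source
    counts.toDF("word", "count")
      .write.mode("overwrite")
      .parquet("hdfs:///results/word_counts")

    spark.stop()
  }
}
```

Such an application would typically be packaged as a JAR and submitted to the cluster, for example with `spark-submit --class WordCountApp --master yarn --deploy-mode cluster wordcount.jar`, which ties into the deployment-mode and application-configuration topics listed above.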

Reviews