Apache Spark

Course Overview

Apache Spark is an open-source, distributed computing system designed for fast, flexible, and expressive data processing. It is a general-purpose engine that handles a wide variety of data types and workloads, including batch processing, stream processing, machine learning, and graph processing. Training in Apache Spark typically covers the fundamental concepts and architecture of Spark and how to develop and deploy Spark-based applications. It may also cover topics such as Spark Streaming, Spark SQL, and machine learning with Spark MLlib.
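
As a small taste of the API the course builds toward, here is a minimal word-count sketch against the Dataset API. It is illustrative only: the input file input.txt and the local master setting are assumptions, not course material.

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local mode is enough for a first experiment; on a cluster the
        // master is supplied by spark-submit instead.
        val spark = SparkSession.builder
          .appName("WordCount")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // input.txt is a hypothetical local text file
        val counts = spark.read.textFile("input.txt")
          .flatMap(_.split("\\s+"))   // split lines into words
          .groupByKey(identity)       // group identical words
          .count()                    // count occurrences per word

        counts.show()
        spark.stop()
      }
    }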

At the end of the training, participants will be able to:

  1. Understand the architecture of Spark and explain its business use cases
  2. Distribute, store, and process data using RDDs in a Hadoop cluster
  3. Use Spark SQL to query databases
  4. Write, configure, and deploy Spark applications on a cluster
  5. Use the Spark shell for interactive data analysis
  6. Process and query structured data using Spark SQL (see the sketch after this list)
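
As a minimal illustration of objectives 3 and 6, the sketch below loads a hypothetical JSON file into a DataFrame, registers it as a temporary view, and queries it with Spark SQL. The file name customers.json and its fields are assumptions.

    // Inside spark-shell a SparkSession named `spark` already exists.
    val customers = spark.read.json("customers.json")

    // Register a temporary view so it can be queried with plain SQL
    customers.createOrReplaceTempView("customers")

    spark.sql(
      "SELECT city, SUM(spend) AS total FROM customers GROUP BY city ORDER BY total DESC"
    ).show()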

Prerequisites

Working knowledge of the Apache Hadoop ecosystem, SQL, the Linux command line, and Scala is required.

Duration

2 days

Course Outline

  1. What is Apache Spark?
  2. Starting the Spark Shell
  3. Using the Spark Shell
  4. Getting Started with Datasets and DataFrames
  5. DataFrame Operations
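
The shell topics above are hands-on; a representative spark-shell session might look like the following sketch (the log file name is hypothetical):

    // $ spark-shell   -- starts a Scala REPL with `spark` and `sc` predefined
    val logs = spark.read.textFile("access.log")   // hypothetical log file
    logs.filter(_.contains("ERROR")).count()       // actions run immediately in the shell
    logs.show(5, truncate = false)                 // peek at the first few lines
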
  1. Working with DataFrames and Schemas
  2. Creating DataFrames from Data Sources
  3. Saving DataFrames to Data Sources
  4. DataFrame Schemas
  5. Eager and Lazy Execution
  6. Analyzing Data with DataFrame Queries
  7. Querying DataFrames Using Column Expressions
  8. Grouping and Aggregation Queries
  9. Joining DataFrames
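
A short sketch of the three query styles this module names: column expressions, grouping with aggregation, and a join. The inline data is hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object DataFrameQueries {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("DataFrameQueries").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical inline data standing in for real sources
        val orders = Seq((1, "alice", 30.0), (2, "bob", 10.0), (3, "alice", 25.0))
          .toDF("id", "user", "amount")
        val users = Seq(("alice", "US"), ("bob", "DE")).toDF("user", "country")

        val spendByCountry = orders
          .join(users, "user")                    // join on the shared column
          .groupBy($"country")                    // grouping
          .agg(sum($"amount").as("total_spend"))  // aggregation via a column expression
          .orderBy(desc("total_spend"))

        spendByCountry.show()
        spark.stop()
      }
    }
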
  1. RDD Overview
  2. RDD Data Sources
  3. Creating and Saving RDDs
  4. RDD Operations
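
Creating and saving RDDs, as listed above, might look like this shell sketch (the HDFS paths are hypothetical):

    // In spark-shell; `sc` is the predefined SparkContext.
    val lines = sc.textFile("hdfs:///data/events.txt")   // RDD from a file
    val nums  = sc.parallelize(1 to 1000)                // RDD from a local collection
    lines.filter(_.nonEmpty).saveAsTextFile("hdfs:///data/events-clean")
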
  1. Writing and Passing Transformation Functions
  2. Transformation Execution
  3. Converting Between RDDs and DataFrames
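
A sketch of passing a named transformation function, lazy execution, and converting between RDDs and DataFrames; the file path and field layout are assumptions:

    // In spark-shell. Named functions can be passed to transformations
    // just like anonymous ones.
    def parsePair(line: String): (String, Int) = {
      val fields = line.split(",")
      (fields(0), fields(1).toInt)
    }

    val pairs  = sc.textFile("hdfs:///data/scores.csv").map(parsePair) // lazy: nothing runs yet
    val totals = pairs.reduceByKey(_ + _)                              // still lazy
    totals.take(3)                                                     // action triggers execution

    import spark.implicits._
    val df        = totals.toDF("name", "score")  // RDD -> DataFrame
    val backToRdd = df.rdd                        // DataFrame -> RDD[Row]
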
  1. Querying Tables in Spark Using SQL
  2. Querying Files and Views
  3. The Catalog API
  4. Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
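
Two of the topics above in miniature: querying a file directly with SQL, and inspecting metadata through the Catalog API. The Parquet path is hypothetical:

    // Files can be queried in SQL without registering a table first:
    spark.sql("SELECT * FROM parquet.`hdfs:///data/sales.parquet` LIMIT 10").show()

    // The Catalog API exposes databases, tables, and views programmatically:
    spark.catalog.listDatabases().show()
    spark.catalog.listTables().show()
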
  1. Apache Spark Applications
  2. Writing a Spark Application
  3. Building and Running an Application
  4. Application Deployment Mode
  5. The Spark Application Web UI
  6. Configuring Application Properties
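
A minimal standalone application, sketched under the assumption that the input arrives as a Parquet path on the command line; the object name and paths are placeholders:

    import org.apache.spark.sql.SparkSession

    // Unlike the shell, an application creates its own SparkSession
    // and is launched with spark-submit.
    object MyApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("MyApp").getOrCreate()
        val df = spark.read.parquet(args(0))   // input path from the command line
        println(s"rows: ${df.count()}")
        spark.stop()
      }
    }

Packaged into a JAR (for example with sbt package), it might be submitted with something like spark-submit --class MyApp --master yarn --deploy-mode cluster myapp.jar hdfs:///data/input.parquet, where the jar name and input path are placeholders.
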
  1. Review: Apache Spark on a Cluster
  2. RDD Partitions
  3. Example: Partitioning in Queries
  4. Stages and Tasks
  5. Job Execution Planning
  6. Example: Catalyst Execution Plan
  7. Example: RDD Execution Plan
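
Execution planning can be inspected directly from the shell; this sketch prints the Catalyst plans for a small aggregation (the inline data is hypothetical):

    import spark.implicits._
    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // Prints the parsed, analyzed, optimized, and physical plans
    df.groupBy("key").sum("value").explain(true)

    // Partitioning is visible at the RDD level
    df.rdd.getNumPartitions
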
  1. Data Processing
  2. Common Apache Spark Use Cases
  3. Iterative Algorithms in Apache Spark
  4. Machine Learning
  5. Example: k-means
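
For the final example topic, a hedged sketch of k-means clustering with Spark MLlib on hypothetical two-dimensional points:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    object KMeansExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("KMeansExample").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical two-dimensional points
        val points = Seq((0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)).toDF("x", "y")

        // MLlib expects features assembled into a single vector column
        val features = new VectorAssembler()
          .setInputCols(Array("x", "y"))
          .setOutputCol("features")
          .transform(points)

        // Cluster into k = 2 groups and print the centers
        val model = new KMeans().setK(2).setSeed(1L).fit(features)
        model.clusterCenters.foreach(println)

        spark.stop()
      }
    }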
