Apache Spark

Course Overview

Apache Spark is an open-source, distributed computing system designed for fast, flexible, and expressive data processing. It is a general-purpose engine that handles a wide variety of data types and workloads, including batch processing, stream processing, machine learning, and graph processing. Training in Apache Spark typically covers the fundamental concepts and architecture of Spark and how to develop and deploy Spark-based applications. It may also cover topics such as Spark Streaming, Spark SQL, and machine learning with Spark MLlib.
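
As a small taste of the API the course builds toward, here is a minimal word-count sketch against the Dataset API. It is illustrative only: the input file input.txt and the local master setting are assumptions, not course material.

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local mode is enough for a first experiment; on a cluster the
        // master is supplied by spark-submit instead.
        val spark = SparkSession.builder
          .appName("WordCount")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // input.txt is a hypothetical local text file
        val counts = spark.read.textFile("input.txt")
          .flatMap(_.split("\\s+"))   // split lines into words
          .groupByKey(identity)       // group identical words
          .count()                    // count occurrences per word

        counts.show()
        spark.stop()
      }
    }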

At the end of the training, participants will be able to:

  1. Understand the architecture of Spark and explain its business use cases
  2. Distribute, store, and process data using RDDs in a Hadoop cluster
  3. Use Spark SQL to query databases
  4. Write, configure, and deploy Spark applications on a cluster
  5. Use the Spark shell for interactive data analysis
  6. Process and query structured data using Spark SQL (see the sketch after this list)
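
As a minimal illustration of objectives 3 and 6, the sketch below loads a hypothetical JSON file into a DataFrame, registers it as a temporary view, and queries it with Spark SQL. The file name customers.json and its fields are assumptions.

    // Inside spark-shell a SparkSession named `spark` already exists.
    val customers = spark.read.json("customers.json")

    // Register a temporary view so it can be queried with plain SQL
    customers.createOrReplaceTempView("customers")

    spark.sql(
      "SELECT city, SUM(spend) AS total FROM customers GROUP BY city ORDER BY total DESC"
    ).show()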

Prerequisites

Working knowledge of the Apache Hadoop ecosystem, SQL, the Linux command line, and Scala is required.

Duration

2 days

Course Outline

  1. What is Apache Spark?
  2. Starting the Spark Shell
  3. Using the Spark Shell
  4. Getting Started with Datasets and DataFrames
  5. DataFrame Operations
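
The shell topics above are hands-on; a representative spark-shell session might look like the following sketch (the log file name is hypothetical):

    // $ spark-shell   -- starts a Scala REPL with `spark` and `sc` predefined
    val logs = spark.read.textFile("access.log")   // hypothetical log file
    logs.filter(_.contains("ERROR")).count()       // actions run immediately in the shell
    logs.show(5, truncate = false)                 // peek at the first few lines
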
  1. Working with DataFrames and Schemas
  2. Creating DataFrames from Data Sources
  3. Saving DataFrames to Data Sources
  4. DataFrame Schemas
  5. Eager and Lazy Execution
  6. Analyzing Data with DataFrame Queries
  7. Querying DataFrames Using Column Expressions
  8. Grouping and Aggregation Queries
  9. Joining DataFrames
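
A short sketch of the three query styles this module names: column expressions, grouping with aggregation, and a join. The inline data is hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object DataFrameQueries {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("DataFrameQueries").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical inline data standing in for real sources
        val orders = Seq((1, "alice", 30.0), (2, "bob", 10.0), (3, "alice", 25.0))
          .toDF("id", "user", "amount")
        val users = Seq(("alice", "US"), ("bob", "DE")).toDF("user", "country")

        val spendByCountry = orders
          .join(users, "user")                    // join on the shared column
          .groupBy($"country")                    // grouping
          .agg(sum($"amount").as("total_spend"))  // aggregation via a column expression
          .orderBy(desc("total_spend"))

        spendByCountry.show()
        spark.stop()
      }
    }
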
  1. RDD Overview
  2. RDD Data Sources
  3. Creating and Saving RDDs
  4. RDD Operations
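
Creating and saving RDDs, as listed above, might look like this shell sketch (the HDFS paths are hypothetical):

    // In spark-shell; `sc` is the predefined SparkContext.
    val lines = sc.textFile("hdfs:///data/events.txt")   // RDD from a file
    val nums  = sc.parallelize(1 to 1000)                // RDD from a local collection
    lines.filter(_.nonEmpty).saveAsTextFile("hdfs:///data/events-clean")
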
  1. Writing and Passing Transformation Functions
  2. Transformation Execution
  3. Converting Between RDDs and DataFrames
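
A sketch of passing a named transformation function, lazy execution, and converting between RDDs and DataFrames; the file path and field layout are assumptions:

    // In spark-shell. Named functions can be passed to transformations
    // just like anonymous ones.
    def parsePair(line: String): (String, Int) = {
      val fields = line.split(",")
      (fields(0), fields(1).toInt)
    }

    val pairs  = sc.textFile("hdfs:///data/scores.csv").map(parsePair) // lazy: nothing runs yet
    val totals = pairs.reduceByKey(_ + _)                              // still lazy
    totals.take(3)                                                     // action triggers execution

    import spark.implicits._
    val df        = totals.toDF("name", "score")  // RDD -> DataFrame
    val backToRdd = df.rdd                        // DataFrame -> RDD[Row]
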
  1. Querying Tables in Spark Using SQL
  2. Querying Files and Views
  3. The Catalog API
  4. Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
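
Two of the topics above in miniature: querying a file directly with SQL, and inspecting metadata through the Catalog API. The Parquet path is hypothetical:

    // Files can be queried in SQL without registering a table first:
    spark.sql("SELECT * FROM parquet.`hdfs:///data/sales.parquet` LIMIT 10").show()

    // The Catalog API exposes databases, tables, and views programmatically:
    spark.catalog.listDatabases().show()
    spark.catalog.listTables().show()
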
  1. Apache Spark Applications
  2. Writing a Spark Application
  3. Building and Running an Application
  4. Application Deployment Mode
  5. The Spark Application Web UI
  6. Configuring Application Properties
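
A minimal standalone application, sketched under the assumption that the input arrives as a Parquet path on the command line; the object name and paths are placeholders:

    import org.apache.spark.sql.SparkSession

    // Unlike the shell, an application creates its own SparkSession
    // and is launched with spark-submit.
    object MyApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("MyApp").getOrCreate()
        val df = spark.read.parquet(args(0))   // input path from the command line
        println(s"rows: ${df.count()}")
        spark.stop()
      }
    }

Packaged into a JAR (for example with sbt package), it might be submitted with something like spark-submit --class MyApp --master yarn --deploy-mode cluster myapp.jar hdfs:///data/input.parquet, where the jar name and input path are placeholders.
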
  1. Review: Apache Spark on a Cluster
  2. RDD Partitions
  3. Example: Partitioning in Queries
  4. Stages and Tasks
  5. Job Execution Planning
  6. Example: Catalyst Execution Plan
  7. Example: RDD Execution Plan
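
Execution planning can be inspected directly from the shell; this sketch prints the Catalyst plans for a small aggregation (the inline data is hypothetical):

    import spark.implicits._
    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // Prints the parsed, analyzed, optimized, and physical plans
    df.groupBy("key").sum("value").explain(true)

    // Partitioning is visible at the RDD level
    df.rdd.getNumPartitions
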
  1. Data Processing
  2. Common Apache Spark Use Cases
  3. Iterative Algorithms in Apache Spark
  4. Machine Learning
  5. Example: k-means
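
For the final example topic, a hedged sketch of k-means clustering with Spark MLlib on hypothetical two-dimensional points:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    object KMeansExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("KMeansExample").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical two-dimensional points
        val points = Seq((0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)).toDF("x", "y")

        // MLlib expects features assembled into a single vector column
        val features = new VectorAssembler()
          .setInputCols(Array("x", "y"))
          .setOutputCol("features")
          .transform(points)

        // Cluster into k = 2 groups and print the centers
        val model = new KMeans().setK(2).setSeed(1L).fit(features)
        model.clusterCenters.foreach(println)

        spark.stop()
      }
    }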
