BIG DATA ANALYTICS TRAINING
Course Overview
Today, organizations around the world are inundated with huge amounts of data from all directions, and to make the best use of it they must be able to harness all relevant data and analyze it to make the decisions that transform their business. With this explosion in data, Hadoop has grown in significance as organizations worldwide have found it to be the best platform for managing and processing big data.
To make the most efficient use of the Hadoop platform, and to fully analyze and utilize every bit of data for maximum productivity, training is of paramount importance. Trained Hadoop data analysts are in high demand because they can apply best practices to work with big data faster and more effectively.
Our Hadoop Data Analyst course is for those who wish to access, manipulate, and analyze massive data sets on Hadoop using SQL and familiar scripting languages. Learn how to transform data with Apache Pig, Apache Hive, and Cloudera Impala, and how to analyze it using filters, joins, and user-defined functions familiar from other technologies.
At the end of the training, participants will be able to:
- Explain the basics of Apache Hadoop and perform data ETL (extract, transform, load), ingestion, and processing with Hadoop tools
- Join multiple data sets and analyze disparate data with Pig (see the sketch after this list)
- Organize data into tables, perform transformations, and simplify complex queries with Hive
- Perform real-time interactive analyses on massive data sets stored in HDFS or HBase using SQL with Impala
- Pick the best tool for a given task in Hadoop, achieve interoperability, and manage repetitive workflows
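As a taste of the Pig material, the following is a minimal sketch of joining and filtering two data sets through Pig's Java entry point (PigServer). The input files customers.tsv and orders.tsv, their schemas, and the output directory big_orders are hypothetical placeholders, and the Pig libraries are assumed to be on the classpath.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigJoinSketch {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode for illustration; use ExecType.MAPREDUCE on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical inputs: customers.tsv (id, name) and orders.tsv (id, amount).
        pig.registerQuery("customers = LOAD 'customers.tsv' AS (id:int, name:chararray);");
        pig.registerQuery("orders = LOAD 'orders.tsv' AS (id:int, amount:double);");

        // Join the two data sets on the customer id, then filter on the order amount.
        pig.registerQuery("joined = JOIN customers BY id, orders BY id;");
        pig.registerQuery("big = FILTER joined BY amount > 100.0;");

        // Write the filtered relation out to the big_orders directory.
        pig.store("big", "big_orders");
    }
}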
Prerequisites
- Working knowledge of SQL and basic familiarity with a scripting language
Duration
5 days
Course Outline
Introduction to Big Data
- What is Big Data?
- Data Analytics
- Big Data Challenges
- Technologies supporting big data

Introduction to Hadoop
- What is Hadoop?
- History of Hadoop
- Basic Concepts
- Future of Hadoop
- The Hadoop Distributed File System (HDFS)
- Anatomy of a Hadoop Cluster
- Breakthroughs of Hadoop
- Hadoop Distributions:
  • Apache Hadoop
  • Cloudera Hadoop
  • Hortonworks Hadoop
  • MapR Hadoop
HDFS and Cluster Architecture
- NameNode
- DataNode
- Secondary NameNode
- JobTracker
- TaskTracker
- Blocks and Input Splits
- Data Replication
- Hadoop Rack Awareness
- Cluster Architecture and Block Placement
- Accessing HDFS:
  • Java approach
  • CLI approach

Hadoop Installation and Configuration
- Local Mode
- Pseudo-distributed Mode
- Fully Distributed Mode
- Pseudo-mode installation and configuration
- Basic HDFS file operations (see the Java sketch after this list)
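The basic HDFS file operations above map directly to the org.apache.hadoop.fs API. A minimal sketch, assuming a Hadoop client configuration on the classpath, a hypothetical local file data.txt, and a hypothetical target directory /user/demo:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasicOps {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // without them, this falls back to the local file system.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/demo");   // hypothetical target directory
        fs.mkdirs(dir);

        // Upload a local file into HDFS, the programmatic equivalent of `hdfs dfs -put`.
        fs.copyFromLocalFile(new Path("data.txt"), new Path(dir, "data.txt"));

        // List the directory, analogous to `hdfs dfs -ls /user/demo`.
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}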
MapReduce Programming
- Basic API Concepts
- The Driver Class
- The Mapper Class
- The Reducer Class
- The Combiner Class
- The Partitioner Class
- Examining a sample MapReduce program, with several examples (a word-count sketch follows this list)
- Hadoop's Streaming API
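To give a flavor of the Driver, Mapper, and Reducer classes listed above, here is the classic word-count program, sketched with the org.apache.hadoop.mapreduce API. Input and output paths are taken from the command line, and the reducer doubles as a combiner.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts for each word; the same class serves as the combiner.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: wires the job together and submits it.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this runs on a cluster with: hadoop jar wordcount.jar WordCount <input dir> <output dir>.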
Hadoop Ecosystem Tools
- Pig
- Hive (see the JDBC sketch after this list)
- Sqoop
- HBase
- Oozie
- Flume
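For the Hive topic, one common access path from Java is JDBC against HiveServer2. A minimal sketch, assuming a HiveServer2 instance at localhost:10000, the hive-jdbc driver on the classpath, and a hypothetical pageviews table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; provided by the hive-jdbc dependency.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            // Aggregate query against a hypothetical pageviews table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM pageviews GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}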
Integrations
- MapReduce and Hive integration
- MapReduce and HBase integration (see the HBase client sketch after this list)
- Java and Hive integration
- Hive and HBase integration
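The HBase integration topics build on the HBase Java client API. A minimal sketch, assuming a running HBase reachable through hbase-site.xml and a hypothetical table users with column family info:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "u1", column info:name.
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("info:name = " + Bytes.toString(name));
        }
    }
}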