Overview
This Big Data Hadoop course teaches the big data framework using Hadoop and Spark, including HDFS, YARN, and MapReduce. It also covers Pig, Hive, and Impala for processing and analysing large datasets stored in HDFS, and Sqoop and Flume for data ingestion.
Objectives
At the end of the BDHS training, participants will be able to understand and work with the Hadoop and Spark ecosystem topics listed in the course outline below.
Prerequisites
- There are no prerequisites for this course. However, it’s beneficial to have some knowledge of Core Java and SQL.
Course Outline
- Apache Hadoop Overview
- Data Processing
- Introduction to the Hands-On Exercises
- Apache Hadoop Cluster Components
- HDFS Architecture
- Using HDFS
- YARN Architecture
- Working with YARN
- What is Apache Spark?
- Starting the Spark Shell
- Using the Spark Shell
- Getting Started with Datasets and DataFrames
- DataFrame Operations (see the DataFrame sketch after this outline)
- Creating DataFrames from Data Sources
- Saving DataFrames to Data Sources
- DataFrame Schemas
- Eager and Lazy Execution
- Querying DataFrames Using Column Expressions
- Grouping and Aggregation Queries
- Joining DataFrames
- RDD Overview
- RDD Data Sources
- Creating and Saving RDDs
- RDD Operations
- Writing and Passing Transformation Functions
- Transformation Execution
- Converting Between RDDs and DataFrames
- Key-Value Pair RDDs
- Map-Reduce (see the pair RDD sketch after this outline)
- Other Pair RDD Operations
- Datasets and DataFrames
- Creating Datasets
- Loading and Saving Datasets
- Dataset Operations
- Writing a Spark Application (see the application skeleton after this outline)
- Building and Running an Application
- Application Deployment Mode
- The Spark Application Web UI
- Configuring Application Properties
- Review: Apache Spark on a Cluster
- RDD Partitions
- Example: Partitioning in Queries
- Stages and Tasks
- Job Execution Planning
- Example: Catalyst Execution Plan
- Example: RDD Execution Plan
- Apache Spark Streaming Overview
- Creating Streaming DataFrames
- Transforming DataFrames
- Executing Streaming Queries
- Receiving Kafka Messages (see the streaming sketch after this outline)
- Sending Kafka Messages
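The following sketches illustrate a handful of the topics above. First, DataFrame operations from the Spark shell: creating a DataFrame from a data source, querying with column expressions, a grouping and aggregation query, and a join. The file path and the name/dept/salary columns are hypothetical placeholders, not course materials.

```scala
// In spark-shell, `spark` (a SparkSession) and spark.implicits._ (which
// enables the $"..." column syntax) are predefined; this import makes the
// aggregation function available.
import org.apache.spark.sql.functions.avg

// Create a DataFrame from a data source (hypothetical CSV path and schema)
val people = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/people.csv")

// Query with a column expression (lazy: nothing executes yet)
val engineers = people.where($"dept" === "engineering")

// Grouping and aggregation query
val avgSalary = people.groupBy($"dept").agg(avg($"salary").alias("avg_salary"))

// Join two DataFrames on a shared column
val enriched = engineers.join(avgSalary, "dept")

// An action triggers eager execution of the lazy plan built above
enriched.show()
```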
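Next, the classic map-reduce pattern with a key-value pair RDD: a word count. The input and output paths are assumptions, and `sc` is the SparkContext the Spark shell provides.

```scala
// Create an RDD from a text file (hypothetical path)
val lines = sc.textFile("hdfs:///data/sample.txt")

val counts = lines
  .flatMap(line => line.split("\\s+"))  // transformation: one element per word
  .map(word => (word, 1))               // key-value pair RDD
  .reduceByKey(_ + _)                   // reduce phase: sum the counts per key

// Transformations are lazy; this action runs the whole chain
counts.saveAsTextFile("hdfs:///data/wordcounts")
```

The shuffle that `reduceByKey` introduces is where the RDD partitions, stages, and tasks listed above become visible in the Spark Application Web UI.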
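For writing, building, and running a Spark application, here is a minimal skeleton of a standalone application; the package, object name, and input path are assumed names. Once packaged into a jar, it would be launched with something like `spark-submit --class example.AverageSalary --deploy-mode cluster app.jar hdfs:///data/people.parquet`.

```scala
package example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object AverageSalary {
  def main(args: Array[String]): Unit = {
    // A standalone application creates its own SparkSession;
    // the appName appears in the Spark Application Web UI
    val spark = SparkSession.builder
      .appName("Average Salary")
      .getOrCreate()

    // Read the input path passed on the spark-submit command line
    val people = spark.read.parquet(args(0))
    people.groupBy("dept").agg(avg("salary")).show()

    spark.stop()
  }
}
```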
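Finally, a sketch of Spark Structured Streaming with Kafka: creating a streaming DataFrame from one topic, transforming it, and sending the results to another. The broker address, topic names, and checkpoint path are all assumptions, and the spark-sql-kafka connector package must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("KafkaEcho").getOrCreate()

// Receive Kafka messages as a streaming DataFrame
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // assumed broker
  .option("subscribe", "events-in")                   // assumed source topic
  .load()

// Kafka rows carry binary key/value columns; cast the value to a string
val messages = input.selectExpr("CAST(value AS STRING) AS value")

// Send the messages to another Kafka topic; start() begins executing
// the streaming query
val query = messages.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "events-out")                      // assumed sink topic
  .option("checkpointLocation", "hdfs:///checkpoints/kafka-echo")
  .start()

query.awaitTermination()
```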