Overview
The Apache Spark and Scala course is designed to help you become proficient in Apache Spark development. It covers Spark Core, the motivation for Apache Spark, Spark internals, RDDs, Spark SQL, Spark Streaming, MLlib, and GraphX, which together form the key constituents of the course.
Objectives
At the end of the Apache Spark & Scala training course, participants will
Prerequisites
Hadoop Basics
Course Outline
- Overview of Hadoop
- Architecture of HDFS & YARN
- Overview of Spark version 2.2.0
- Spark Architecture
- Spark Components
- Comparison of Spark & Hadoop
- Installation of Spark 2.2.0 on 64-bit Linux
- Exploring the Spark shell
- Creating Spark Context
- Operations on Resilient Distributed Dataset – RDD
- Transformations & Actions
- Loading Data and Saving Data
- Introduction to SQL Operations
- SQL Context
- Data Frame
- Working with Hive
- Loading Partitioned Tables
- Processing CSV, JSON, and Parquet files
- Introduction to Scala
- Features of Scala
- Scala vs Java Comparison
- Data types
- Data Structures
- Arrays
- Literals
- Logical Operators
- Mutable & Immutable variables
- Type inference
- OOP vs Functions
- Anonymous functions
- Recursive functions
- Call-by-name
- Currying
- Conditional statements
- List
- Map
- Sets
- Options
- Tuples
- Mutable collection
- Immutable collection
- Iterating
- Filtering and counting
- Group By
- Flat Map
- Word count
- File Access
- Classes, Objects & Properties
- Inheritance
- Maven build tool implementation
- Build Libraries
- Create Jar files
- Spark-Submit
- Overview of Spark Streaming
- Architecture of Spark Streaming
- File streaming
- Twitter Streaming
- Overview of Kafka Streaming
- Architecture of Kafka Streaming
- Kafka Installation
- Topic
- Producer
- Consumer
- File streaming
- Twitter Streaming
- Overview of Machine Learning Algorithms
- Linear Regression
- Logistic Regression
- GraphX overview
- Vertices
- Edges
- Triplets
- Page Rank
- Pregel
- On-heap and off-heap memory tuning
- Kryo Serialization
- Broadcast Variable
- Accumulator Variable
- DAG Scheduler
- Data Locality
- Checkpointing
- Speculative Execution
- Garbage Collection
- Master – Driver Node capacity
- Slave – Worker Node capacity
- Executor capacity
- Executor core capacity
- Project scenario and execution
- Out-of-memory error handling
- Master logs, Worker logs, Driver logs
- Monitoring Web UI
- Heap memory dump
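
Sample exercise
Several Scala collection topics in the outline (Flat Map, Filtering and counting, Group By, Word count) fit together in one small exercise. The sketch below is an illustration only, not taken from the course material: it uses plain Scala collections rather than Spark, and the names WordCountSketch and wordCount are hypothetical. On an RDD the same idea would typically be written with flatMap, map, and reduceByKey.

```scala
// Illustrative sketch only; WordCountSketch and wordCount are hypothetical names.
object WordCountSketch {

  // Counts how often each whitespace-separated word occurs across all lines.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // Flat Map: each line becomes many words
      .filter(_.nonEmpty)                   // Filtering: drop empty tokens
      .groupBy(identity)                    // Group By: word -> all its occurrences
      .map { case (word, occurrences) => word -> occurrences.size } // Counting

  def main(args: Array[String]): Unit = {
    val lines = Seq("spark and scala", "spark streaming")
    // Sort by word so the printed order is stable.
    wordCount(lines).toSeq.sortBy(_._1).foreach {
      case (word, count) => println(s"$word: $count")
    }
  }
}
```

For the two sample lines, "spark" is counted twice and every other word once.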