Overview
This training introduces you to Hadoop and MapReduce. Through a series of practical, hands-on exercises, you will learn to write complex MapReduce transformations, work with HDFS, and write scripts using the advanced features of Pig. You will also learn the Hive environment, the Hive query language, and how to perform data analysis with Hive.
Objectives
At the end of the Apache Pig & Hive training course, participants will learn
Prerequisites
- Understanding of Linux commands and SQL queries
- Basic knowledge of core Java
Course Outline
- Hadoop overview
- Surveying the Hadoop components
- Defining the Hadoop architecture
Storing data in HDFS
- Achieving reliable and secure storage
- Monitoring storage metrics
- Controlling HDFS from the Command Line
Parallel processing with MapReduce
- Detailing the MapReduce approach
- Transferring algorithms not data
- Dissecting the key stages of a MapReduce job
Automating data transfer
- Facilitating data ingress and egress
- Aggregating data with Flume
- Configuring data fan-in and fan-out
- Moving relational data with Sqoop
- Contrasting Pig with MapReduce
- Identifying Pig use cases
- Pinpointing key Pig configurations
- Pig Latin: Relational Operators
- File Loaders
- Group Operator
- COGROUP Operator
- Joins and COGROUP
- Union, Diagnostic Operators
- Pig UDF
Structuring unstructured data
- Representing data in Pig’s data model
- Running Pig Latin commands at the Grunt Shell
- Expressing transformations in Pig Latin Syntax
- Invoking Load and Store functions
Transforming data with Relational Operators
- Creating new relations with joins
- Reducing data size by sampling
- Extending Pig with user-defined functions
Filtering data with Pig
- Consolidating data sets with unions
- Partitioning data sets with splits
- Injecting parameters into Pig scripts
- Hive Background
- Hive Use Case
- About Hive
- Hive vs Pig
- Hive Architecture and Components
- Metastore in Hive
- Limitations of Hive
- Comparison with Traditional Database
- Hive Data Types and Data Models
- Partitions and Buckets
- Hive Tables (Managed Tables and External Tables)
- Importing Data
- Querying Data
- Managing Outputs
- Hive Script
- Hive UDF and Hive Demo on Healthcare Data set
- HiveQL: Joining Tables
- Dynamic Partitioning
- Custom MapReduce Scripts
- Thrift Server
- User-Defined Functions