Hadoop Development and Admin

Table of Contents

Introduction to Hadoop

  • Distributed computing, cloud computing
  • Big Data Basics and the Need for Parallel Processing
  • How Hadoop Works
  • Introduction to HDFS and MapReduce

Hadoop Architecture Details

  • NameNode
  • DataNode
  • Secondary NameNode
  • JobTracker
  • TaskTracker

HDFS (Hadoop Distributed File System)

  • Hadoop Distributed File System: Background and GFS
  • Data Replication
  • Data Storage
  • Data Retrieval
  • Additional HDFS commands

MapReduce Programming

  • MapReduce, Background
  • Writing MapReduce Programs
  • Writable and WritableComparable
  • InputFormat and OutputFormat
  • Input Split and Block Size
  • Combiner
  • Partitioner
  • Number of Mappers and Reducers
  • Counters

MapReduce Algorithms and Exercises

  • Line Count and Word Count
  • Distributed Search
  • Sorting Data – Key Value Data Type
  • Mathematical Transformation Example
  • Working with Counters Exercise
  • Distributed Cache Exercise
  • Zero-Reducer Exercises

Hadoop Streaming

  • Introduction to Hadoop Streaming
  • Streaming API details and use cases
  • Python Based Example for Streaming API
  • Exercise for Hadoop Streaming (XML File Based)
  • Exercises in Ruby
  • Exercise in C# Using Microsoft Azure

Apache Pig

  • Installation
  • Execution Types
  • Grunt Shell
  • Pig Latin
  • Data Processing
  • Loading and Storing
  • Data Filtering
  • Grouping & Joining Operations
  • Hands-On Exercises

Apache HBase Installation and Details

  • HBase and NoSQL Introduction
  • HBase Installation and Configuration
  • HBase and Java-Based Integration
  • HBase Hadoop Integration Details
  • HBase Basic Exercises

Apache Hive Installation and Details

  • Hive Installation on a Single-Node Hadoop Cluster
  • Hive Services
  • Hive Shell Description
  • Hive Server
  • Metastore Details
  • HiveQL Basics
  • Working with Tables, Databases, etc.
  • Hive JDBC Programming
  • Hands-On Exercises and Assignments

Hadoop Infrastructure Planning

  • Basic Hadoop Hardware and Software Requirements
  • Small, Medium, and Large Clusters
  • Networking Challenges in Hadoop Deployment
  • Disaster Recovery (DR) in Hadoop
  • Performance Tuning a Large Cluster

Hadoop Industry Solutions

  • EMC Greenplum Introduction
  • IBM BigInsights Details
  • Hadoop Offerings from Oracle, Microsoft, etc.
  • Cloudera and Hortonworks Hadoop Packages

Hadoop and Cloud Computing

  • Using Cloud Technologies for Distributed Processing
  • Hadoop on Amazon Web Services
  • Hadoop in Oracle Cloud / Rackspace

Apache Spark Training

Introduction to Spark

  • What is Spark?
  • Review: From Hadoop MapReduce to Spark
  • Introduction: HDFS
  • Introduction: YARN
  • Introduction: Mesos
  • Spark Overview

Spark Basics

  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark

Working with RDDs in Spark

  • Creating RDDs
  • Other General RDD Operations

Aggregating Data with Pair RDDs

  • Key-Value Pair RDDs
  • Map-Reduce
  • Other Pair RDD Operations

Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Hands-On Exercise: Write and Run a Spark Application
  • Configuring Spark Properties
  • Logging

Parallel Processing

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks

Basic Spark Streaming

  • Spark Streaming Overview
  • Example: Streaming Request Count
  • DStreams
  • Developing Spark Streaming Applications

Advanced Spark Streaming

  • Multi-Batch Operations
  • State Operations
  • Sliding Window Operations
  • Advanced Data Sources

Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means

Improving Spark Performance

  • Shared Variables: Broadcast Variables
  • Shared Variables: Accumulators
  • Common Performance Issues
  • Diagnosing Performance Problems

Spark SQL and DataFrames

  • Spark SQL and the SQLContext
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • DataFrames and RDDs
  • Comparing Spark SQL, Impala and Hive-on-Spark

Spark MLlib (Machine Learning)

  • Basic Principles of Machine Learning
  • Spark ML API Patterns
  • Built-in Featurizing and Algorithm APIs
  • Transformation and Correlation Algorithms