Hadoop Development and Admin
Table of Contents
Introduction to Hadoop
- Distributed Computing, Cloud Computing
- Big Data Basics and the Need for Parallel Processing
- How Hadoop Works
- Introduction to HDFS and MapReduce
Hadoop Architecture Details
- NameNode
- DataNode
- Secondary NameNode
- JobTracker
- TaskTracker
HDFS (Hadoop Distributed File System)
- Hadoop Distributed File System, Background, GFS
- Data Replication
- Data Storage
- Data Retrieval
- Additional HDFS commands
MapReduce Programming
- MapReduce, Background
- Writing MapReduce Programs
- Writable and WritableComparable
- InputFormat, OutputFormat
- InputSplit and Block Size
- Combiner
- Partitioner
- Number of Mappers and Reducers
- Counters
MapReduce Algorithms and Exercises
- Line Count and Word Count
- Distributed Search
- Sorting Data – Key Value Data Type
- Mathematical Transformation example
- Working with Counters exercise
- Distributed Cache exercise
- Zero Reducer based exercises
Hadoop Streaming
- Introduction to Hadoop Streaming
- Streaming API details and use cases
- Python Based Example for Streaming API
- Exercise for Hadoop Streaming (XML File Based)
- Exercises in Ruby
- Exercise in C# using Microsoft Azure
Apache Pig
- Installation
- Execution Types
- Grunt Shell
- Pig Latin
- Data Processing
- Loading and Storing
- Data Filtering
- Grouping & Joining Operations
- Hands on Exercises
Apache HBase Installation and Details
- HBase and NoSQL Introduction
- HBase Installation and Configuration
- HBase and Java-Based Integration
- HBase Hadoop Integration Details
- HBase Basic Exercises
Apache Hive Installation and Details
- Hive Installation on a Single-Node Hadoop Cluster
- Hive Services
- Hive Shell Description
- Hive Server
- Metastore Details
- HiveQL Basics
- Working with Tables, Databases, etc.
- Hive JDBC programming
- Hands on Exercises and Assignments
Hadoop Infrastructure Planning
- Basic Hadoop Hardware and Software Requirements
- Small, Medium, and Large Clusters
- Networking Challenges in Hadoop Deployment
- Disaster Recovery (DR) in Hadoop
- Performance Tuning a Large Cluster
Hadoop Industry Solutions
- EMC Greenplum Introduction
- IBM BigInsights Details
- Hadoop Offerings from Oracle, Microsoft, etc.
- Cloudera and Hortonworks Hadoop Packages
Hadoop and Cloud Computing
- Using Cloud technologies for distributed processing
- Hadoop on Amazon Web Services
- Hadoop in Oracle Cloud / Rackspace
Apache Spark Training
Introduction to Spark
- What is Spark?
- Review: From Hadoop MapReduce to Spark
- Introduction: HDFS
- Introduction: YARN
- Introduction: Mesos
- Spark Overview
Spark Basics
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
Working with RDDs in Spark
- Creating RDDs
- Other General RDD Operations
Aggregating Data with Pair RDDs
- Key-Value Pair RDDs
- Map-Reduce
- Other Pair RDD Operations
Writing and Deploying Spark Applications
- Spark Applications vs. Spark Shell
- Creating the SparkContext
- Building a Spark Application (Scala and Java)
- Running a Spark Application
- The Spark Application Web UI
- Hands-On Exercise: Write and Run a Spark Application
- Configuring Spark Properties
- Logging
Parallel Processing
- Review: Spark on a Cluster
- RDD Partitions
- Partitioning of File-based RDDs
- HDFS and Data Locality
- Executing Parallel Operations
- Stages and Tasks
Basic Spark Streaming
- Spark Streaming Overview
- Example: Streaming Request Count
- DStreams
- Developing Spark Streaming Applications
Advanced Spark Streaming
- Multi-Batch Operations
- State Operations
- Sliding Window Operations
- Advanced Data Sources
Common Patterns in Spark Data Processing
- Common Spark Use Cases
- Iterative Algorithms in Spark
- Graph Processing and Analysis
- Machine Learning
- Example: k-means
Improving Spark Performance
- Shared Variables: Broadcast Variables
- Shared Variables: Accumulators
- Common Performance Issues
- Diagnosing Performance Problems
Spark SQL and DataFrames
- Spark SQL and the SQL Context
- Creating DataFrames
- Transforming and Querying DataFrames
- Saving DataFrames
- DataFrames and RDDs
- Comparing Spark SQL, Impala and Hive-on-Spark
Spark MLlib (Machine Learning)
- Basic Principles of Machine Learning
- Spark ML API Patterns
- Built-in Featurizing and Algorithm APIs
- Transformation and Correlation Algorithms