Cloudera Developer Training for Apache Spark

$2,495.00


  • Classroom

  • Onsite

Duration: 3 Days

Overview

Apache Spark is the next-generation successor to MapReduce. Spark is a powerful, open-source processing engine for data in a Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100 times faster than traditional Hadoop MapReduce programs.

In this course, you will build complete, unified big data applications that combine batch, streaming, and interactive analytics on all of your data. You will learn to use Spark to write sophisticated parallel applications that enable faster, better decisions and real-time action, applied to a wide variety of use cases, architectures, and industries.

What You'll Learn

  • Use the Spark shell for interactive data analysis
  • Describe the features of Spark's Resilient Distributed Datasets (RDDs)
  • Understand the fundamentals of running Spark on a cluster
  • Write parallel programs with Spark
  • Write and run Spark applications
  • Process streaming data with Spark

Who Needs to Attend

Developers and software engineers

Prerequisites

  • Some programming experience (Python and Scala suggested)
  • Basic knowledge of Linux
  • Prior knowledge of Hadoop is not required

Follow-On Courses

There are no follow-on courses.

Course Outline

1. Why Spark?

  • Problems with Traditional Large-Scale Systems
  • Introducing Spark

2. Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • Resilient Distributed Datasets (RDDs)
  • Functional Programming with Spark
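
As a taste of the functional style this module covers, the sketch below mirrors Spark's `filter` and `map` transformations in plain Python (standard library only, not the Spark API; the log lines are invented for illustration — in real Spark code these would be RDD transformations such as `rdd.filter(...)` and `rdd.map(...)`):

```python
# Plain-Python sketch of Spark's functional transformation style.
# (Illustrative only: real Spark code applies these to an RDD created
# with sc.textFile or sc.parallelize.)
lines = ["ERROR disk full", "INFO job started", "ERROR timeout"]

# filter: keep only the error lines (Spark: rdd.filter(lambda l: ...))
errors = list(filter(lambda l: l.startswith("ERROR"), lines))

# map: transform each element (Spark: rdd.map(lambda l: ...))
messages = list(map(lambda l: l.split(" ", 1)[1], errors))

print(messages)  # ['disk full', 'timeout']
```

Passing small anonymous functions (lambdas) to transformations like these is the core programming pattern the course builds on.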

3. Working with RDDs

  • RDD Operations
  • Key-Value Pair RDDs
  • MapReduce and Pair RDD Operations
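
The key-value "pair RDD" pattern this module teaches can be sketched in plain Python as the classic word count (illustrative only; actual Spark code would chain `flatMap`, `map`, and `reduceByKey` on an RDD, and the input lines here are invented):

```python
from collections import defaultdict

# Plain-Python sketch of the pair-RDD word-count pattern.
text = ["spark makes big data simple", "big data big results"]

# flatMap-equivalent: split each line into words
words = [w for line in text for w in line.split()]

# map-equivalent: emit a (word, 1) pair for each word
pairs = [(w, 1) for w in words]

# reduceByKey-equivalent: sum the counts for each key
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```

In Spark the per-key reduction runs in parallel across the cluster, but the shape of the computation — emit pairs, then combine by key — is the same.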

4. The Hadoop Distributed File System

  • Why HDFS?
  • HDFS Architecture
  • Using HDFS

5. Running Spark on a Cluster

  • A Spark Standalone Cluster
  • The Spark Standalone Web UI

6. Parallel Programming with Spark

  • RDD Partitions and HDFS Data Locality
  • Working with Partitions
  • Executing Parallel Operations

7. Caching and Persistence

  • RDD Lineage
  • Caching Overview
  • Distributed Persistence

8. Writing Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Configuring Spark Properties
  • Building and Running a Spark Application
  • Logging

9. Spark, Hadoop, and the Enterprise Data Center

  • Spark and the Hadoop Ecosystem
  • Spark and MapReduce

10. Spark Streaming

  • Example: Streaming Word Count
  • Other Streaming Operations
  • Sliding Window Operations
  • Developing Spark Streaming Applications
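
The sliding-window idea covered here can be sketched in plain Python (illustrative only; Spark Streaming would express this with windowed DStream operations such as `reduceByKeyAndWindow` over batch intervals, and the per-batch counts below are invented):

```python
from collections import deque

# Plain-Python sketch of a sliding-window total over micro-batches.
window_size = 3                      # window spans the last 3 batches
window = deque(maxlen=window_size)   # old batches fall out automatically

batches = [5, 2, 7, 1, 4]            # e.g. events received per batch interval
windowed_counts = []
for batch in batches:
    window.append(batch)
    windowed_counts.append(sum(window))  # total events in the current window

print(windowed_counts)  # [5, 7, 14, 10, 12]
```

Each new micro-batch slides the window forward by one interval, so every output value summarizes the most recent `window_size` batches.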

11. Common Spark Algorithms

  • Iterative Algorithms
  • Graph Analysis
  • Machine Learning

12. Improving Spark Performance

  • Shared Variables: Broadcast Variables
  • Shared Variables: Accumulators
  • Common Performance Issues
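
The broadcast-variable idea from this module can be sketched in plain Python: a small lookup table is shared with every task once, rather than joined against a large distributed dataset (illustrative only; Spark would create the shared table with `sc.broadcast(...)`, and the data here is invented):

```python
# Plain-Python sketch of the broadcast-variable pattern.
# Small lookup table, shipped to every worker once in real Spark:
country_names = {"us": "United States", "de": "Germany"}

# Large dataset (stand-in for a distributed RDD of records):
records = [("us", 10), ("de", 20), ("us", 5)]

# Each "task" reads the shared table locally; no shuffle or join needed.
joined = [(country_names[code], n) for code, n in records]

print(joined)
```

Avoiding a full shuffle-based join for small lookup data is one of the common performance wins this module discusses.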