Cloudera Developer Training for Apache Spark

$2,495.00


  • Classroom

  • Onsite

Duration: 3 Days

Overview

Apache Spark is the next-generation successor to MapReduce. Spark is a powerful, open-source processing engine for data in a Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100 times faster than traditional Hadoop MapReduce programs.

In this course, you will build complete, unified big data applications that combine batch, streaming, and interactive analytics on all of your data. You will learn to use Spark to write sophisticated parallel applications that enable faster, better decisions and real-time action, applied to a wide variety of use cases, architectures, and industries.

What You'll Learn

  • Use the Spark shell for interactive data analysis
  • Describe the features of Spark's Resilient Distributed Datasets (RDDs)
  • Understand the fundamentals of running Spark on a cluster
  • Write parallel programs with Spark
  • Write and run Spark applications
  • Process streaming data with Spark

Who Needs to Attend

Developers and software engineers

Prerequisites

  • Some programming experience (Python and Scala suggested)
  • Basic knowledge of Linux
  • Prior knowledge of Hadoop is not required

Follow-On Courses

There are no follow-on courses.

Course Outline

1. Why Spark?

  • Problems with Traditional Large-Scale Systems
  • Introducing Spark

2. Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • Resilient Distributed Datasets (RDDs)
  • Functional Programming with Spark
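
As a taste of the functional style this module covers, the sketch below mirrors Spark's `filter` and `map` transformations in plain Python (standard library only, not the Spark API; the log lines are invented for illustration — in real Spark code these would be RDD transformations such as `rdd.filter(...)` and `rdd.map(...)`):

```python
# Plain-Python sketch of Spark's functional transformation style.
# (Illustrative only: real Spark code applies these to an RDD created
# with sc.textFile or sc.parallelize.)
lines = ["ERROR disk full", "INFO job started", "ERROR timeout"]

# filter: keep only the error lines (Spark: rdd.filter(lambda l: ...))
errors = list(filter(lambda l: l.startswith("ERROR"), lines))

# map: transform each element (Spark: rdd.map(lambda l: ...))
messages = list(map(lambda l: l.split(" ", 1)[1], errors))

print(messages)  # ['disk full', 'timeout']
```

Passing small anonymous functions (lambdas) to transformations like these is the core programming pattern the course builds on.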

3. Working with RDDs

  • RDD Operations
  • Key-Value Pair RDDs
  • MapReduce and Pair RDD Operations
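
The key-value "pair RDD" pattern this module teaches can be sketched in plain Python as the classic word count (illustrative only; actual Spark code would chain `flatMap`, `map`, and `reduceByKey` on an RDD, and the input lines here are invented):

```python
from collections import defaultdict

# Plain-Python sketch of the pair-RDD word-count pattern.
text = ["spark makes big data simple", "big data big results"]

# flatMap-equivalent: split each line into words
words = [w for line in text for w in line.split()]

# map-equivalent: emit a (word, 1) pair for each word
pairs = [(w, 1) for w in words]

# reduceByKey-equivalent: sum the counts for each key
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```

In Spark the per-key reduction runs in parallel across the cluster, but the shape of the computation — emit pairs, then combine by key — is the same.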

4. The Hadoop Distributed File System

  • Why HDFS?
  • HDFS Architecture
  • Using HDFS

5. Running Spark on a Cluster

  • A Spark Standalone Cluster
  • The Spark Standalone Web UI

6. Parallel Programming with Spark

  • RDD Partitions and HDFS Data Locality
  • Working with Partitions
  • Executing Parallel Operations

7. Caching and Persistence

  • RDD Lineage
  • Caching Overview
  • Distributed Persistence

8. Writing Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Configuring Spark Properties
  • Building and Running a Spark Application
  • Logging

9. Spark, Hadoop, and the Enterprise Data Center

  • Spark and the Hadoop Ecosystem
  • Spark and MapReduce

10. Spark Streaming

  • Example: Streaming Word Count
  • Other Streaming Operations
  • Sliding Window Operations
  • Developing Spark Streaming Applications
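
The sliding-window idea covered here can be sketched in plain Python (illustrative only; Spark Streaming would express this with windowed DStream operations such as `reduceByKeyAndWindow` over batch intervals, and the per-batch counts below are invented):

```python
from collections import deque

# Plain-Python sketch of a sliding-window total over micro-batches.
window_size = 3                      # window spans the last 3 batches
window = deque(maxlen=window_size)   # old batches fall out automatically

batches = [5, 2, 7, 1, 4]            # e.g. events received per batch interval
windowed_counts = []
for batch in batches:
    window.append(batch)
    windowed_counts.append(sum(window))  # total events in the current window

print(windowed_counts)  # [5, 7, 14, 10, 12]
```

Each new micro-batch slides the window forward by one interval, so every output value summarizes the most recent `window_size` batches.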

11. Common Spark Algorithms

  • Iterative Algorithms
  • Graph Analysis
  • Machine Learning

12. Improving Spark Performance

  • Shared Variables: Broadcast Variables
  • Shared Variables: Accumulators
  • Common Performance Issues
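
The broadcast-variable idea from this module can be sketched in plain Python: a small lookup table is shared with every task once, rather than joined against a large distributed dataset (illustrative only; Spark would create the shared table with `sc.broadcast(...)`, and the data here is invented):

```python
# Plain-Python sketch of the broadcast-variable pattern.
# Small lookup table, shipped to every worker once in real Spark:
country_names = {"us": "United States", "de": "Germany"}

# Large dataset (stand-in for a distributed RDD of records):
records = [("us", 10), ("de", 20), ("us", 5)]

# Each "task" reads the shared table locally; no shuffle or join needed.
joined = [(country_names[code], n) for code, n in records]

print(joined)
```

Avoiding a full shuffle-based join for small lookup data is one of the common performance wins this module discusses.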