Duration: 3 Days
In this hands-on course, you will learn how Apache Pig, Apache Hive, and Cloudera Impala enable data transformations and analyses via filters, joins, and user-defined functions familiar from other technologies. You will apply traditional data analytics and business intelligence skills to big data, and you will access, manipulate, and analyze complex data sets using SQL and familiar scripting languages.
Apache Hive makes multi-structured data accessible to analysts, database administrators, and others without Java programming expertise. Apache Pig applies the fundamentals of familiar scripting languages to the Hadoop cluster. Cloudera Impala enables real-time interactive analysis of the data stored in Hadoop via a native SQL environment.
What You Will Learn
 
- Fundamentals of Apache Hadoop and data extract, transform, load (ETL), ingestion, and processing with Hadoop tools
- Joining multiple data sets and analyzing disparate data with Pig
- Organizing data into tables, performing transformations, and simplifying complex queries with Hive
- Performing real-time interactive analyses on massive data sets stored in HDFS or HBase using SQL with Impala
- How to pick the best analysis tool for a given task in Hadoop
Audience
 
Data analysts, business analysts, developers, and administrators
Prerequisites
 
- Familiarity with SQL and basic UNIX or Linux commands
- Prior knowledge of Java and Apache Hadoop is not required
Course Outline
 
1. Hadoop Fundamentals
- The Motivation for Hadoop
- Hadoop Overview
- HDFS
- MapReduce
- The Hadoop Ecosystem
- Hands-On Exercise: Data Ingest with Hadoop Tools
2. Introduction to Pig
- What Is Pig?
- Pig's Features
- Pig Use Cases
- Interacting with Pig
3. Basic Data Analysis with Pig
- Pig Latin Syntax
- Loading Data
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly Used Functions
- Hands-On Exercise: Using Pig for ETL Processing
4. Processing Complex Data with Pig
- Storage Formats
- Complex/Nested Data Types
- Grouping
- Built-In Functions for Complex Data
- Iterating Grouped Data
- Hands-On Exercise: Analyzing Ad Campaign Data with Pig
5. Multi-Dataset Operations with Pig
- Techniques for Combining Data Sets
- Joining Data Sets in Pig
- Set Operations
- Splitting Data Sets
- Hands-On Exercise: Analyzing Disparate Data Sets with Pig
6. Extending Pig
- Adding Flexibility with Parameters
- Macros and Imports
- UDFs
- Contributed Functions
- Using Other Languages to Process Data with Pig
- Hands-On Exercise: Extending Pig with Streaming and UDFs
7. Pig Troubleshooting and Optimization
- Troubleshooting Pig
- Logging
- Using Hadoop's Web UI
- Optional Demo: Troubleshooting a Failed Job with the Web UI
- Data Sampling and Debugging
- Performance Overview
- The Execution Plan
- Tips for Improving the Performance of Your Pig Jobs
8. Introduction to Hive
- What Is Hive?
- Hive Schema and Data Storage
- Comparing Hive to Traditional Databases
- Hive vs. Pig
- Hive Use Cases
- Interacting with Hive
9. Relational Data Analysis with Hive
- Hive Databases and Tables
- Basic HiveQL Syntax
- Data Types
- Joining Data Sets
- Common Built-In Functions
- Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
10. Hive Data Management
- Hive Data Formats
- Creating Databases and Hive-Managed Tables
- Loading Data into Hive
- Altering Databases and Tables
- Self-Managed Tables
- Simplifying Queries with Views
- Storing Query Results
- Controlling Access to Data
- Hands-On Exercise: Data Management with Hive
11. Text Processing with Hive
- Overview of Text Processing
- Important String Functions
- Using Regular Expressions in Hive
- Sentiment Analysis and N-Grams
- Hands-On Exercise (Optional): Gaining Insight with Sentiment Analysis
12. Hive Optimization
- Query Performance
- Controlling the Job Execution Plan
- Partitioning
- Bucketing
- Indexing Data
13. Extending Hive
- SerDes
- Data Transformation with Custom Scripts
- User-Defined Functions
- Parameterized Queries
- Hands-On Exercise: Data Transformation with Hive
14. Introduction to Impala
- What Is Impala?
- How Impala Differs from Hive and Pig
- How Impala Differs from Relational Databases
- Limitations and Future Directions
- Using the Impala Shell
15. Analyzing Data with Impala
- Basic Syntax
- Data Types
- Filtering, Sorting, and Limiting Results
- Joining and Grouping Data
- Improving Impala Performance
- Hands-On Exercise: Interactive Analysis with Impala
16. Choosing the Best Tool for the Job
- Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
- Which to Choose?
Course Labs
 
You will participate in hands-on exercises throughout the course.