Cloudera Data Analyst Training: Using Pig, Hive and Impala with Hadoop (CDAPHIH)

Detailed Course Content

Module 1: Hadoop Fundamentals
  • The Motivation for Hadoop
  • Hadoop Overview
  • Data Storage: HDFS
  • Distributed Data Processing: YARN, MapReduce, and Spark
  • Data Processing and Analysis: Pig, Hive, and Impala
  • Data Integration: Sqoop
  • Other Hadoop Data Tools
  • Exercise Scenarios Explanation
Module 2: Introduction to Pig
  • What Is Pig?
  • Pig’s Features
  • Pig Use Cases
  • Interacting with Pig
Module 3: Basic Data Analysis with Pig
  • Pig Latin Syntax
  • Loading Data
  • Simple Data Types
  • Field Definitions
  • Data Output
  • Viewing the Schema
  • Filtering and Sorting Data
  • Commonly Used Functions
Module 4: Processing Complex Data with Pig
  • Storage Formats
  • Complex/Nested Data Types
  • Grouping
  • Built-In Functions for Complex Data
  • Iterating Grouped Data
Module 5: Multi-Dataset Operations with Pig
  • Techniques for Combining Data Sets
  • Joining Data Sets in Pig
  • Set Operations
  • Splitting Data Sets
Module 6: Pig Troubleshooting and Optimization
  • Troubleshooting Pig
  • Logging
  • Using Hadoop’s Web UI
  • Data Sampling and Debugging
  • Performance Overview
  • Understanding the Execution Plan
  • Tips for Improving the Performance of Your Pig Jobs
Module 7: Introduction to Hive and Impala
  • What Is Hive?
  • What Is Impala?
  • Schema and Data Storage
  • Comparing Hive to Traditional Databases
  • Hive Use Cases
Module 8: Querying with Hive and Impala
  • Databases and Tables
  • Basic Hive and Impala Query Language Syntax
  • Data Types
  • Differences Between Hive and Impala Query Syntax
  • Using Hue to Execute Queries
  • Using the Impala Shell
Module 9: Data Management
  • Data Storage
  • Creating Databases and Tables
  • Loading Data
  • Altering Databases and Tables
  • Simplifying Queries with Views
  • Storing Query Results
Module 10: Data Storage and Performance
  • Partitioning Tables
  • Choosing a File Format
  • Managing Metadata
  • Controlling Access to Data
Module 11: Relational Data Analysis with Hive and Impala
  • Joining Datasets
  • Common Built-In Functions
  • Aggregation and Windowing
Module 12: Working with Impala
  • How Impala Executes Queries
  • Extending Impala with User-Defined Functions
  • Improving Impala Performance
Module 13: Analyzing Text and Complex Data with Hive
  • Complex Values in Hive
  • Using Regular Expressions in Hive
  • Sentiment Analysis and N-Grams
  • Conclusion
Module 14: Hive Optimization
  • Understanding Query Performance
  • Controlling Job Execution Plan
  • Bucketing
  • Indexing Data
Module 15: Extending Hive
  • SerDes
  • Data Transformation with Custom Scripts
  • User-Defined Functions
  • Parameterized Queries
Module 16: Choosing the Best Tool for the Job
  • Comparing MapReduce, Pig, Hive, Impala and Relational Databases
  • Which to Choose?