Analyzing with Cloudera Data Warehouse (ACDW) – Outline

Detailed Course Outline

Foundations for Big Data Analytics
  • Big Data Analytics Overview
  • Data Storage: HDFS
  • Distributed Data Processing: YARN, MapReduce, and Spark
  • Data Processing and Analysis: Hive and Impala
  • Database Integration: Sqoop
  • Other Data Tools
  • Exercise Scenario Explanation
Introduction to Apache Hive and Impala
  • What Is Hive?
  • What Is Impala?
  • Why Use Hive and Impala?
  • Schema and Data Storage
  • Comparing Hive and Impala to Traditional Databases
  • Use Cases
Querying with Apache Hive and Impala
  • Databases and Tables
  • Basic Hive and Impala Query Language Syntax
  • Data Types
  • Using Hue to Execute Queries
  • Using Beeline (Hive's Shell)
  • Using the Impala Shell
Common Operators and Built-In Functions
  • Operators
  • Scalar Functions
  • Aggregate Functions
Data Management
  • Data Storage
  • Creating Databases and Tables
  • Loading Data
  • Altering Databases and Tables
  • Simplifying Queries with Views
  • Storing Query Results
Data Storage and Performance
  • Partitioning Tables
  • Loading Data into Partitioned Tables
  • When to Use Partitioning
  • Choosing a File Format
  • Using Avro and Parquet File Formats
Working with Multiple Datasets
  • UNION and Joins
  • Handling NULL Values in Joins
  • Advanced Joins
Analytic Functions and Windowing
  • Using Analytic Functions
  • Other Analytic Functions
  • Sliding Windows
Complex Data
  • Complex Data with Hive
  • Complex Data with Impala
Analyzing Text
  • Using Regular Expressions with Hive and Impala
  • Processing Text Data with SerDes in Hive
  • Sentiment Analysis and n-grams in Hive
Apache Hive Optimization
  • Understanding Query Performance
  • Cost-Based Optimization and Statistics
  • Bucketing
  • ORC File Optimizations
Apache Impala Optimization
  • How Impala Executes Queries
  • Improving Impala Performance
Extending Hive and Impala
  • User-Defined Functions
  • Parameterized Queries
Choosing the Best Tool for the Job
  • Comparing Hive, Impala, and Relational Databases
  • Which to Choose?
CDP Public Cloud Data Warehouse
  • Data Warehouse Overview
  • Auto-Scaling
  • Managing Virtual Warehouses
  • Querying Data Using CLI and Third-Party Integration
Appendix: Apache Kudu
  • What Is Kudu?
  • Kudu Tables
  • Using Impala with Kudu