Preparing with Cloudera Data Engineering (PCDE) – Outline

Detailed Course Outline

HDFS Introduction
  • HDFS Overview
  • HDFS Components and Interactions
  • Additional HDFS Interactions
  • Ozone Overview
  • Exercise: Working with HDFS
YARN Introduction
  • YARN Overview
  • YARN Components and Interactions
  • Working with YARN
  • Exercise: Working with YARN
Working with RDDs
  • Resilient Distributed Datasets (RDDs)
  • Exercise: Working with RDDs
Working with DataFrames
  • Introduction to DataFrames
  • Exercise: Introducing DataFrames
  • Exercise: Reading and Writing DataFrames
  • Exercise: Working with Columns
  • Exercise: Working with Complex Types
  • Exercise: Combining and Splitting DataFrames
  • Exercise: Summarizing and Grouping DataFrames
  • Exercise: Working with UDFs
  • Exercise: Working with Windows
Introduction to Apache Hive
  • About Hive
  • Transforming Data with HiveQL
Working with Apache Hive
  • Exercise: Working with Partitions
  • Exercise: Working with Buckets
  • Exercise: Working with Skew
  • Exercise: Using SerDes to Ingest Text Data
  • Exercise: Using Complex Types to Denormalize Data
Hive and Spark Integration
  • Hive and Spark Integration
  • Exercise: Spark Integration with Hive
Distributed Processing Challenges
  • Shuffle
  • Skew
  • Order
Spark Distributed Processing
  • Spark Distributed Processing
  • Exercise: Explore Query Execution Order
Spark Distributed Persistence
  • DataFrame and Dataset Persistence
  • Persistence Storage Levels
  • Viewing Persisted RDDs
  • Exercise: Persisting DataFrames
Data Engineering Service
  • Create and Trigger Ad-Hoc Spark Jobs
  • Orchestrate a Set of Jobs Using Airflow
  • Data Lineage Using Atlas
  • Auto-scaling in Data Engineering Service
Workload XM
  • Optimize Workloads, Performance, Capacity
  • Identify Suboptimal Spark Jobs
Appendix: Working with Datasets in Scala
  • Working with Datasets in Scala
  • Exercise: Using Datasets in Scala