Cloudera Developer Training for MapReduce (CDTMR)

Detailed Course Content

Module 1: The Motivation for Hadoop

  • Problems with Traditional Large-Scale Systems
  • Introducing Hadoop
  • Hadoopable Problems

Module 2: Hadoop: Basic Concepts and HDFS

  • The Hadoop Project and Hadoop Components
  • The Hadoop Distributed File System

Module 3: Introduction to MapReduce

  • MapReduce Overview
  • Example: WordCount
  • Mappers
  • Reducers

Module 4: Hadoop Clusters and the Hadoop Ecosystem

  • Hadoop Cluster Overview
  • Hadoop Jobs and Tasks
  • Other Hadoop Ecosystem Components

Module 5: Writing a MapReduce Program in Java

  • Basic MapReduce API Concepts
  • Writing MapReduce Drivers, Mappers and Reducers in Java
  • Speeding Up Hadoop Development by Using Eclipse
  • Differences Between the Old and New MapReduce APIs

Module 6: Writing a MapReduce Program Using Streaming

  • Writing Mappers and Reducers with the Streaming API

Module 7: Unit Testing MapReduce Programs

  • Unit Testing
  • The JUnit and MRUnit Testing Frameworks
  • Writing Unit Tests with MRUnit
  • Running Unit Tests

Module 8: Delving Deeper into the Hadoop API

  • Using the ToolRunner Class
  • Setting Up and Tearing Down Mappers and Reducers
  • Decreasing the Amount of Intermediate Data with Combiners
  • Accessing HDFS Programmatically
  • Using the Distributed Cache
  • Using the Hadoop API’s Library of Mappers, Reducers, and Partitioners

Module 9: Practical Development Tips and Techniques

  • Strategies for Debugging MapReduce Code
  • Testing MapReduce Code Locally by Using LocalJobRunner
  • Writing and Viewing Log Files
  • Retrieving Job Information with Counters
  • Reusing Objects
  • Creating Map-Only MapReduce Jobs

Module 10: Partitioners and Reducers

  • How Partitioners and Reducers Work Together
  • Determining the Optimal Number of Reducers for a Job
  • Writing Custom Partitioners

Module 11: Data Input and Output

  • Creating Custom Writable and WritableComparable Implementations
  • Saving Binary Data Using SequenceFile and Avro Data Files
  • Issues to Consider When Using File Compression
  • Implementing Custom InputFormats and OutputFormats

Module 12: Common MapReduce Algorithms

  • Sorting and Searching Large Data Sets
  • Indexing Data
  • Computing Term Frequency-Inverse Document Frequency
  • Calculating Word Co-Occurrence
  • Performing Secondary Sort

Module 13: Joining Data Sets in MapReduce Jobs

  • Writing a Map-Side Join
  • Writing a Reduce-Side Join

Module 14: Integrating Hadoop into the Enterprise Workflow

  • Integrating Hadoop into an Existing Enterprise
  • Loading Data from an RDBMS into HDFS by Using Sqoop
  • Managing Real-Time Data Using Flume
  • Accessing HDFS from Legacy Systems with FuseDFS and HttpFS

Module 15: An Introduction to Hive, Impala, and Pig

  • The Motivation for Hive, Impala, and Pig
  • Hive Overview
  • Impala Overview
  • Pig Overview
  • Choosing Between Hive, Impala, and Pig

Module 16: An Introduction to Oozie

  • Introduction to Oozie
  • Creating Oozie Workflows