A managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
It lets you transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
Architecture
The central component of Amazon EMR is the cluster.
A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances.
Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as the node type.
Amazon EMR also installs different software components on each node type, giving each node a role in a distributed application like Apache Hadoop.
All EMR clusters, including high-availability clusters, are launched in a single Availability Zone.
Node Types:
Primary node
manages the cluster
runs software components to coordinate the distribution of data and tasks among other nodes for processing
tracks the status of tasks and monitors the health of the cluster
it’s possible to create a single-node cluster with only the primary node
Core node
has software components that run tasks and store data in the Hadoop Distributed File System (HDFS)
Multi-node clusters have at least one core node.
Can resize the core instance group of a running cluster.
Can add or remove core nodes, but removing them risks losing HDFS data.
Amazon EMR automatically provisions a replacement if a core node fails.
Runs YARN NodeManager daemons, Hadoop MapReduce tasks, and Spark executors.
Task node
runs software components that only run tasks and do not store data in HDFS.
Task nodes are optional.
Can remove task nodes on the fly.
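The three node types map onto EMR instance groups. A minimal sketch of that layout, in the shape boto3's EMR `run_job_flow` accepts under `Instances["InstanceGroups"]` — the instance types and counts here are illustrative assumptions, not required values:

```python
# Illustrative instance-group layout for the three node types.
# The EMR API still uses the legacy role name MASTER for the primary node.
instance_groups = [
    {"Name": "Primary", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "Core",    "InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
    {"Name": "Task",    "InstanceRole": "TASK",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
]

print(sorted(g["InstanceRole"] for g in instance_groups))  # ['CORE', 'MASTER', 'TASK']
```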
Cluster Scaling:
Automatic Scaling:
Two options:
Custom Scaling
You define and manage the automatic scaling policies and rules.
Instance groups only
Based on Amazon CloudWatch metrics
Programmatically scale out and scale in core nodes and task nodes based on a CloudWatch metric and other parameters that you specify in a scaling policy
You can define the evaluation periods only in five-minute increments.
Not limited to YARN-based applications; scaling is driven by the CloudWatch metrics you choose.
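A sketch of a custom automatic-scaling policy for an instance group, in the shape boto3's EMR `put_auto_scaling_policy` accepts — the rule name, metric, and threshold are illustrative assumptions; note the alarm `Period` in five-minute (300-second) increments:

```python
# Illustrative custom scaling policy: add one instance when available
# YARN memory drops below 15%.
scaling_policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
    "Rules": [
        {
            "Name": "ScaleOutOnLowMemory",
            "Action": {
                "SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": 1,   # add one instance per trigger
                    "CoolDown": 300,
                }
            },
            "Trigger": {
                "CloudWatchAlarmDefinition": {
                    "MetricName": "YARNMemoryAvailablePercentage",
                    "ComparisonOperator": "LESS_THAN",
                    "Threshold": 15.0,
                    "EvaluationPeriods": 1,
                    "Period": 300,  # evaluation periods come in 5-minute increments
                    "Statistic": "AVERAGE",
                    "Unit": "PERCENT",
                }
            },
        }
    ],
}

print(scaling_policy["Rules"][0]["Trigger"]["CloudWatchAlarmDefinition"]["Period"])  # 300
```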
Managed Scaling
No policy is required. Amazon EMR manages the automatic scaling activity.
Increase or decrease the number of instances or units in your cluster based on workload.
Instance groups or instance fleets
Only YARN applications are supported, such as Spark, Hadoop, Hive, Flink.
Manual Scaling
Add and remove instances manually from core and task instance groups and instance fleets in a running cluster.
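A manual resize boils down to setting a new target count on an instance group. A sketch of the request shape used by boto3's EMR `modify_instance_groups` — the cluster and instance-group IDs are placeholders:

```python
# Illustrative manual-resize request: grow one instance group to 4 instances.
resize_request = {
    "ClusterId": "j-XXXXXXXXXXXXX",          # placeholder cluster ID
    "InstanceGroups": [
        {"InstanceGroupId": "ig-XXXXXXXXXXXXX", "InstanceCount": 4},  # new target size
    ],
}

print(resize_request["InstanceGroups"][0]["InstanceCount"])  # 4
```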
Service Architecture Layer:
Storage
Different file systems are used with your cluster.
Hadoop Distributed File System (HDFS):
A distributed, scalable file system for Hadoop
Multiple copies of data across instances.
Ephemeral, i.e. data will be lost when the cluster is shut down.
You can use S3DistCP to efficiently copy large amounts of data from Amazon S3 into HDFS where subsequent steps in your Amazon EMR cluster can process it
EMR File System (EMRFS):
Extends Hadoop to add the ability to directly access data stored in Amazon S3 as if it were a file system like HDFS
Persistent
Local file system – a block of disk storage pre-attached to each Amazon EC2 instance, called the instance store
Cluster resource management
Manages cluster resources and schedules the jobs for processing data.
By default, Amazon EMR uses YARN (Yet Another Resource Negotiator)
Other frameworks and applications offered in Amazon EMR do not use YARN as a resource manager.
Data processing frameworks
The engine used to process and analyze data.
The main processing frameworks available are:
Hadoop MapReduce
Apache Spark
Applications and programs
Supports many applications such as Hive, Pig, and the Spark Streaming library
Cluster Termination
Three Ways to Shut Down a Cluster
Termination after the last step of execution
a transient cluster that shuts down after all steps are complete.
the cluster starts, runs bootstrap actions, and then runs the steps that you specify. As soon as the last step completes, Amazon EMR terminates the cluster’s Amazon EC2 instances.
Auto-termination (after idle)
auto-termination policy that shuts down after a specified idle time.
You specify the amount of idle time after which the cluster should automatically shut down.
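The idle time is expressed in seconds in the policy document passed as `AutoTerminationPolicy` to the EMR `put_auto_termination_policy` API. A small helper sketch (the hours-to-policy conversion is my own convenience, not part of the API):

```python
# Build the auto-termination policy document; IdleTimeout is in seconds.
def auto_termination_policy(idle_hours):
    return {"IdleTimeout": int(idle_hours * 3600)}

print(auto_termination_policy(2))  # {'IdleTimeout': 7200}
```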
Manual termination
a long-running cluster that continues to run until you terminate it deliberately.
Serverless EMR
A deployment option for Amazon EMR that provides a serverless runtime environment.
You don’t have to configure, optimize, secure, or operate clusters to run applications like Spark, Hive or Presto.
Avoid over- or under-provisioning resources for your data processing jobs.
It automatically determines the resources that the application needs, obtains these resources to process your jobs, and releases the resources when the jobs finish.
You can provide a pre-initialized capacity that keeps workers initialized and ready to respond in seconds.
Add about 10% to your pre-initialized capacity, because Spark adds roughly 10% memory overhead to the memory it requests.
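A quick sizing sketch for that overhead rule — the 30 GB worker size is an illustrative assumption:

```python
# Pad pre-initialized worker memory by ~10% to cover Spark's overhead.
def padded_capacity_gb(spark_memory_gb, overhead=0.10):
    return round(spark_memory_gb * (1 + overhead), 2)

print(padded_capacity_gb(30))  # 33.0 GB covers a worker that asks for 30 GB
```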
Running Jobs:
Create an EMR Serverless application (CLI or EMR Studio)
Submit a job run (CLI or use notebooks that are hosted in EMR Studio to run interactive workloads)
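A sketch of what a Spark job-run submission looks like, using the field names of the `emr-serverless` `start_job_run` API — the application ID, role ARN, and S3 path are placeholders:

```python
# Illustrative EMR Serverless job-run request for a Spark script.
job_run_request = {
    "applicationId": "00xxxxxxxxxxxxxx",  # placeholder application ID
    "executionRoleArn": "arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    "jobDriver": {
        "sparkSubmit": {
            "entryPoint": "s3://example-bucket/scripts/etl.py",  # placeholder script
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
}

print("sparkSubmit" in job_run_request["jobDriver"])  # True
```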
EMR on EKS
A deployment option for Amazon EMR that allows you to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS).
You can run Amazon EMR based applications with other types of applications on the same Amazon EKS cluster.
Fully managed by AWS.
Spark
Components:
Spark Core
Spark Streaming
Spark MLlib
Spark GraphX
Spark SQL
Spark Core:
Performs:
Scheduling, monitoring, and distributing jobs
Fault management
Memory management
Storage interface
Uses RDD (Resilient Distributed Dataset)
Transformations create new RDDs and are lazily evaluated
Actions trigger computation and return results
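The transformation/action split can be illustrated in plain Python (this is not Spark itself): transformations lazily describe a new dataset, and only an action forces the work:

```python
# Plain-Python analogy for RDD laziness using a generator expression.
data = range(1, 6)                 # source dataset, like an input RDD
squared = (x * x for x in data)    # "transformation": nothing computed yet
total = sum(squared)               # "action": evaluation happens here
print(total)  # 55
```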
Spark Streaming:
Processes live data streams in near real time
Input data are broken into micro-batches for processing.
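A toy sketch of that micro-batch idea (plain Python, not the Spark Streaming API): an unbounded input is chopped into small batches that are each processed like a normal job:

```python
# Chop a stream of records into fixed-size micro-batches.
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

batches = list(micro_batches(range(7), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```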
Spark MLlib:
Spark's scalable machine learning library
Spark GraphX:
The engine that can handle and process Graph data.
Spark SQL:
The framework that is used to process structured or semi-structured data.
It allows you to work on different data formats.
MapReduce
An open-source programming model for distributed computing: Map functions transform data into intermediate key-value pairs, and Reduce functions combine them into the final output.
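The classic word-count example shows the pattern in miniature; a pure-Python sketch (map emits (word, 1) pairs, the shuffle groups by key, reduce sums per word):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["the cat", "the dog"])))
print(counts)  # {'the': 2, 'cat': 1, 'dog': 1}
```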
Security
EC2 key pair for SSH
Encryption in transit
Encryption at rest:
EBS volumes
EBS encryption
LUKS (Linux Unified Key Setup) encryption – does not work with the root volume
Open-source HDFS encryption
S3 buckets (by default, EMR uses the EMR File System (EMRFS) to read and write to Amazon S3)
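These at-rest and in-transit options are combined in an EMR security configuration. A sketch following the security-configuration JSON shape — the KMS key ARN is a placeholder:

```python
# Illustrative EMR security configuration: SSE-S3 for S3 data, KMS-based
# local-disk encryption, and in-transit encryption enabled.
security_configuration = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/xxxx",  # placeholder
            },
        },
    }
}

print(security_configuration["EncryptionConfiguration"]["EnableAtRestEncryption"])  # True
```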