{"id":1066,"date":"2024-11-11T04:04:51","date_gmt":"2024-11-11T04:04:51","guid":{"rendered":"https:\/\/192.168.1.3\/wordpress\/?p=1066"},"modified":"2024-12-18T00:29:12","modified_gmt":"2024-12-18T00:29:12","slug":"aws-certified-data-engineer-associate-dea-c01-review-material-amazon-emr","status":"publish","type":"post","link":"https:\/\/mylinuxsite.com\/wordpress\/?p=1066","title":{"rendered":"AWS Certified Data Engineer Associate (DEA-C01) Review Material \u2013 Amazon EMR"},"content":{"rendered":"\n<!--more-->\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Overview<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li>EMR (Elastic MapReduce)<\/li><li>A <strong>managed cluster platform<\/strong> that simplifies running big data frameworks, such as&nbsp;<a rel=\"noreferrer noopener\" href=\"https:\/\/aws.amazon.com\/elasticmapreduce\/details\/hadoop\" target=\"_blank\">Apache Hadoop<\/a>&nbsp;and&nbsp;<a rel=\"noreferrer noopener\" href=\"https:\/\/aws.amazon.com\/elasticmapreduce\/details\/spark\" target=\"_blank\">Apache Spark<\/a>, on AWS to process and analyze vast amounts of data.&nbsp;<\/li><li>It lets you <em>transform<\/em> and <em>move<\/em><strong> large amounts of data <\/strong>into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.<\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Architecture<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li>The central component of Amazon EMR is the&nbsp;<em><span style=\"color:#a31600\" class=\"has-inline-color\">cluster<\/span><\/em>.<\/li><li>A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. <\/li><li>Each instance in the cluster is called a&nbsp;<em><span style=\"color:#a31200\" class=\"has-inline-color\">node<\/span><\/em>. 
Each node has a <em>role<\/em> within the cluster, referred to as the&nbsp;<span style=\"color:#a30d00\" class=\"has-inline-color\"><em>node type<\/em>.<\/span> <\/li><li>Amazon EMR also <em>installs different software components on each node type<\/em>, giving each node a role in a distributed application like Apache Hadoop.<\/li><li>All EMR clusters, including high-availability clusters, <strong>are launched in a single Availability Zone<\/strong>.&nbsp;<\/li><li><strong>Node Types:<\/strong><ul><li><strong>Primary node<\/strong><ul><li>manages the cluster&nbsp;<\/li><li>runs software components to coordinate the distribution of data and tasks among other nodes for processing<\/li><li>tracks the status of tasks and monitors the health of the cluster<\/li><li>it&#8217;s possible to create a single-node cluster with only the primary node<\/li><\/ul><\/li><li><strong>Core node<\/strong><ul><li>has software components that run tasks and store data in the Hadoop Distributed File System (HDFS)<\/li><li>Multi-node clusters have at least one core node.<\/li><li>You can resize the core instance group of a running cluster.<\/li><li>You can add or remove core nodes, but removing them risks losing HDFS data.<\/li><li>Amazon EMR automatically provisions a replacement if a core node fails.<\/li><li>Runs YARN NodeManager daemons, Hadoop MapReduce tasks, and Spark <strong>executors<\/strong>.<\/li><\/ul><\/li><\/ul><ul><li><strong>Task node<\/strong><ul><li>runs software components that only run tasks and do not store data in HDFS. 
<\/li><li>Task nodes are optional.<\/li><li>You can remove task nodes on the fly.<\/li><\/ul><\/li><\/ul><\/li><li><strong>Cluster Scaling:<\/strong><ul><li><strong>Automatic Scaling<\/strong>:<ul><li>Two options:<ol><li><span style=\"color:#a31600\" class=\"has-inline-color\">Custom Scaling<\/span><ul><li>You need to define and manage the automatic scaling policies and rules.<\/li><li>Instance groups only<\/li><li>Based on CloudWatch metrics<\/li><li>Programmatically scale out and scale in core nodes and task nodes based on a CloudWatch metric and other parameters that you specify in a scaling policy<\/li><li>You can define the evaluation periods only in five-minute increments.<\/li><li>You can choose which applications are supported&nbsp;<\/li><\/ul><\/li><li><span style=\"color:#a31600\" class=\"has-inline-color\">Managed Scaling<\/span><ul><li>No policy is required. Amazon EMR manages the automatic scaling activity.<\/li><li>Automatically increases or decreases the number of instances or units in your cluster based on workload.<\/li><li>Instance groups or instance fleets<\/li><li>Only YARN applications are supported, such as Spark, Hadoop, Hive, and Flink.<\/li><\/ul><\/li><\/ol><\/li><\/ul><\/li><li><strong>Manual Scaling<\/strong><ul><li>Add and remove instances manually from core and task instance groups and instance fleets in a running cluster.<\/li><\/ul><\/li><\/ul><\/li><li><strong>Service Architecture Layers:<\/strong><ul><li><strong>Storage<\/strong><ul><li>Different file systems are used with your cluster.&nbsp;<ul><li>Hadoop Distributed File System (<strong>HDFS<\/strong>):<ul><li>A distributed, scalable file system for Hadoop<\/li><li>Stores multiple copies of data across instances.<\/li><li>Ephemeral, i.e. 
data will be lost when the cluster is shut down.<\/li><li>You can use <strong>S3DistCp<\/strong> to efficiently copy large amounts of data from Amazon S3 into <strong>HDFS<\/strong>, where subsequent steps in your Amazon EMR cluster can process it.<\/li><\/ul><\/li><li>EMR File System (<strong>EMRFS<\/strong>):<ul><li>Extends Hadoop to add the ability to directly access data stored in Amazon S3 as if it were a file system like HDFS<\/li><li>Persistent<\/li><\/ul><\/li><li>Local file system &#8211; the locally attached disk of each node; an Amazon EC2 instance comes with a preconfigured block of pre-attached disk storage called an instance store<\/li><\/ul><\/li><\/ul><\/li><li><strong>Cluster resource management<\/strong><ul><li>Manages cluster resources and schedules the jobs for processing data.<\/li><li>By default, Amazon EMR uses YARN (Yet Another Resource Negotiator)<\/li><li>Some frameworks and applications offered in Amazon EMR do not use YARN as a resource manager.<\/li><\/ul><\/li><li><strong>Data processing frameworks<\/strong><ul><li>The engine used to process and analyze data.<\/li><li>The main processing frameworks available are:<ul><li><strong>Hadoop MapReduce<\/strong><\/li><li><strong>Apache Spark<\/strong><\/li><\/ul><\/li><\/ul><\/li><li><strong>Applications and programs<\/strong><ul><li>Supports many applications such as Hive, Pig, and the Spark Streaming library&nbsp;<\/li><\/ul><\/li><\/ul><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Cluster Termination<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>3 Ways to Shut Down a Cluster<\/strong><ol><li><strong>Termination after the last step of execution<\/strong><ul><li>a <em>transient<\/em> cluster that shuts down after all steps are complete.<\/li><li>the cluster starts, runs bootstrap actions, and then runs the steps that you specify. 
As soon as the last step completes, Amazon EMR <em>terminates<\/em> the cluster&#8217;s Amazon EC2 instances.&nbsp;<\/li><\/ul><\/li><li><strong>Auto-termination (after idle)<\/strong>&nbsp;<ul><li>an auto-termination policy shuts the cluster down after a specified idle time.&nbsp;<\/li><li>You specify the amount of idle time after which the cluster should automatically shut down.<\/li><\/ul><\/li><li><strong>Manual termination<\/strong><ul><li>a long-running cluster that continues to run until you terminate it deliberately.<\/li><\/ul><\/li><\/ol><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Serverless EMR<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li>A deployment option for Amazon EMR that provides a serverless runtime environment.<\/li><li>You don\u2019t have to configure, optimize, secure, or operate clusters to run applications&nbsp;like Spark and Hive.<\/li><li>Avoids over- or under-provisioning resources for your data processing jobs.<\/li><li>It automatically determines the resources that the application needs, obtains these resources to process your jobs, and releases the resources when the jobs finish.&nbsp;<\/li><li>You can provide a&nbsp;<em><span style=\"color:#a31600\" class=\"has-inline-color\">pre-initialized capacity <\/span><\/em>that keeps workers initialized and ready to respond in seconds.<ul><li>Size the initial capacity about 10% above what the job requests, because Spark adds a memory overhead (10% by default) to the driver and executors.<\/li><\/ul><\/li><li><strong>Running Jobs:<\/strong><ol><li>Create an EMR Serverless application (CLI or EMR Studio)<\/li><li>Submit a job run (CLI or use notebooks that are hosted in EMR Studio to run interactive workloads)<\/li><\/ol><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>EMR on EKS<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/192.168.1.3\/wordpress\/wp-content\/uploads\/2024\/10\/emr-on-eks-architecture-1024x576.png\" alt=\"\" 
class=\"wp-image-1084\" width=\"723\" height=\"406\" srcset=\"https:\/\/mylinuxsite.com\/wordpress\/wp-content\/uploads\/2024\/10\/emr-on-eks-architecture-1024x576.png 1024w, https:\/\/mylinuxsite.com\/wordpress\/wp-content\/uploads\/2024\/10\/emr-on-eks-architecture-300x169.png 300w, https:\/\/mylinuxsite.com\/wordpress\/wp-content\/uploads\/2024\/10\/emr-on-eks-architecture-768x432.png 768w, https:\/\/mylinuxsite.com\/wordpress\/wp-content\/uploads\/2024\/10\/emr-on-eks-architecture.png 1536w\" sizes=\"auto, (max-width: 723px) 100vw, 723px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\"><li>A deployment option for Amazon EMR that allows you to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS).<\/li><li>You can run Amazon EMR based applications with other types of applications on the same Amazon EKS cluster.&nbsp;<\/li><li>Fully managed by AWS.<\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Spark<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Components:<\/strong><ol><li>Spark Core<\/li><li>Spark Streaming<\/li><li>Spark MLlib<\/li><li>Spark GraphX<\/li><li>Spark SQL<\/li><\/ol><\/li><li><strong>Spark Core:<\/strong><ul><li>Performs:<ul><li>Scheduling, monitoring, and distributing jobs<\/li><li>Fault management<\/li><li>Memory management<\/li><li>Storage interface<\/li><\/ul><\/li><li>Uses <span style=\"color:#a31200\" class=\"has-inline-color\">RDD (Resilient Distributed Dataset)<\/span><ul><li>Transformation operations create a new RDD from an existing one<\/li><li>Action operations compute a result from an RDD<\/li><\/ul><\/li><\/ul><\/li><li><strong>Spark Streaming:<\/strong><ul><li>Processes streaming data in near real time<\/li><li>Input data is split into micro-batches for processing.<\/li><\/ul><\/li><li><strong>Spark MLlib:<\/strong><ul><li>Low-level machine learning library<\/li><\/ul><\/li><li><strong>Spark GraphX:<\/strong><ul><li>The engine that can handle and process graph data.<\/li><\/ul><\/li><li><strong>Spark 
SQL:<\/strong><ul><li>The framework that is used to process structured or semi-structured data.<\/li><li>It allows you to work on different data formats.<\/li><\/ul><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>MapReduce<\/strong><\/h3>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Security<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li>EC2 key pair for SSH<\/li><li>Encryption in transit<\/li><li>Encryption at rest:<ul><li>EBS volumes<ul><li>EBS encryption<\/li><li>LUKS (Linux Unified Key Setup) encryption &#8211; does not work with the root volume<\/li><li>Open-source HDFS encryption<\/li><\/ul><\/li><\/ul><ul><li>S3 bucket (by default, EMR accesses S3 through the EMR File System (EMRFS))<\/li><\/ul><\/li><li>IAM Roles\/Policy:<ul><li>Service Role<\/li><li>EC2 instance profile<\/li><\/ul><\/li><li>Authentication:<ul><li>IAM (temporary security credential)<\/li><li>IAM Identity Center (single sign-on)<\/li><li>Kerberos<\/li><li>LDAP<\/li><\/ul><\/li><li>Authorization:<ul><li>Apache Ranger (RBAC)<\/li><li>IAM Role<\/li><\/ul><\/li><li><a 
href=\"https:\/\/docs.aws.amazon.com\/images\/emr\/latest\/ManagementGuide\/images\/emr-encryption-options.png\">https:\/\/docs.aws.amazon.com\/images\/emr\/latest\/ManagementGuide\/images\/emr-encryption-options.png<\/a><\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[11],"tags":[],"class_list":["post-1066","post","type-post","status-publish","format-standard","hentry","category-aws-review-notes"],"_links":{"self":[{"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/1066","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1066"}],"version-history":[{"count":32,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/1066\/revisions"}],"predecessor-version":[{"id":1394,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/1066\/revisions\/1394"}],"wp:attachment":[{"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1066"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1066"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1066"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}