AWS Certified Data Engineer Associate (DEA-C01) Review Material – Glue

Overview

  • A serverless data integration service primarily used for ETL.
  • Key Components are:
    • Data Catalog – A central metadata repository of your data lake
    • ETL Jobs – Perform processing on the data. Run on-demand, on a schedule or triggered by an event
    • Workflows – An orchestration tool within Glue.
  • Data Catalog:
    • Databases – used to organize metadata tables.
    • Tables – metadata definition that represents the data in a data store.
      • The definitions include:
        • Schema or data structure
        • Data Format, e.g. CSV, Parquet, ORC, JSON, AVRO, XML
        • Datastore, e.g. S3, Kinesis, Kafka
        • Partitions
          • All the following conditions must be true for AWS Glue to create a partitioned table for an Amazon S3 folder:
            • The schemas of the files are similar, as determined by AWS Glue.
            • The data format of the files is the same.
            • The compression format of the files is the same.
        • Partition Index
          • A sublist of the partition keys that are already defined on the table.
          • It speeds up the GetPartitions API call (the API used to identify which partitions to read), because Glue can filter on the indexed keys instead of scanning all partition metadata (see the boto3 sketch below).
      • Tables belong to a Database
      • Schema can be manually created or populated by Crawlers.
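      • A minimal boto3 sketch of adding a partition index and then querying partitions; the database, table, key, and index names are placeholder assumptions, not from this hands-on:
        import boto3

        glue = boto3.client("glue")

        # Index on a subset of the table's partition keys.
        glue.create_partition_index(
            DatabaseName="sales_db",
            TableName="orders",
            PartitionIndex={
                "Keys": ["year", "month"],      # must be existing partition keys
                "IndexName": "year_month_idx",
            },
        )

        # GetPartitions calls that filter on the indexed keys can use the index
        # instead of scanning every partition's metadata.
        resp = glue.get_partitions(
            DatabaseName="sales_db",
            TableName="orders",
            Expression="year = '2024' AND month = '06'",
        )
        print(len(resp["Partitions"]))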
    • Crawlers – populate the AWS Glue Data Catalog with databases and tables
      • The crawler connects to the data store.
      • Can crawl multiple data stores in a single run. 
      • It has built-in and custom classifiers.
      • Some data stores require connection properties for crawler access.
    • Connections – an object that stores login credentials, URI strings, virtual private cloud (VPC) information, and more for a particular data store.
      • can be used for both sources and targets
      • built-in connectors to AWS resources:
        • Aurora
        • Redshift
        • Kafka
        • Amazon DocumentDB
        • Amazon OpenSearch Service
    • The Data Catalog can be used as an external metastore for Hive (Hive allows you to run SQL-like queries on EMR).
  • ETL Jobs:
    • Perform a set of ETL operations on data from various sources.
    • Encapsulates a script that connects to your source data, processes it, and then writes it out to your data target. Typically, a job runs extract, transform, and load (ETL) scripts. 
    • Visual ETL automatically generates the script in Python or Scala, which you can then modify. You can also provide your own Spark or PySpark scripts.
    • You can configure your AWS Glue ETL jobs to run within a VPC when using connectors.
    • Uses a Spark platform under the hood.
    • You are charged hourly based on the number of data processing units (DPUs) used to run your ETL job. A single standard DPU provides 4 vCPU and 16 GB of memory, whereas a high-memory DPU (M-DPU) provides 4 vCPU and 32 GB of memory.
    • You can set the number of workers and the worker type (i.e., workers with more CPU/memory).
    • Run on a schedule or triggered by an event.
    • You can set properties of your tables to enable an AWS Glue ETL job to group files when they are read from an Amazon S3 data store (dynamic frame file grouping; see the sketch below).
    • Dynamic Frame – Glue’s distributed data structure, similar to a Spark DataFrame except that each record is self-describing, so it handles inconsistent or evolving schemas. It can be converted to and from a Spark DataFrame.
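    • A minimal PySpark sketch of reading S3 data into a DynamicFrame with file grouping enabled and converting it to and from a Spark DataFrame; the bucket path, format, and group size are placeholder assumptions:
      from awsglue.context import GlueContext
      from awsglue.dynamicframe import DynamicFrame
      from pyspark.context import SparkContext

      glue_context = GlueContext(SparkContext.getOrCreate())

      # groupFiles/groupSize make Glue read many small files per task.
      dyf = glue_context.create_dynamic_frame.from_options(
          connection_type="s3",
          connection_options={
              "paths": ["s3://my-bucket/raw/"],
              "groupFiles": "inPartition",
              "groupSize": "10485760",   # ~10 MB per group
          },
          format="json",
      )

      df = dyf.toDF()                                       # DynamicFrame -> DataFrame
      dyf2 = DynamicFrame.fromDF(df, glue_context, "dyf2")  # DataFrame -> DynamicFrame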
    • Transformations:
      • Bundled
      • Machine Learning
      • Format Conversion
      • Apache Spark Transformation
    • Execution Class:
      • Standard – ideal for time-sensitive workloads that require fast job startup and dedicated resources.
      • Flex – appropriate for time-insensitive jobs whose start and completion times may vary.
    • ETL scripts can modify the Data Catalog and partition keys:
      • New partitions: pass enableUpdateCatalog and partitionKeys
      • Update table schema: pass enableUpdateCatalog and updateBehavior
      • Create new tables: pass enableUpdateCatalog, updateBehavior, and setCatalogInfo
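      • A minimal sketch of a write that updates the Data Catalog; the database, table, and partition key names are placeholder assumptions:
        from awsglue.context import GlueContext
        from pyspark.context import SparkContext

        glue_context = GlueContext(SparkContext.getOrCreate())

        # Placeholder source table; in a real job this would be the transformed frame.
        frame = glue_context.create_dynamic_frame.from_catalog(
            database="sales_db", table_name="orders_staging")

        glue_context.write_dynamic_frame_from_catalog(
            frame=frame,
            database="sales_db",
            table_name="orders",
            transformation_ctx="write_sink",
            additional_options={
                "enableUpdateCatalog": True,             # update the catalog on write
                "updateBehavior": "UPDATE_IN_DATABASE",  # allow schema updates
                "partitionKeys": ["year", "month"],      # register new partitions
            },
        )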
    • Job bookmarks
      • It helps AWS Glue maintain state information and prevent the reprocessing of old data. By persisting state information from the previous run of an ETL job, AWS Glue tracks data that has already been processed. 
      • The job’s maximum concurrency must be set to 1.
      • The script must end with job.commit().
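      • A minimal job-bookmark sketch; it assumes the job is created with the --job-bookmark-option parameter set to job-bookmark-enable, and the database/table names are placeholders:
        import sys
        from awsglue.context import GlueContext
        from awsglue.job import Job
        from awsglue.utils import getResolvedOptions
        from pyspark.context import SparkContext

        args = getResolvedOptions(sys.argv, ["JOB_NAME"])
        glue_context = GlueContext(SparkContext.getOrCreate())
        job = Job(glue_context)
        job.init(args["JOB_NAME"], args)

        # transformation_ctx is the key the bookmark state is stored under.
        orders = glue_context.create_dynamic_frame.from_catalog(
            database="sales_db",
            table_name="orders",
            transformation_ctx="orders_source",
        )

        # ... transforms and a write would go here ...

        job.commit()   # persists the bookmark state for the next run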
    • Glue Studio – a graphical interface that makes it easy to create, run, and monitor data integration jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on the Apache Spark–based serverless ETL engine in AWS Glue.
      • Data Quality:
        • allows you to measure and monitor the quality of your data so that you can make good business decisions
        • works with the Data Quality Definition Language (DQDL), a domain-specific language that you use to define data quality rules.
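        • A minimal DQDL ruleset sketch (shown as a Python string; the column names follow the hk_buildings table used later in the hands-on). In Glue Studio it would be attached to an Evaluate Data Quality node; DQDL itself is not executed by this snippet:
          ruleset = """
          Rules = [
              RowCount > 0,
              ColumnExists "YearBuild",
              IsComplete "EnglishBuildingName1",
              Completeness "EnglishAddress1" > 0.9
          ]
          """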
    • Can be coded using:
      1. PySpark – has more features and reduced start-up wait times.
      2. Python Shell – for running small to medium-sized generic tasks that are often part of an ETL workflow. Cannot be used with job bookmarks.
      3. Scala
  • Workflows:
    • For orchestrating complex ETL operations.
    • Used to create and visualize complex extract, transform, and load (ETL) activities involving multiple crawlers, jobs, and triggers.
    • It can be triggered by:
      1. Schedule
      2. On-Demand
      3. EventBridge
  • DataBrew
    • A visual data preparation tool that makes it easier for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning (ML).

Security

  • IAM Policies to access Glue.
  • KMS encryption on:
    • Metadata stored in the Data Catalog
    • Connection settings (e.g., passwords)
    • Data written to S3
    • CloudWatch Logs
    • Job bookmarks
    • Data Quality
  • Resource policy
  • SSL on a connection from the client to Glue or from Glue to target/source.
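  • A minimal boto3 sketch of wiring KMS into Glue: a security configuration (S3, CloudWatch Logs, job bookmarks) plus Data Catalog encryption settings; the configuration name and KMS key ARN are placeholder assumptions:
    import boto3

    glue = boto3.client("glue")
    kms_key_arn = "arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555"

    # Used by jobs, crawlers, and dev endpoints that reference this configuration.
    glue.create_security_configuration(
        Name="glue-kms-config",
        EncryptionConfiguration={
            "S3Encryption": [{"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn}],
            "CloudWatchEncryption": {"CloudWatchEncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn},
            "JobBookmarksEncryption": {"JobBookmarksEncryptionMode": "CSE-KMS", "KmsKeyArn": kms_key_arn},
        },
    )

    # Data Catalog metadata and connection passwords are encrypted separately.
    glue.put_data_catalog_encryption_settings(
        DataCatalogEncryptionSettings={
            "EncryptionAtRest": {"CatalogEncryptionMode": "SSE-KMS", "SseAwsKmsKeyId": kms_key_arn},
            "ConnectionPasswordEncryption": {
                "ReturnConnectionPasswordEncrypted": True,
                "AwsKmsKeyId": kms_key_arn,
            },
        }
    )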

Hands-On

Test Data

We will use two (2) sets of test data. The first set will be in XML format and copied to an S3 bucket, and the second set will be a MySQL RDS database.

  1. XML File
    • The data is located at this link: https://data.gov.hk/en-data/dataset/hk-rvd-tsinfo_rvd-names-of-buildings. The data is a list of building names in Hong Kong.
    • Download the files named (1) Names of Buildings (Volume 1) Hong Kong Island and Kowloon, and (2) Names of Buildings (Volume 2) The New Territories
    • The content of the file has no line breaks. In short, all the records appear as one long line. Because of this, the crawler will not be able to identify the structure of the file, so you need to pretty-print the document. I used xmllint for that:
      • $ xmllint -format <the_ugly_file.xml> > <the_pretty_file.xml>
    • Copy the files to an S3 bucket.
  2. MySQL RDS
    • TODO
Crawlers
  • Create a database
    • Note: The table will be created by the crawler in a later step. Do not create any table manually.
XML Crawlers
  • Create the XML crawler:
  • Create a custom classifier for the XML crawler. Use ‘Record’ as the ‘Row Tag’.
  • Run the crawler to generate the table and its schema:
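  • If you prefer to script these steps, here is a hedged boto3 sketch of the classifier and crawler; the names, IAM role, bucket, and database are placeholder assumptions:
    import boto3

    glue = boto3.client("glue")

    # Custom XML classifier with 'Record' as the row tag.
    glue.create_classifier(
        XMLClassifier={"Name": "hk-buildings-xml", "Classification": "xml", "RowTag": "Record"}
    )

    # Crawler that uses the classifier and writes tables into the Glue database.
    glue.create_crawler(
        Name="hk-buildings-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="hk_buildings_db",
        Targets={"S3Targets": [{"Path": "s3://my-hk-buildings-bucket/incoming/"}]},
        Classifiers=["hk-buildings-xml"],
    )

    glue.start_crawler(Name="hk-buildings-crawler")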
ETL Jobs
S3 to Redshift

We will use an ETL job to load the data from S3 to Redshift, removing some of the columns along the way. For this hands-on, we will use Redshift Serverless because it has a free tier.

  • Create the Redshift Serverless resources
    • If this is the first time your account uses Redshift Serverless, the dashboard will prompt you for the information needed to set up a Namespace and a Workgroup.
    • If Redshift Serverless is already activated, create a Namespace and a Workgroup for this hands-on:
      • Namespace:
        • Note the ‘Admin user name’ and the ‘Database name’.
      • Workgroup:
        • Note: Your Endpoint, JDBC URL, and ODBC URL should point to the correct database.
        • Note: Place your workgroup in the right VPC with the correct Security Group.
        • Note: Choose the lowest RPU capacity available to reduce cost.
    • Once you have set up your Namespace and Workgroup, connect to your database and create a table using the ‘admin’ user. If you forget its password, go to your Namespace and, under ‘Actions’, choose ‘Edit admin credentials’.
      • Create a table named ‘hk_buildings’ with the following columns:
        • CREATE TABLE IF NOT EXISTS public.hk_buildings (
            EnglishAddress1 VARCHAR(512),
            EnglishAddress2 VARCHAR(512),
            EnglishAddress3 VARCHAR(512),
            EnglishBuildingName1 VARCHAR(512),
            EnglishBuildingName2 VARCHAR(512),
            EnglishBuildingName3 VARCHAR(512),
            EnglishPublicHousingType VARCHAR(512),
            OwnersCorporation VARCHAR(512),
            YearBuild VARCHAR(512)
          );
  • Create a new AWS Glue Connection:
    • Note: Use the admin username and password to connect. Also, make sure that you are in the right Subnet/VPC and that your Security Group allows outgoing connections to Redshift.
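    • The same connection can be created from code; a hedged boto3 sketch, where the JDBC URL, credentials, subnet, security group, and AZ are placeholders you must replace with your own values:
      import boto3

      glue = boto3.client("glue")

      glue.create_connection(
          ConnectionInput={
              "Name": "redshift-hk-buildings",
              "ConnectionType": "JDBC",
              "ConnectionProperties": {
                  "JDBC_CONNECTION_URL": "jdbc:redshift://<workgroup-endpoint>:5439/dev",
                  "USERNAME": "admin",
                  "PASSWORD": "<admin-password>",
              },
              "PhysicalConnectionRequirements": {
                  "SubnetId": "subnet-0123456789abcdef0",
                  "SecurityGroupIdList": ["sg-0123456789abcdef0"],
                  "AvailabilityZone": "us-east-1a",
              },
          }
      )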
  • Create the ETL job using the Visual ETL tool (Note: DO NOT enable the Data Preview, as this will start an Interactive Session, for which you will be charged)
    • Your job will have four nodes:
      1. AWS Glue Data Catalog – This will be your Data Source.
      2. Change Schema – You need this transformation node to change the field names to CamelCase. The next transformation node (Drop Fields) is case-sensitive, so it won’t recognize the fields if you don’t rename them.
      3. Drop Fields – This transformation node will drop fields that contain Chinese characters.
      4. Amazon Redshift – This will write the transformed records to Amazon Redshift.
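    • For reference, a hedged PySpark sketch of roughly what the generated script does; the catalog database/table, field names, connection name, and S3 temp path are placeholder assumptions:
      import sys
      from awsglue.context import GlueContext
      from awsglue.job import Job
      from awsglue.transforms import ApplyMapping, DropFields
      from awsglue.utils import getResolvedOptions
      from pyspark.context import SparkContext

      args = getResolvedOptions(sys.argv, ["JOB_NAME"])
      glue_context = GlueContext(SparkContext.getOrCreate())
      job = Job(glue_context)
      job.init(args["JOB_NAME"], args)

      # 1. Data source: the table created by the crawler.
      src = glue_context.create_dynamic_frame.from_catalog(
          database="hk_buildings_db", table_name="the_pretty_file_xml")

      # 2. Change Schema: rename fields to the CamelCase names Drop Fields expects.
      renamed = ApplyMapping.apply(frame=src, mappings=[
          ("englishaddress1", "string", "EnglishAddress1", "string"),
          ("englishbuildingname1", "string", "EnglishBuildingName1", "string"),
          # ... remaining columns ...
      ])

      # 3. Drop Fields: remove the Chinese-character columns (illustrative names).
      cleaned = DropFields.apply(frame=renamed, paths=["ChineseAddress1", "ChineseBuildingName1"])

      # 4. Amazon Redshift sink, using the Glue connection and an S3 staging directory.
      glue_context.write_dynamic_frame.from_jdbc_conf(
          frame=cleaned,
          catalog_connection="redshift-hk-buildings",
          connection_options={"dbtable": "public.hk_buildings", "database": "dev"},
          redshift_tmp_dir="s3://my-hk-buildings-bucket/temp/",
      )

      job.commit()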
  • Run the Job:
    • Launch the Amazon Redshift Query Editor to check whether the records were loaded into your table:
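    • You can also start and poll the job from code; a minimal boto3 sketch with a placeholder job name:
      import boto3

      glue = boto3.client("glue")

      run = glue.start_job_run(JobName="s3-to-redshift-hk-buildings")
      status = glue.get_job_run(JobName="s3-to-redshift-hk-buildings", RunId=run["JobRunId"])
      print(status["JobRun"]["JobRunState"])   # e.g. RUNNING, SUCCEEDED, FAILED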
Workflow

We will create a workflow triggered by a new file uploaded to the S3 bucket. The workflow will first execute the crawler, followed by the ETL job we created earlier in this hands-on.

  • Create a workflow:
    • Start with an EVENT trigger fired by the EventBridge rule, followed by the crawler, and then the ETL job.
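    • A hedged boto3 sketch of the same wiring (workflow, an EVENT start trigger for the crawler, and a conditional trigger that runs the ETL job after the crawler succeeds); all names are placeholder assumptions:
      import boto3

      glue = boto3.client("glue")

      glue.create_workflow(Name="hk-buildings-workflow")

      # EVENT start trigger: fires when the EventBridge rule (next step) notifies Glue.
      glue.create_trigger(
          Name="start-on-s3-event",
          WorkflowName="hk-buildings-workflow",
          Type="EVENT",
          EventBatchingCondition={"BatchSize": 1},
          Actions=[{"CrawlerName": "hk-buildings-crawler"}],
      )

      # Run the ETL job once the crawler has succeeded.
      glue.create_trigger(
          Name="run-etl-after-crawl",
          WorkflowName="hk-buildings-workflow",
          Type="CONDITIONAL",
          StartOnCreation=True,
          Predicate={"Conditions": [{
              "LogicalOperator": "EQUALS",
              "CrawlerName": "hk-buildings-crawler",
              "CrawlState": "SUCCEEDED",
          }]},
          Actions=[{"JobName": "s3-to-redshift-hk-buildings"}],
      )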
  • Create an EventBridge rule:
    • The rule must match the creation of an object in the S3 folder. Note: the bucket must have Amazon EventBridge notifications enabled so that object-created events are emitted:
    • Select the Glue workflow created above as the rule’s target, so that the workflow’s EVENT trigger starts the crawler and then the ETL job:
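    • A hedged boto3 sketch of the rule and target; the bucket, prefix, workflow ARN, and role ARN are placeholder assumptions (the role needs permission to call glue:NotifyEvent on the workflow):
      import boto3, json

      events = boto3.client("events")

      # Matches S3 "Object Created" events (the bucket must send events to EventBridge).
      pattern = {
          "source": ["aws.s3"],
          "detail-type": ["Object Created"],
          "detail": {
              "bucket": {"name": ["my-hk-buildings-bucket"]},
              "object": {"key": [{"prefix": "incoming/"}]},
          },
      }
      events.put_rule(
          Name="hk-buildings-object-created",
          EventPattern=json.dumps(pattern),
          State="ENABLED",
      )

      # Target the Glue workflow so its EVENT start trigger fires.
      events.put_targets(
          Rule="hk-buildings-object-created",
          Targets=[{
              "Id": "glue-workflow",
              "Arn": "arn:aws:glue:us-east-1:123456789012:workflow/hk-buildings-workflow",
              "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeGlueWorkflow",
          }],
      )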
  • Trigger the EventBridge rule:
    • Upload one of the files to the S3 bucket (note: clear the folder first so that only the newly uploaded file is processed)
    • After uploading the file, the workflow will be triggered:
    • Check whether the data was loaded into the Redshift table:
