AWS Certified Data Engineer Associate (DEA-C01) Review Material – Athena

Overview

  • An interactive query service that makes it easy to analyze data directly in S3 using standard SQL.
  • It can also be used to interactively run data analytics using Apache Spark without having to plan for, configure, or manage resources. 
  • Serverless and scales automatically
  • Support semi-structured and structured data, e.g. CSV, JSON, Parquet, ORC, TSV, Avro
  • To convert your existing raw data from other storage formats to Parquet or ORC, you can run CREATE TABLE AS SELECT (CTAS) queries in Athena
  • Can read compressed files (e.g. Snappy, GZIP, Zlib)
  • Integrates with AWS Glue Data Catalog
  • Uses Presto under the hood
  • Can use ODBC/JDBC to connect to Athena
  • Pay-as-you-go
  • Use an S3 bucket (query result location ) to store query results and query execution result metadata for each query
  • Performs better on:
    • columnar data format e.g. ORC or Parquet
    • on large files
    • partitioned data
  • Supports ACID transactions.
  • You can use Query Result Reuse to reduce cost, where the source data doesn’t change frequently. Query Result Reuse takes advantage of the fact that Athena writes query results to Amazon S3 as a CSV file.
  • UNLOAD command:
    • Writes query results from a SELECT statement to the specified data format. Supported formats includes Apache Parquet, ORC, Apache Avro, and JSON
  • Federated Passthrough Queries
    • Run queries on federated data sources using the query language of the data source itself and push the full query down to the data source for execution
    • Supported connectors include: DynamoDB, Oracle, PostgreSQL, OpenSearch

Partitioning means organizing data into directories (or “prefixes”) on Amazon S3 based on a particular property of the data. Such properties are called partition keys. A common partition key is the date or some other unit of time such as the year or month. However, a dataset can be partitioned by more than one key. For example, data about product sales might be partitioned by date, product category, and market. Good candidates for partition keys are properties that are always or frequently used in queries and have low cardinality.

Bucketing is a way to organize the records of a dataset into categories called buckets. Good candidates for bucketing occur when you have columns that have high cardinality (that is, have many distinct values), are uniformly distributed, and that you frequently query for specific values.

Workgroups

  • Use workgroups to separate workloads, control team access, enforce configuration, and track query metrics and control costs.
  • Each workgroup has its own query history and a list of saved queries for the queries in that workgroup
  • Can choose two (2) types of analytical engine:
    1. Athena SQL
      • good for running ad-hoc queries
    2. Apache Spark
      • good for running Jupyter notebook applications using Python and Apache Spark
  • Separate your workloads
    • example, you can create two independent workgroups, one for automated scheduled applications, such as report generation, and another for ad-hoc usage by analysts.
  • Control access by teams
    • you can use resource-level identity-based policies to control who can access a workgroup and run queries in it. 
    • you can isolate team by creating a workgroup for each team.
    • use IAM policies to control access to workgroup
  • Enforce configuration
    • track query metrics, query events.
    • set data limits on the query per query per workgroup
    • set capacity
  • Authentication:
    • IAM
    • IAM Identity Center
  • Provisioned Capacity: (Only on Athena Engine):
    • You can use capacity reservations to get dedicated processing capacity for your Athena queries.
    • To get processing capacity for your queries, you create a capacity reservation, specify the number of Data Processing Units (DPUs) you require, and assign one or more workgroups to the reservation.
    • With provisioned capacity, you provision a dedicated set of compute resources to run your queries. 
  • Two (2) types of cost controls:
    1. per-query limit
      • specifies the total amount of data scanned per query. If any query that runs in the workgroup exceeds the limit, it is canceled.
    2. per-workgroup limit
      • specifies the total amount of data scanned for all queries that run in this workgroup during the specified time period.
      • If the aggregate amount of data scanned exceeds the threshold, alarm is sent via SNS.

Tuning

  • https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
  • https://www.youtube.com/watch?v=iUD5pPpcyZk
  • Specify the larger table on the left side of the join and the smaller table on the right side for equality joins.
    • For the most common type of joins that use equality conditions, Athena builds a lookup table from the table on the right and distributes it to the worker nodes. It then streams the table on the left, joining rows by looking up matching values in the lookup table. This is called a distributed hash join. Because the lookup table built from the table on the right side is kept in memory, the smaller that table is, the less memory will be used, and the faster the join will run.

Leave a Comment

Your email address will not be published. Required fields are marked *