AWS Solution Architect Associate (SAA-C02) Review Material – Other Data/Database Services

DynamoDB

  • General
    • Low latency NoSQL database
    • Supports document (JSON, XML, HTML) or key-value data models
    • Supports transactions
    • Serverless, fully managed and replicates across AZs
    • Can provide Eventual, Strong or Transactional consistency model
    • Single-digit millisecond latency
    • Data is queried through keys
    • Use IAM for authentication
    • Maximum item size is 400KB
  • Provisioned Throughput (Read/Write Capacities)
    • How much data can be read/written to a table
    • Transactional consistency requires 2x the capacity of the strongly consistent model.
    • RCU/WCU are spread across partitions.
    • Capacity Units:
      • 1 WCU (Write Capacity Unit) = 1KB write/sec
        • e.g. Need to write 5 items in 1 sec with 4KB per item
          • 5 x 4 = 20KB / (1 KB write/sec) = 20 WCU required, or 40 WCU (if transactional)
        • e.g. Need to write 2 items in 1 sec with 2.5KB per item
          • 2 x 3 (round up to the next KB) = 6KB / (1 KB write/sec) = 6 WCU required
      • 1 RCU (Read Capacity Unit) = one strongly consistent read of up to 4KB/sec, or two eventually consistent reads (eventual consistency costs half of strong)
        • Trick:
          • Think in terms of strongly consistent reads, i.e. one Strongly Consistent RCU covers 4KB
          • Think how many you need per item: treat each Strongly Consistent RCU as a box that can hold 4KB
          • Round the item size up to the nearest 4KB
        • e.g. 10 strong read/sec with size 4KB per item
          • (4/4 = 1 box) x 10 = 40KB / (4 KB read/sec) = 10 RCU
        • e.g. 16 eventual reads/sec with a size of 12KB per item
          • (12/4 = 3 boxes) x 16 = 48 strong RCU, /2 (since this is eventual) = 24 RCU (96 RCU if transactional, i.e. 2x strong)
        • e.g. 12 strong read/sec with size 10 KB per item
          • (10/4 round up) x 12 = 36 RCU
          • Note:
            • You need to read 10KB per item. So you need 3 boxes of 4 KB i.e. (4 + 4 +4 = 12).
            • But you have 12 items to read
            • So 12 x 3 = 36
        • e.g. 10 eventual read/sec with a size 13 KB per item
          • (13/4 round up = 4 boxes) x 10 = 40 / 2 = 20 RCU
          • Note:
            • You need to read 13KB per item. So you need 4 boxes of 4 KB i.e. ( 4 x 4 = 16 KB to store 13KB)
            • But you have 10 items to read
            • So 10 x 4 = 40 RCU
            • But this is eventual so you only need half i.e. 40/2 = 20 RCU
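A minimal Python sketch of the capacity arithmetic from the examples above (the function names are illustrative, not an AWS API):

```python
import math

def wcu(items_per_sec: int, item_kb: float, transactional: bool = False) -> int:
    """1 WCU = one 1 KB write/sec; each item rounds up to the next whole KB."""
    units = items_per_sec * math.ceil(item_kb)
    return units * 2 if transactional else units

def rcu(items_per_sec: int, item_kb: float, eventual: bool = False) -> int:
    """1 RCU = one strongly consistent 4 KB read/sec; eventual reads cost half."""
    units = items_per_sec * math.ceil(item_kb / 4)
    return math.ceil(units / 2) if eventual else units

print(wcu(5, 4))                    # 20 (40 if transactional)
print(wcu(2, 2.5))                  # 6
print(rcu(10, 4))                   # 10
print(rcu(16, 12, eventual=True))   # 24
print(rcu(12, 10))                  # 36
print(rcu(10, 13, eventual=True))   # 20
```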
  • Capacity Modes:
    • Provisioned
      • Need to provision ahead of time the WCU and RCU
      • Pay based on the provisioned WCU and RCU
      • Can enable auto-scaling.
    • On-Demand
      • Scale up or down based on the workload
      • Pay per request model (e.g. unknown workload, spiky load)
      • More expensive
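A hedged boto3 sketch showing how the two capacity modes are selected at table creation; the table and key names are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Provisioned mode: RCU/WCU are declared up front and billed whether used or not.
dynamodb.create_table(
    TableName="Orders",  # hypothetical table
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "order_id", "AttributeType": "S"}],
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},
)

# On-demand mode: no capacity to provision; billed per request.
# Pass BillingMode="PAY_PER_REQUEST" and omit ProvisionedThroughput.
```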
  • DAX
    • Write-through cache: data is written to both DAX and DynamoDB
    • Microsecond latency reads
    • Reads are eventually consistent. Not suitable if you require strong consistency.
    • 5-minute default TTL; after the TTL expires, the next read goes back to the DB.
    • Not suitable for write-intensive operations
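A hedged sketch using the amazon-dax-client Python package; the cluster endpoint and table are hypothetical. The DAX client mirrors the low-level DynamoDB API, so cached reads are transparent to the application:

```python
from amazondax import AmazonDaxClient

# Endpoint is hypothetical; obtain the real one from the DAX cluster console.
dax = AmazonDaxClient(
    endpoint_url="dax://my-cluster.abc123.dax-clusters.us-east-1.amazonaws.com"
)

# Served from the item cache if present (5-minute default TTL); otherwise
# DAX reads through to DynamoDB and caches the result. Eventually consistent.
resp = dax.get_item(
    TableName="Orders",
    Key={"order_id": {"S": "42"}},
)
print(resp.get("Item"))
```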
  • Keys and Indices:
    • Two (2) types of Primary key:
      1. Partition Key
      2. Composite Key (Partition Key + Sort Key)
    • Indices:
      • Local Secondary Index
        • Can be created only when the table is created; cannot be added later
        • Uses the same Partition Key but a different Sort Key
      • Global Secondary Index
        • Can be created anytime
        • Can use a different Partition Key or Sort Key
        • Has its own RCU/WCU. But if the writes are throttled, the write to the main table is also throttled.
        • Only supports eventual consistency
    • A hot partition can cause throttling if the partition limits of 3000 RCU or 1000 WCU (or a combination of both) per second are exceeded.
  • DynamoDB Streams
    • Time-ordered sequence of item-level changes (a stream)
    • Records CRUD operations in the stream
    • Stored in a log for 24 hours
    • Mainly used to trigger events (e.g. trigger Lambda; see the sketch below)
    • Has a separate endpoint
    • Can store item images before and after the change
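A minimal sketch of a Lambda handler attached to a DynamoDB stream; the record layout (eventName, OldImage/NewImage) follows the documented stream event format, and which images appear depends on the stream view type:

```python
def handler(event, context):
    for record in event["Records"]:
        operation = record["eventName"]  # INSERT | MODIFY | REMOVE
        images = record["dynamodb"]
        if operation == "INSERT":
            print("created:", images.get("NewImage"))
        elif operation == "MODIFY":
            # Both images are available with the NEW_AND_OLD_IMAGES view type
            print("changed:", images.get("OldImage"), "->", images.get("NewImage"))
        elif operation == "REMOVE":
            print("deleted:", images.get("OldImage"))
```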
  • Global Table
    • Multi-way replication across regions
    • All copies are Active i.e. application can read or write on any region
    • Requires DynamoDB Streams
  • TTL
    • Defines expiry time of the data
    • Once passed expiry data is marked for deletion
    • Items are typically deleted within 48 hours of expiration
    • Good for removing old or irrelevant data
    • Help reduce the storage requirement (and cost)
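Enabling TTL is a one-call boto3 operation; the table and attribute names here are hypothetical, and the attribute must hold a Unix epoch timestamp in seconds:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Items whose "expires_at" timestamp has passed are marked for deletion
# and removed in the background (typically within 48 hours).
dynamodb.update_time_to_live(
    TableName="Sessions",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)
```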
  • API
    • Items:
      • PutItem — Creates a new item, or replaces an old item with a new item
      • GetItem — Returns a set of attributes for the item with the given primary key
      • UpdateItem — Edits an existing item’s attributes, or adds a new item to the table if it does not already exist.
      • DeleteItem — Deletes a single item in a table by primary key. 
      • BatchGetItem — Read up to 100 items from one or more tables.
      • BatchWriteItem — Create or delete up to 25 items in one or more tables.
      • Projection Expression is a string that identifies the attributes that you want (SELECT <projection expression – list of columns> from ..)
    • Query (Collections)
      • The Query operation in Amazon DynamoDB finds items based on primary key values.
      • Has Filter Expression – determines which items within the Query results should be returned. (SELECT <projection expression – list of columns> from X where <Filter Expression>...)
      • Can Limit the number of items that it reads.
        • Returns LastEvaluatedKey
      • Has Pagination – Query results are divided into “pages” of data that are 1 MB in size (or less)
    • Scans
      • Scan operation in Amazon DynamoDB reads every item in a table or a secondary index.
      • Can use ProjectionExpression to limit the attribute
      • Has Filter Expression (see Query)
      • Has Limit
        • Returns LastEvaluatedKey
      • Has Pagination
        • Returns NextToken if --max-items is used (AWS CLI)
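A hedged boto3 sketch tying the Query pieces together: key condition, filter and projection expressions, and following LastEvaluatedKey across 1 MB pages. Table and attribute names are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")

kwargs = {
    "TableName": "Orders",
    "KeyConditionExpression": "customer_id = :c",  # primary key lookup
    "FilterExpression": "amount > :min",           # applied after the read
    "ProjectionExpression": "order_id, amount",    # only these attributes
    "ExpressionAttributeValues": {
        ":c": {"S": "cust-1"},
        ":min": {"N": "100"},
    },
}

while True:
    page = dynamodb.query(**kwargs)
    for item in page["Items"]:
        print(item)
    if "LastEvaluatedKey" not in page:  # no more pages
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```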

ElastiCache

  • Managed Redis/Memcached
  • In-memory key/value store
  • Sub-millisecond latency
  • Supports clustering and Multi-AZ
  • Sharding
    • Also known as partitioning: splitting the data up by key
    • Useful for increasing performance by reducing the hit and memory load on any one resource
  • Replication
    • Also known as mirroring: copying all the data to every replica
    • Useful for achieving high availability of reads
  • Cluster Mode Disabled
    • Has a single shard, inside of which is a collection of Redis nodes; one primary read/write node and up to five secondary, read-only replica nodes.
    • Each read replica maintains a copy of the data from the cluster’s primary node.
    • Asynchronous replication mechanisms are used to keep the read replicas synchronized with the primary.
    • Applications can read from any node in the cluster.
    • Applications can write only to the primary node. Read replicas improve read throughput and guard against data loss in cases of a node failure.
    • Cannot convert to Cluster Mode
  • Cluster Mode Enabled
    •  1 to 500 shards 
    • Each shard has a primary node and up to five read-only replica nodes. 
    • You cannot manually promote any of the replica nodes to primary.
    • You can only change the structure of a cluster, the node type, and the number of nodes by restoring from a backup. 
    • Multi-AZ is required.
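A hedged sketch using the redis-py package against a cluster-mode-disabled replication group; both endpoints are hypothetical. Writes must target the primary endpoint while reads can use the reader endpoint:

```python
import redis

# Hypothetical primary and reader endpoints from the ElastiCache console
primary = redis.Redis(host="my-cache.abc123.ng.0001.use1.cache.amazonaws.com", port=6379)
replica = redis.Redis(host="my-cache-ro.abc123.ng.0001.use1.cache.amazonaws.com", port=6379)

primary.set("session:42", "payload", ex=300)  # write to the primary, 5-minute expiry
print(replica.get("session:42"))  # read from a replica; async replication may lag briefly
```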

Redshift

  • For Data Warehousing, Analytics and BI (OLAP)
  • Uses PostgreSQL behind the scenes.
  • Data can be loaded via:
    1. Kinesis Data Firehose
    2. S3 copy
    3. An application using JDBC
  • Uses columnar storage and columnar compression.
  • MPP (Massively Parallel Processing) query execution
  • Can have up to 128 nodes
  • Backup is enabled by default with retention of up to 35 days. It tries to keep 3 copies of data (original, replica, and S3)
  • Only runs on 1 AZ.
  • Encrypted at rest and uses SSL for in-flight
  • For DR need to take incremental snapshots and store them in S3.
  • Can configure snapshot to copy to another region
  • Redshift Spectrum – query data in S3 without loading it into Redshift
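A hedged sketch of an S3 COPY load using the Redshift Data API; the cluster, database, IAM role, and bucket are all hypothetical:

```python
import boto3

client = boto3.client("redshift-data")

copy_sql = """
    COPY sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV;
"""

# Runs asynchronously; track progress with describe_statement if needed.
client.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
```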

Glue

  • Managed ETL service
  • Serverless
  • A Data Crawler automates the discovery of your data schema. The discovered schema can be stored in the Glue Data Catalog, which is used when authoring your ETL jobs.
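A hedged boto3 sketch: point a crawler at a (hypothetical) S3 prefix and publish the discovered schema into a Glue Data Catalog database:

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-crawler",  # hypothetical names throughout
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")  # tables appear in the catalog when it finishes
```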

Neptune

  • Fully managed Graph Database (like Neo4J)
  • HA across 3 AZs and supports clustering
  • Has IAM authentication

Athena

  • An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
  • Serverless
  • No need to perform ETL to analyze data
  • The data format can be CSV, JSON, ORC, Apache Parquet and Avro
  • Uses the Presto engine
  • Output to S3 so need to have S3 security
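A hedged boto3 sketch of the Athena flow: submit standard SQL against data in S3, poll for completion, then read the results (which also land in the S3 output location). Database, table, and bucket names are hypothetical:

```python
import time
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "web_logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = resp["QueryExecutionId"]

# Poll until the query leaves the QUEUED/RUNNING states
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows)
```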

OpenSearch (previously Amazon Elasticsearch Service)

  • A distributed, open-source search and analytics suite.
  • Based on Apache Lucene
  • OpenSearch and OpenSearch Dashboards were originally derived from Elasticsearch 7.10.2 and Kibana 7.10.2

EMR (Elastic Map Reduce)

  • Big data platform for data processing
  • Helps create a Hadoop cluster with hundreds of EC2 instances.
  • Deploy workloads to EMR using Amazon EC2, Amazon Elastic Kubernetes Service (EKS), or on-premises AWS Outposts.
  • Has auto-scaling and integrates with Spot Instances
  • Uses open-source frameworks such as Apache Spark, Apache Hive, and Presto.

DMS (Data Migration Service)

  • Can perform homogeneous (same DB type) and heterogeneous (different DB type) migrations
  • Source and destination can be in AWS or on-prem.
  • Requires a replication instance (EC2) to run the migration task
  • Can create a task that captures ongoing changes after you complete your initial (full-load) migration to a supported target data store (CDC – Change Data Capture)
  • Can use SCT (Schema Conversion Tool) if the source and destination DB are of different engines. SCT is a separate program
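A hedged boto3 sketch of a full-load-plus-CDC task; all ARNs are hypothetical, and the endpoints and replication instance must already exist:

```python
import json
import boto3

dms = boto3.client("dms")

dms.create_replication_task(
    ReplicationTaskIdentifier="orders-migration",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SRC",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INST",
    MigrationType="full-load-and-cdc",  # initial full load, then ongoing changes (CDC)
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "all-tables",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```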

AWS DataSync

  • Automates and accelerates moving data between on-premises and AWS storage service
    • Uses NFS/SMB, the S3 API, or HDFS via an agent running on a VM, Snowcone, or S3 on Outposts to move data
  • Transfer data between AWS Storage services so you can replicate, archive, or share application data easily.
  • Synchronization is scheduled (e.g. hourly, daily, weekly)
