AWS Solution Architect Associate (SAA-C02) Review Material – Other Data/Database Services

DynamoDB

  • General
    • Low latency NoSQL database
    • Supports document (JSON, XML, HTML) or key-value data models
    • Supports transactions
    • Serverless, fully managed and replicates across AZs
    • Can provide Eventual, Strong or Transactional consistency model
    • Single-digit millisecond latency
    • Data is queried through keys
    • Use IAM for authentication
    • Maximum item size is 400KB
  • Provisioned Throughput (Read/Write Capacities)
    • How much data can be read/written to a table
    • Transactional consistency requires 2x the capacity of the strongly consistent model.
    • RCU/WCU are spread across partitions.
    • Capacity Units:
      • 1 WCU (Write Capacity Unit) = 1KB write/sec
        • e.g. Need to write 5 items in 1 sec with 4KB per item
          • 5 x 4 = 20KB / (1 KB write/sec) = 20 WCU required, or 40 WCU (if transactional)
        • e.g. Need to write 2 items in 1 sec with 2.5KB per item
          • 2 x 3 (round up to the next KB) = 6KB / (1 KB write/sec) = 6 WCU required
      • 1 RCU (Read Capacity Unit) = one strongly consistent read of up to 4KB/sec, or two eventually consistent reads (eventual consistency costs half of strong)
        • Trick:
          • Think in terms of strongly consistent reads, i.e. one Strongly Consistent RCU covers 4KB
          • Think how many you need per item: treat each Strongly Consistent RCU as a box that can hold 4KB
          • Round the item size up to the nearest 4KB
        • e.g. 10 strong read/sec with size 4KB per item
          • (4/4 = 1 box) x 10 = 40KB / (4 KB read/sec) = 10 RCU
        • e.g. 16 eventual reads/sec with a size of 12KB per item
          • (12/4 = 3 boxes) x 16 = 48 strong RCU, /2 (since this is eventual) = 24 RCU (96 RCU if transactional, i.e. 2x strong)
        • e.g. 12 strong read/sec with size 10 KB per item
          • (10/4 round up) x 12 = 36 RCU
          • Note:
            • You need to read 10KB per item. So you need 3 boxes of 4 KB i.e. (4 + 4 +4 = 12).
            • But you have 12 items to read
            • So 12 x 3 = 36
        • e.g. 10 eventual read/sec with a size 13 KB per item
          • (13/4 round up = 4 boxes) x 10 = 40 / 2 = 20 RCU
          • Note:
            • You need to read 13KB per item. So you need 4 boxes of 4 KB i.e. ( 4 x 4 = 16 KB to store 13KB)
            • But you have 10 items to read
            • So 10 x 4 = 40 RCU
            • But this is eventual so you only need half i.e. 40/2 = 20 RCU
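A minimal Python sketch of the capacity arithmetic from the examples above (the function names are illustrative, not an AWS API):

```python
import math

def wcu(items_per_sec: int, item_kb: float, transactional: bool = False) -> int:
    """1 WCU = one 1 KB write/sec; each item rounds up to the next whole KB."""
    units = items_per_sec * math.ceil(item_kb)
    return units * 2 if transactional else units

def rcu(items_per_sec: int, item_kb: float, eventual: bool = False) -> int:
    """1 RCU = one strongly consistent 4 KB read/sec; eventual reads cost half."""
    units = items_per_sec * math.ceil(item_kb / 4)
    return math.ceil(units / 2) if eventual else units

print(wcu(5, 4))                    # 20 (40 if transactional)
print(wcu(2, 2.5))                  # 6
print(rcu(10, 4))                   # 10
print(rcu(16, 12, eventual=True))   # 24
print(rcu(12, 10))                  # 36
print(rcu(10, 13, eventual=True))   # 20
```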
  • Capacity Modes:
    • Provisioned
      • Need to provision ahead of time the WCU and RCU
      • Pay based on the provisioned WCU and RCU
      • Can enable auto-scaling.
    • On-Demand
      • Scale up or down based on the workload
      • Pay per request model (e.g. unknown workload, spiky load)
      • More expensive
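A hedged boto3 sketch showing how the two capacity modes are selected at table creation; the table and key names are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Provisioned mode: RCU/WCU are declared up front and billed whether used or not.
dynamodb.create_table(
    TableName="Orders",  # hypothetical table
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "order_id", "AttributeType": "S"}],
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},
)

# On-demand mode: no capacity to provision; billed per request.
# Pass BillingMode="PAY_PER_REQUEST" and omit ProvisionedThroughput.
```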
  • DAX
    • Write-through cache: data is written to both DAX and DynamoDB
    • Microsecond latency reads
    • Reads are eventually consistent. Not suitable if you require strong consistency.
    • 5-minute default TTL; after the TTL expires, the next read goes back to the DB.
    • Not suitable for write-intensive operations
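A hedged sketch using the amazon-dax-client Python package; the cluster endpoint and table are hypothetical. The DAX client mirrors the low-level DynamoDB API, so cached reads are transparent to the application:

```python
from amazondax import AmazonDaxClient

# Endpoint is hypothetical; obtain the real one from the DAX cluster console.
dax = AmazonDaxClient(
    endpoint_url="dax://my-cluster.abc123.dax-clusters.us-east-1.amazonaws.com"
)

# Served from the item cache if present (5-minute default TTL); otherwise
# DAX reads through to DynamoDB and caches the result. Eventually consistent.
resp = dax.get_item(
    TableName="Orders",
    Key={"order_id": {"S": "42"}},
)
print(resp.get("Item"))
```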
  • Keys and Indices:
    • Two (2) types of Primary key:
      1. Partition Key
      2. Composite Key (Partition Key + Sort Key)
    • Indices:
      • Local Secondary Index
        • Can be created only when the table is created; cannot be added later
        • Uses the same Partition Key but a different Sort Key
      • Global Secondary Index
        • Can be created anytime
        • Can use a different Partition Key or Sort Key
        • Has its own RCU/WCU. But if the writes are throttled, the write to the main table is also throttled.
        • Only supports eventual consistency
    • A hot partition can cause throttling if the partition limits of 3000 RCU or 1000 WCU (or a combination of both) per second are exceeded.
  • DynamoDB Streams
    • Time-ordered sequence of item-level changes (a stream)
    • Records CRUD operations in the stream
    • Stored in a log for 24 hours
    • Mainly used to trigger events (e.g. trigger Lambda; see the sketch below)
    • Has a separate endpoint
    • Can store item images before and after the change
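A minimal sketch of a Lambda handler attached to a DynamoDB stream; the record layout (eventName, OldImage/NewImage) follows the documented stream event format, and which images appear depends on the stream view type:

```python
def handler(event, context):
    for record in event["Records"]:
        operation = record["eventName"]  # INSERT | MODIFY | REMOVE
        images = record["dynamodb"]
        if operation == "INSERT":
            print("created:", images.get("NewImage"))
        elif operation == "MODIFY":
            # Both images are available with the NEW_AND_OLD_IMAGES view type
            print("changed:", images.get("OldImage"), "->", images.get("NewImage"))
        elif operation == "REMOVE":
            print("deleted:", images.get("OldImage"))
```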
  • Global Table
    • Multi-way replication across regions
    • All copies are Active i.e. application can read or write on any region
    • Requires DynamoDB Streams
  • TTL
    • Defines expiry time of the data
    • Once passed expiry data is marked for deletion
    • Items are typically deleted within 48 hours of expiration
    • Good for removing old or irrelevant data
    • Help reduce the storage requirement (and cost)
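Enabling TTL is a one-call boto3 operation; the table and attribute names here are hypothetical, and the attribute must hold a Unix epoch timestamp in seconds:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Items whose "expires_at" timestamp has passed are marked for deletion
# and removed in the background (typically within 48 hours).
dynamodb.update_time_to_live(
    TableName="Sessions",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)
```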
  • API
    • Items:
      • PutItem — Creates a new item, or replaces an old item with a new item
      • GetItem — Returns a set of attributes for the item with the given primary key
      • UpdateItem — Edits an existing item’s attributes, or adds a new item to the table if it does not already exist.
      • DeleteItem — Deletes a single item in a table by primary key. 
      • BatchGetItem — Read up to 100 items from one or more tables.
      • BatchWriteItem — Create or delete up to 25 items in one or more tables.
      • Projection Expression is a string that identifies the attributes that you want (SELECT <projection expression – list of columns> from ..)
    • Query (Collections)
      • The Query operation in Amazon DynamoDB finds items based on primary key values.
      • Has Filter Expression – determines which items within the Query results should be returned. (SELECT <projection expression – list of columns> from X where <Filter Expression>...)
      • Can Limit the number of items that it reads.
        • Returns LastEvaluatedKey
      • Has Pagination – Query results are divided into “pages” of data that are 1 MB in size (or less)
    • Scans
      • Scan operation in Amazon DynamoDB reads every item in a table or a secondary index.
      • Can use ProjectionExpression to limit the attribute
      • Has Filter Expression (see Query)
      • Has Limit
        • Returns LastEvaluatedKey
      • Has Pagination
        • Returns NextToken if --max-items is used (AWS CLI)
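A hedged boto3 sketch tying the Query pieces together: key condition, filter and projection expressions, and following LastEvaluatedKey across 1 MB pages. Table and attribute names are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")

kwargs = {
    "TableName": "Orders",
    "KeyConditionExpression": "customer_id = :c",  # primary key lookup
    "FilterExpression": "amount > :min",           # applied after the read
    "ProjectionExpression": "order_id, amount",    # only these attributes
    "ExpressionAttributeValues": {
        ":c": {"S": "cust-1"},
        ":min": {"N": "100"},
    },
}

while True:
    page = dynamodb.query(**kwargs)
    for item in page["Items"]:
        print(item)
    if "LastEvaluatedKey" not in page:  # no more pages
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```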

ElastiCache

  • Managed Redis/Memcached
  • In-memory key/value store
  • Sub-millisecond latency
  • Supports clustering and Multi-AZ
  • Sharding
    • Also known as partitioning: splitting the data up by key
    • Useful for increasing performance by reducing the hit and memory load on any one resource
  • Replication
    • Also known as mirroring: copying all the data to every replica
    • Useful for achieving high availability of reads
  • Cluster Mode Disabled
    • Has a single shard, inside of which is a collection of Redis nodes; one primary read/write node and up to five secondary, read-only replica nodes.
    • Each read replica maintains a copy of the data from the cluster’s primary node.
    • Asynchronous replication mechanisms are used to keep the read replicas synchronized with the primary.
    • Applications can read from any node in the cluster.
    • Applications can write only to the primary node. Read replicas improve read throughput and guard against data loss in cases of a node failure.
    • Cannot convert to Cluster Mode
  • Cluster Mode Enabled
    •  1 to 500 shards 
    • Each shard has a primary node and up to five read-only replica nodes. 
    • You cannot manually promote any of the replica nodes to primary.
    • You can only change the structure of a cluster, the node type, and the number of nodes by restoring from a backup. 
    • Multi-AZ is required.
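A hedged sketch using the redis-py package against a cluster-mode-disabled replication group; both endpoints are hypothetical. Writes must target the primary endpoint while reads can use the reader endpoint:

```python
import redis

# Hypothetical primary and reader endpoints from the ElastiCache console
primary = redis.Redis(host="my-cache.abc123.ng.0001.use1.cache.amazonaws.com", port=6379)
replica = redis.Redis(host="my-cache-ro.abc123.ng.0001.use1.cache.amazonaws.com", port=6379)

primary.set("session:42", "payload", ex=300)  # write to the primary, 5-minute expiry
print(replica.get("session:42"))  # read from a replica; async replication may lag briefly
```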

Redshift

  • For Data Warehousing, Analytics and BI (OLAP)
  • Uses PostgreSQL behind the scenes.
  • Data can be loaded via:
    1. Kinesis Data Firehose
    2. S3 copy
    3. An application using JDBC
  • Uses columnar storage and columnar compression.
  • MPP (Massively Parallel Processing) query execution
  • Can have up to 128 nodes
  • Backup is enabled by default with retention of up to 35 days. It tries to keep 3 copies of data (original, replica, and S3)
  • Only runs on 1 AZ.
  • Encrypted at rest and uses SSL for in-flight
  • For DR need to take incremental snapshots and store them in S3.
  • Can configure snapshot to copy to another region
  • Redshift Spectrum – query data in S3 without loading it into Redshift
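A hedged sketch of an S3 COPY load using the Redshift Data API; the cluster, database, IAM role, and bucket are all hypothetical:

```python
import boto3

client = boto3.client("redshift-data")

copy_sql = """
    COPY sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV;
"""

# Runs asynchronously; track progress with describe_statement if needed.
client.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
```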

Glue

  • Managed ETL service
  • Serverless
  • A Data Crawler automates the discovery of your data schema. The discovered schema can be stored in the Glue Data Catalog, which is used when authoring your ETL jobs.
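A hedged boto3 sketch: point a crawler at a (hypothetical) S3 prefix and publish the discovered schema into a Glue Data Catalog database:

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-crawler",  # hypothetical names throughout
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")  # tables appear in the catalog when it finishes
```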

Neptune

  • Fully managed Graph Database (like Neo4J)
  • HA across 3 AZs and supports clustering
  • Has IAM authentication

Athena

  • An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
  • Serverless
  • No need to perform ETL to analyze data
  • The data format can be CSV, JSON, ORC, Apache Parquet and Avro
  • Uses the Presto engine
  • Output to S3 so need to have S3 security
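A hedged boto3 sketch of the Athena flow: submit standard SQL against data in S3, poll for completion, then read the results (which also land in the S3 output location). Database, table, and bucket names are hypothetical:

```python
import time
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "web_logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = resp["QueryExecutionId"]

# Poll until the query leaves the QUEUED/RUNNING states
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows)
```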

OpenSearch (previously Amazon Elasticsearch Service)

  • A distributed, open-source search and analytics suite.
  • Based on Apache Lucene
  • OpenSearch and OpenSearch Dashboards were originally derived from Elasticsearch 7.10.2 and Kibana 7.10.2

EMR (Elastic Map Reduce)

  • Big data platform for data processing
  • Helps create a Hadoop cluster with hundreds of EC2 instances.
  • Deploy workloads to EMR using Amazon EC2, Amazon Elastic Kubernetes Service (EKS), or on-premises AWS Outposts.
  • Has auto-scaling and integrates with Spot Instances
  • Uses open-source frameworks such as Apache Spark, Apache Hive, and Presto.

DMS (Data Migration Service)

  • Can perform homogeneous (same DB type) and heterogeneous (different DB type) migrations
  • Source and destination can be in AWS or on-prem.
  • Requires a replication instance (EC2) to run the migration task
  • Can create a task that captures ongoing changes after you complete your initial (full-load) migration to a supported target data store (CDC – Change Data Capture)
  • Can use SCT (Schema Conversion Tool) if the source and destination DB are of different engines. SCT is a separate program
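A hedged boto3 sketch of a full-load-plus-CDC task; all ARNs are hypothetical, and the endpoints and replication instance must already exist:

```python
import json
import boto3

dms = boto3.client("dms")

dms.create_replication_task(
    ReplicationTaskIdentifier="orders-migration",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SRC",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INST",
    MigrationType="full-load-and-cdc",  # initial full load, then ongoing changes (CDC)
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "all-tables",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```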

AWS DataSync

  • Automates and accelerates moving data between on-premises and AWS storage service
    • Uses NFS/SMB, the S3 API, or HDFS via an agent running on a VM, Snowcone, or S3 on Outposts to move data
  • Transfer data between AWS Storage services so you can replicate, archive, or share application data easily.
  • Synchronization is scheduled (e.g. hourly, daily, weekly)
