Supports document (JSON, XML, HTML) and key-value data models
Supports transactions
Serverless, fully managed, and replicated across AZs
Can provide eventually consistent, strongly consistent, or transactional reads
Low latency (single-digit milliseconds)
Data is queried through keys
Uses IAM for authentication
Maximum item size is 400 KB
Provisioned Throughput (Read/Write Capacities)
Defines how much data can be read from/written to a table per second
Transactional operations require 2x the capacity of strongly consistent ones.
RCU/WCU are spread across partitions.
Capacity Units:
1 WCU (Write Capacity Unit) = one 1 KB write/sec
e.g. Need to write 5 items in 1 sec at 4 KB per item
5 x 4 = 20 KB / (1 KB write/sec) = 20 WCU required, or 40 WCU (if transactional)
e.g. Need to write 2 items in 1 sec at 2.5 KB per item
2 x 3 (2.5 rounded up to the next KB) = 6 KB / (1 KB write/sec) = 6 WCU required
1 RCU (Read Capacity Unit) = one strongly consistent 4 KB read/sec, or two eventually consistent 4 KB reads/sec (eventual consistency costs half of strong)
Trick:
Think in terms of strongly consistent reads, i.e. one strongly consistent RCU covers 4 KB.
Think of a strongly consistent RCU as a box that can hold 4 KB, and work out how many boxes each item needs.
Round the item size up to the nearest 4 KB. (A small calculator sketch follows the examples below.)
e.g. 10 strong reads/sec at 4 KB per item
(4/4 = 1 box) x 10 = 40 KB / (4 KB read/sec) = 10 RCU
e.g. 16 eventual reads/sec at 12 KB per item
(12/4 = 3 boxes) x 16 = 48; 48/2 (since this is eventual) = 24 RCU, or 96 RCU (if transactional, i.e. 2x the 48 needed for strong)
e.g. 12 strong reads/sec at 10 KB per item
(10/4, rounded up to 3 boxes) x 12 = 36 RCU
Note:
You need to read 10 KB per item, so you need 3 boxes of 4 KB (4 + 4 + 4 = 12 KB).
But you have 12 items to read,
so 12 x 3 = 36 RCU.
e.g. 10 eventual reads/sec at 13 KB per item
(13/4, rounded up to 4 boxes) x 10 = 40; 40/2 = 20 RCU
Note:
You need to read 13 KB per item, so you need 4 boxes of 4 KB (4 x 4 = 16 KB to hold 13 KB).
But you have 10 items to read,
so 10 x 4 = 40 RCU.
But this is eventual, so you only need half: 40/2 = 20 RCU.
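The capacity arithmetic above mechanizes neatly. A minimal sketch, assuming the standard 1 KB WCU and 4 KB RCU block sizes described above (function names are illustrative):

```python
import math

def wcu(items_per_sec, item_kb, transactional=False):
    # 1 WCU = one 1 KB write/sec; round each item up to the next whole KB
    units = items_per_sec * math.ceil(item_kb)
    return units * 2 if transactional else units

def rcu(items_per_sec, item_kb, eventual=False, transactional=False):
    # 1 RCU = one strongly consistent 4 KB read/sec; round up to 4 KB "boxes"
    units = items_per_sec * math.ceil(item_kb / 4)
    if transactional:
        return units * 2                            # transactional = 2x strong
    return math.ceil(units / 2) if eventual else units  # eventual = half of strong

print(wcu(5, 4))                     # 20 (40 if transactional)
print(wcu(2, 2.5))                   # 6
print(rcu(16, 12, eventual=True))    # 24
print(rcu(12, 10))                   # 36
print(rcu(10, 13, eventual=True))    # 20
```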
Capacity Modes:
Provisioned
Need to provision WCU and RCU ahead of time
Pay based on the provisioned WCU and RCU
Can enable auto-scaling.
On-Demand
Scales up or down with the workload
Pay-per-request model (good for unknown workloads or spiky loads)
More expensive per request
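As a sketch of the two modes via boto3 (table and key names are illustrative):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Provisioned mode: declare RCU/WCU up front
dynamodb.create_table(
    TableName="Orders",
    AttributeDefinitions=[{"AttributeName": "order_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},
)

# On-demand mode: no capacity to declare, pay per request
dynamodb.create_table(
    TableName="Events",
    AttributeDefinitions=[{"AttributeName": "event_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "event_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
```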
DAX
Write-through cache: data is written to both DAX and DynamoDB
Microsecond-latency reads
Reads are eventually consistent; not suitable if strong consistency is required.
5-minute TTL by default; after the TTL expires, DAX reads from DynamoDB again.
Not suitable for write-intensive operations
Keys and Indices:
Two (2) types of Primary key:
Partition Key
Composite Key (Partition Key + Sort Key)
Indices:
Local Secondary Index (LSI)
Can only be created when the table is created; cannot be added later
Uses the same Partition Key but a different Sort Key
Global Secondary Index (GSI)
Can be created anytime (see the sketch after this list)
Can use a different Partition Key or Sort Key
Has its own RCU/WCU. But if the writes are throttled, the write to the main table is also throttled.
Only supports eventual consistency
A hot partition can cause throttling if the partition limits of 3000 RCU or 1000 WCU (or a combination of both) per second are exceeded.
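Because a GSI can be created at any time, it can be added to a live table. A boto3 sketch (table, index, and attribute names are illustrative):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Add a GSI to an existing table
dynamodb.update_table(
    TableName="Orders",
    AttributeDefinitions=[{"AttributeName": "customer_id", "AttributeType": "S"}],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "customer-index",
            "KeySchema": [{"AttributeName": "customer_id", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
            # A GSI has its own capacity, separate from the base table
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    }],
)
```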
DynamoDB Streams
Time-ordered sequence (stream) of item-level changes
Records create/update/delete operations in the stream.
Records are stored in a log for 24 hours
Mainly used to trigger events (e.g. trigger Lambda)
Has a separate endpoint
Can store the item image before and after the change
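A minimal sketch of a Lambda handler consuming a DynamoDB stream, assuming the stream view type includes new images:

```python
# Lambda handler triggered by a DynamoDB stream event
def handler(event, context):
    for record in event["Records"]:
        event_name = record["eventName"]  # INSERT, MODIFY, or REMOVE
        if event_name in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"].get("NewImage", {})
            print(f"{event_name}: {new_image}")
        else:
            # REMOVE records carry only the key attributes
            print(f"REMOVE: {record['dynamodb']['Keys']}")
```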
Global Table
Multi-way replication across regions
All copies are active, i.e. applications can read or write in any region
Requires DynamoDB Streams
TTL
Defines an expiry time for items
Once past expiry, the item is marked for deletion
Typically deleted within 48 hours of expiration
Good for removing old or irrelevant data
Helps reduce storage requirements (and cost)
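A sketch of enabling TTL and writing an expiring item via boto3 (table and attribute names are illustrative; the TTL attribute holds an epoch-seconds timestamp):

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")

# Enable TTL on an attribute holding an epoch-seconds timestamp
dynamodb.update_time_to_live(
    TableName="Sessions",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Write an item that expires in one hour
dynamodb.put_item(
    TableName="Sessions",
    Item={
        "session_id": {"S": "abc123"},
        "expires_at": {"N": str(int(time.time()) + 3600)},
    },
)
```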
API
Items:
PutItem — Creates a new item, or replaces an old item with a new item
GetItem — Returns a set of attributes for the item with the given primary key
UpdateItem — Edits an existing item’s attributes, or adds a new item to the table if it does not already exist.
DeleteItem — Deletes a single item in a table by primary key.
BatchGetItem — Read up to 100 items from one or more tables.
BatchWriteItem — Create or delete up to 25 items in one or more tables.
A Projection Expression is a string that identifies the attributes you want returned (analogous to the column list in SELECT <projection expression – list of columns> FROM ...)
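A minimal boto3 sketch of PutItem and GetItem with a projection expression (table and attribute names are illustrative):

```python
import boto3

table = boto3.resource("dynamodb").Table("Orders")

# PutItem: create or fully replace an item
table.put_item(Item={"order_id": "o-1", "status": "NEW", "total": 42})

# GetItem: fetch by primary key, returning only selected attributes
resp = table.get_item(
    Key={"order_id": "o-1"},
    ProjectionExpression="order_id, #s",
    ExpressionAttributeNames={"#s": "status"},  # "status" is a reserved word
)
print(resp.get("Item"))
```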
Query
The Query operation in Amazon DynamoDB finds items based on primary key values.
Has a Filter Expression, which determines which items within the Query results are returned (analogous to SELECT <projection expression – list of columns> FROM X WHERE <Filter Expression> ...). Filtering happens after the read, so read capacity is still consumed.
Can Limit the number of items that it reads.
Returns LastEvaluatedKey
Has Pagination – Query results are divided into “pages” of data that are 1 MB in size (or less)
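A sketch of a paginated Query via boto3, reusing the illustrative GSI from the earlier sketch; because the filter is applied after the read, filtered-out items still consume RCU:

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Orders")

# Query by partition key on the GSI, filter after the read,
# and follow LastEvaluatedKey page by page
kwargs = {
    "IndexName": "customer-index",
    "KeyConditionExpression": Key("customer_id").eq("c-42"),
    "FilterExpression": Attr("total").gt(100),
}
while True:
    resp = table.query(**kwargs)
    for item in resp["Items"]:
        print(item)
    if "LastEvaluatedKey" not in resp:
        break  # no more pages
    kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
```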
Scans
A Scan operation in Amazon DynamoDB reads every item in a table or a secondary index.
Can use ProjectionExpression to limit the attributes returned
Has Filter Expression (see Query)
Has Limit
Returns LastEvaluatedKey
Has Pagination
Returns NextToken if --max-items is used (AWS CLI)
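A sketch of a paginated Scan using the boto3 paginator, which follows LastEvaluatedKey internally (names are illustrative):

```python
import boto3

client = boto3.client("dynamodb")

# Scan every item; the paginator handles LastEvaluatedKey for us
paginator = client.get_paginator("scan")
pages = paginator.paginate(
    TableName="Orders",
    ProjectionExpression="order_id, #s",
    ExpressionAttributeNames={"#s": "status"},
)
for page in pages:
    for item in page["Items"]:
        print(item)
```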
ElastiCache
Managed Redis/Memcached
In-memory key/value store
sub-millisecond latency
Supports clustering and Multi-AZ
Sharding
Also known as partitioning: splitting the data up by key.
Useful for increasing performance by reducing the hit and memory load on any one resource.
Replication
Also known as mirroring: copying all data. Useful for achieving high availability of reads.
Cluster Mode Disabled
Has a single shard, inside of which is a collection of Redis nodes; one primary read/write node and up to five secondary, read-only replica nodes.
Each read replica maintains a copy of the data from the cluster’s primary node.
Asynchronous replication mechanisms are used to keep the read replicas synchronized with the primary.
Applications can read from any node in the cluster.
Applications can write only to the primary node. Read replicas improve read throughput and guard against data loss in cases of a node failure.
Cannot be converted to cluster mode later
Cluster Mode Enabled
1 to 500 shards
Each shard has a primary node and up to five read-only replica nodes.
You cannot manually promote any of the replica nodes to primary.
You can only change the structure of a cluster, the node type, and the number of nodes by restoring from a backup.
Multi-AZ is required.
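A minimal cache-aside sketch against a Redis-compatible ElastiCache endpoint using the redis-py client (the endpoint, key scheme, and db_lookup callback are illustrative):

```python
import json
import redis

# Endpoint is illustrative; use your cluster's configuration endpoint
cache = redis.Redis(host="my-cluster.xxxxxx.use1.cache.amazonaws.com", port=6379)

def get_user(user_id, db_lookup):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)             # cache hit
    user = db_lookup(user_id)                 # cache miss: read the database
    cache.setex(key, 300, json.dumps(user))   # cache for 5 minutes
    return user
```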
Redshift
For Data Warehousing, Analytics and BI (OLAP)
Uses PostgreSQL behind the scenes.
Data can be loaded via (see the sketch after this list):
Kinesis Data Firehose
S3 copy
An application using JDBC
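A hedged sketch of the S3 COPY path using the Redshift Data API via boto3 (cluster, database, role, and bucket names are illustrative):

```python
import boto3

client = boto3.client("redshift-data")

# Load a table from S3 with COPY, submitted through the Data API
client.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        COPY sales
        FROM 's3://my-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV;
    """,
)
```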
Uses columnar storage and columnar compression.
MPP (Massively Parallel Processing) for query execution
Can have up to 128 nodes
Backups are enabled by default and can be retained for up to 35 days. Redshift tries to keep 3 copies of the data (original, replica, and S3).
Only runs on 1 AZ.
Encrypted at rest and uses SSL for in-flight
For DR, take incremental snapshots and store them in S3.
Snapshots can be configured to copy to another region
Redshift Spectrum: query data in S3 without loading it into Redshift
Glue
Managed ETL service
Serverless
A Glue Crawler automates discovery of your data's schema; the discovered schema is stored in the Glue Data Catalog, which is used when authoring your ETL jobs (see the sketch below).
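A hedged boto3 sketch of the crawler-to-catalog flow (crawler and database names are illustrative; the crawler runs asynchronously, so in practice you would poll get_crawler before reading the catalog):

```python
import boto3

glue = boto3.client("glue")

# Kick off a crawler run (asynchronous)
glue.start_crawler(Name="sales-crawler")

# Later: inspect the schemas the crawler wrote to the Data Catalog
tables = glue.get_tables(DatabaseName="sales_db")
for t in tables["TableList"]:
    print(t["Name"], [c["Name"] for c in t["StorageDescriptor"]["Columns"]])
```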
Neptune
Fully managed Graph Database (like Neo4J)
HA across 3 AZs, with clustering
Has IAM authentication
Athena
An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
Serverless
No need to perform ETL to analyze data
The data format can be CSV, JSON, ORC, Apache Parquet and Avro
Uses the Presto engine
Results are output to S3, so S3 security must be configured (see the sketch below)
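A minimal boto3 sketch of running a query (database, table, and output bucket are illustrative):

```python
import boto3

athena = boto3.client("athena")

# Query data in place on S3; results land in the S3 output location
resp = athena.start_query_execution(
    QueryString="SELECT status, count(*) FROM logs GROUP BY status",
    QueryExecutionContext={"Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution until it succeeds
```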
OpenSearch (previously Amazon Elasticsearch Service)
A distributed, open-source search and analytics suite.
Based on Apache Lucene
OpenSearch and OpenSearch Dashboards were originally derived from Elasticsearch 7.10.2 and Kibana 7.10.2, respectively
EMR (Elastic Map Reduce)
Big data platform for data processing
Helps create Hadoop clusters with hundreds of EC2 instances.
Deploy workloads to EMR using Amazon EC2, Amazon Elastic Kubernetes Service (EKS), or on-premises AWS Outposts.
Has auto-scaling and integrates with Spot Instances
Uses open-source frameworks such as Apache Spark, Apache Hive, and Presto.
DMS (Database Migration Service)
Can perform homogeneous (same DB engine) and heterogeneous (different DB engine) migrations
Source and destination can be in AWS or on-prem.
Requires a replication instance (EC2) to run the migration task
Can create a task that captures ongoing changes after you complete your initial (full-load) migration to a supported target data store (CDC – Change Data Capture)
Can use SCT (Schema Conversion Tool) if the source and destination DBs are of different engines; SCT is a separate program
AWS DataSync
Automates and accelerates moving data between on-premises and AWS storage services
Uses NFS/SMB, the S3 API, or HDFS via an agent running on a VM, Snowcone, or S3 on Outposts to move data
Transfer data between AWS Storage services so you can replicate, archive, or share application data easily.
Synchronization is scheduled (e.g. hourly, daily, weekly)