AWS Certified Data Engineer Associate (DEA-C01) Review Material – Kinesis

Kinesis Data Stream (KDS)

Overview

  • Used to process large streams of data records in real-time.
  • Retains data for 24 hours by default, extendable up to 365 days.
  • Data records written to the stream are immutable (they cannot be modified or deleted once added).
  • KDS is a public service (cannot be placed inside a VPC)

High-Level Architecture

  • KDS
    • A Kinesis data stream is a set of shards.
    • Each shard has a sequence of data records.
    • Each data record has a sequence number that is assigned by Kinesis Data Streams.
  • Producer
    • Put records into Amazon Kinesis Data Streams
    • Types of Producers:
      • AWS SDK
      • KPL
        • Automated and configurable retry mechanism
        • Supports C++ and Java
      • Kinesis Agents
      • AWS Managed Services w/ built-in producer function:
        • CloudWatch
        • AWS IoT
        • KDA
    • KPL Batching
      • Performs a single action on multiple items instead of repeatedly performing the action on each individual item.
      • It can incur a delay due to its buffering action.
      • Two Types:
        1. Aggregation
          • Storing multiple records within a single Kinesis Data Streams record.
          • Aggregation allows customers to increase the number of records sent per API call, which effectively increases producer throughput.
          2. Collection
            • Batching multiple Kinesis Data Streams records and sending them in a single HTTP request with a call to the PutRecords API operation (see the sketch below).
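
The KPL itself ships as a C++ core with a Java binding, so a quick way to see the "collection" idea from Python is the plain AWS SDK. A minimal boto3 sketch (stream name and payloads are invented for illustration; this batches records into one PutRecords call but does not give you KPL aggregation or automatic retries):

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    # Hypothetical events; batching them into one PutRecords HTTP request
    # is the SDK-level equivalent of KPL "collection".
    events = [{"device_id": i, "temp": 20 + i} for i in range(10)]

    entries = [
        {
            "Data": json.dumps(e).encode("utf-8"),
            "PartitionKey": str(e["device_id"]),  # decides the target shard
        }
        for e in events
    ]

    response = kinesis.put_records(StreamName="my-demo-stream", Records=entries)
    print("Failed records:", response["FailedRecordCount"])
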
  • Consumers
    • Get records from Amazon Kinesis Data Streams and process them
    • Types of Consumers:
      • AWS SDK
      • KCL
      • AWS Managed Services w/ built-in consumer function:
        • Lambda
        • KDF
        • KDA
    • KCL:
      • It helps you consume and process data from a Kinesis data stream by taking care of many of the complex tasks associated with distributed computing. 
      • It provides a layer of abstraction around KDS API (AWS SDK)
      • Uses the concept of a lease – data that defines the binding between a worker and a shard.
      • Uses a Lease table – a unique Amazon DynamoDB table that is used to keep track of the shards in a KDS data stream that are being leased and processed by the workers of the KCL
      • Each KCL application must use its own DynamoDB table.
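
The KCL is a Java library (other languages go through its MultiLangDaemon), so the sketch below only illustrates, with plain boto3 and an assumed stream name, the lower-level bookkeeping that the KCL and its DynamoDB lease table normally hide:

    import time
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    stream = "my-demo-stream"  # assumed stream name

    # Without the KCL you must enumerate shards and track iterators yourself;
    # the KCL also checkpoints progress per shard in its DynamoDB lease table.
    shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]

    while iterator:
        out = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in out["Records"]:
            print(record["SequenceNumber"], record["Data"])
        iterator = out.get("NextShardIterator")
        time.sleep(1)  # stay under the 5 GetRecords calls/sec/shard limit
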
  • Shards
    • Uniquely identified sequence of data records in a stream.
    • Reads:
      • Each shard supports up to 5 read transactions (GetRecords calls) per second.
      • Maximum total data read rate of 2 MB per second per shard, shared across all consumers of that shard.
      • In enhanced fan-out mode, each registered consumer gets its own 2 MB per second per shard.
    • Writes:
      • Each shard supports up to 1,000 records written per second.
      • Maximum total data write rate of 1 MB per second per shard.
    • Partition key:
      • Used to group data by shard within a stream.
      • It is associated with each data record to determine which shard a given data record belongs to. 
      • An MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards using the hash key ranges of the shards. 
    • Sequence Number:
      • Unique per partition-key within its shard
      • KDS assigns the sequence number after you write to the stream with client.putRecords or client.putRecord
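
You can reproduce the partition-key-to-shard mapping yourself to see how it works. A purely illustrative sketch against an assumed stream name (Kinesis performs this mapping for you on every PutRecord/PutRecords call):

    import hashlib
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    def shard_for_key(stream_name, partition_key):
        # MD5 of the partition key, read as a 128-bit integer, matched
        # against each shard's hash key range.
        hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
        for shard in kinesis.list_shards(StreamName=stream_name)["Shards"]:
            rng = shard["HashKeyRange"]
            if int(rng["StartingHashKey"]) <= hash_value <= int(rng["EndingHashKey"]):
                return shard["ShardId"]

    print(shard_for_key("my-demo-stream", "device-42"))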

Capacity Mode

  • Determines how the capacity of a data stream is managed and how you are charged for the usage of your data stream.
  • Two (2) Capacity Modes:
    1. On-demand
      • It requires no capacity planning and automatically scales to handle gigabytes of write and read throughput per minute.
      • Automatically manages the shards in order to provide the necessary throughput.
      • It is ideal for addressing the needs of highly variable and unpredictable application traffic.
      • You pay per GB of data written and read from your data streams. 
    2. Provisioned
      • You specify the number of shards needed for the data stream.
      • The total capacity of a data stream is the sum of the capacities of its shards.
      • You can increase or decrease the number of shards in a data stream as needed.
      • You pay by shard hour and PUT Payload Units
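
Both modes are chosen at creation time (and can be switched later via the UpdateStreamMode API). A minimal boto3 sketch with made-up stream names:

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    # On-demand: no shard planning; billed per GB written and read.
    kinesis.create_stream(
        StreamName="orders-ondemand",
        StreamModeDetails={"StreamMode": "ON_DEMAND"},
    )

    # Provisioned: you pick the shard count; billed per shard-hour + PUT payload units.
    kinesis.create_stream(
        StreamName="orders-provisioned",
        ShardCount=4,
        StreamModeDetails={"StreamMode": "PROVISIONED"},
    )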

Resharding a Stream

  • It lets you adjust the number of shards in your stream to adapt to changes in the rate of data flow through the stream.
  • Two (2) types of operations:
    1. Split:
      • Divide a single shard into two shards.
      • Increases the number of shards in your stream and therefore increases the data capacity of the stream.
      • Increases the cost of your stream. 
    2. Merge:
      • Combine two shards into a single shard.  
      • Reduces the number of shards in your stream and therefore decreases the data capacity.
      • Reduces the cost of your stream.
  • Resharding is always pairwise, i.e. cannot split into more than two shards in a single operation, cannot merge more than two shards in a single operation.
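
You can reshard shard-by-shard with SplitShard / MergeShards, or target a total count and let Kinesis perform the pairwise splits and merges via UpdateShardCount. A boto3 sketch with assumed stream and shard names:

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    stream = "orders-provisioned"  # assumed stream name

    # Split: pick a new starting hash key inside the parent shard's range (here the midpoint).
    parent = kinesis.list_shards(StreamName=stream)["Shards"][0]
    rng = parent["HashKeyRange"]
    midpoint = (int(rng["StartingHashKey"]) + int(rng["EndingHashKey"])) // 2
    kinesis.split_shard(
        StreamName=stream,
        ShardToSplit=parent["ShardId"],
        NewStartingHashKey=str(midpoint),
    )

    # Merge: the two shards must be adjacent (contiguous hash key ranges).
    kinesis.merge_shards(
        StreamName=stream,
        ShardToMerge="shardId-000000000001",         # assumed shard IDs
        AdjacentShardToMerge="shardId-000000000002",
    )

    # Or target a total and let Kinesis do the pairwise resharding for you.
    kinesis.update_shard_count(
        StreamName=stream, TargetShardCount=8, ScalingType="UNIFORM_SCALING"
    )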

Kinesis Data Firehose (KDF)

Overview

  • A managed service for delivering real-time streaming data to a number of destinations.
  • Sources:
    • Kinesis Agent
    • AWS SDK
    • CW Log
    • CW Event
    • AWS IoT
    • KDS
      • More than one Firehose stream can read from the same Kinesis stream.
      • When a Kinesis stream is the source, Firehose’s own PutRecord and PutRecordBatch operations are disabled, so a producer (e.g., the Kinesis Agent) cannot write directly to the Firehose stream at the same time.
  • Destinations:
    • S3
    • RedShift
    • Amazon OpenSearch Service
    • Splunk
    • Snowflake
    • Custom HTTP Endpoint
  • Can convert data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3
  • Near real-time delivery (data is buffered before it is sent to the destination)
  • In rare circumstances, such as a request timeout upon a data delivery attempt, a delivery retry by Firehose could introduce duplicates if the previous request eventually goes through.
  • Support compression for S3 target
  • Automatic Scaling
  • You pay for the volume of data you ingest into the service
  • It can invoke a Lambda function to transform incoming source data. Several blueprints are available. (only supports invocation time of up to 5 minutes)
  • Access to Firehose is granted through IAM policies.
  • Buffers incoming streaming data to a certain size and for a certain period of time before delivering it to the specified destinations.
    • Buffering Hints:
      • buffer size – measured in MBs (1 MB to 128 MB)
      • buffer interval – measured in seconds. (60 to 900 seconds)
      • When the Firehose stream scales, it can affect the configured buffering hints (Firehose may adjust them dynamically).
  • Data payload size limit is 1MB
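
Putting the pieces above together, here is a boto3 sketch that creates a Firehose stream with a Kinesis stream as source, S3 as destination, GZIP compression, and explicit buffering hints (all names, ARNs, and hint values are placeholders):

    import boto3

    firehose = boto3.client("firehose", region_name="us-east-1")

    firehose.create_delivery_stream(
        DeliveryStreamName="orders-to-s3",
        DeliveryStreamType="KinesisStreamAsSource",  # direct PutRecord(Batch) is disabled
        KinesisStreamSourceConfiguration={
            "KinesisStreamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/orders",
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-read-kds",
        },
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-write-s3",
            "BucketARN": "arn:aws:s3:::my-orders-bucket",
            "CompressionFormat": "GZIP",
            "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        },
    )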

Data Delivery

  • S3
    • concatenates multiple incoming records based on the buffering configuration
    • delivers the records to Amazon S3 as an Amazon S3 object
    • any lambda transformation failure will also be written to an S3 bucket
    • A single Firehose stream can currently only deliver data to one Amazon S3 bucket
  • Redshift
    • first, it delivers incoming data to your S3 bucket
    • Then, an Amazon Redshift COPY command will be issued to load the data from your S3 bucket.
    • If Redshift is in a VPC, the cluster's security group must unblock (allow) Firehose's IP addresses.
  • OpenSearch
    • buffers incoming records based on the buffering configuration
    • Then, it generates an OpenSearch Service or OpenSearch Serverless bulk request to index multiple records to your OpenSearch service cluster.
    • Firehose stream and destination Amazon OpenSearch Service domain need to be in the same region
  • Splunk
    • concatenates the bytes that were sent
  • HTTP endpoint
    • can use an AWS Lambda function to transform the incoming record(s) into the format that the service provider's integration expects (see the transform sketch after this list)
  • Snowflake
    • internally buffers data for one second and uses Snowflake streaming API operations to insert data to Snowflake
    • can only deliver data to one Snowflake table
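
For the transformation Lambda mentioned above, Firehose invokes the function with a batch of records and expects each one back with a recordId, a result, and re-encoded data. A minimal Python sketch of that contract (the transformation itself is invented):

    import base64
    import json

    def lambda_handler(event, context):
        output = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # illustrative transformation
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",  # or "Dropped" / "ProcessingFailed"
                "data": base64.b64encode(
                    (json.dumps(payload) + "\n").encode("utf-8")
                ).decode("utf-8"),
            })
        return {"records": output}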

Kinesis Data Analytics (KDA)

Overview

  • Two (2) types of service offered:
    • KDA for SQL Application (EOL 2026)
    • Managed Service for Apache Flink
  • Managed Service for Apache Flink
    • Process data streams in real-time with SQL or Apache Flink.
    • Support Java, Python & Scala
    • Applications primarily use either the DataStream API or the Table API.
    • Connectors:
      • Software components that move data into and out of an Amazon Managed Service for Apache Flink application
      • Sample connectors:
        • KDS
        • KDF
        • Kafka
        • OpenSearch
        • DynamoDB
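
Inside a Managed Service for Apache Flink application, these connectors are typically declared through the Table API. A PyFlink sketch, assuming an input stream named orders with JSON records; the connector option names ('stream', 'scan.stream.initpos', etc.) vary slightly between Kinesis connector versions, so treat them as illustrative:

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Source connector: read JSON records from a Kinesis data stream.
    t_env.execute_sql("""
        CREATE TABLE orders (
            order_id STRING,
            amount   DOUBLE
        ) WITH (
            'connector' = 'kinesis',
            'stream' = 'orders',
            'aws.region' = 'us-east-1',
            'scan.stream.initpos' = 'LATEST',
            'format' = 'json'
        )
    """)

    # A continuous query over the stream.
    t_env.execute_sql(
        "SELECT order_id, SUM(amount) AS total FROM orders GROUP BY order_id"
    ).print()
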
  • KDA SQL
    • Support standard SQL operators such as SELECT, INSERT, CREATE
    • It provides a number of functions, such as:
      • Aggregate
      • Analytic
      • Date & Time
      • Statistical Variance and Deviation
      • Machine Learning, e.g.
        • RANDOM_CUT_FOREST
          • Detects anomalies in your data stream. A record is an anomaly if it is distant from other records.
  • You pay mainly by KPU (Kinesis Processing Unit), per Hour.

Hands-On

1. Cross-account streaming of CW Logs using KDS
  • For this hands-on, you need two (2) AWS Accounts: (1) a sending account and (2) a recipient account.
  • Perform the following in the recipient account:
    1. Create a Kinesis Data Stream (KDS):
    2. Create a role with the following policies:
      • Create the following trust policy:
      • Create the following permission policy:
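
The exact JSON for those two policies follows the standard cross-account subscription pattern: the trust policy lets the CloudWatch Logs service assume the role, and the permission policy lets the role put records into the stream. A boto3 sketch (role name, policy name, and the stream ARN are placeholders):

    import json
    import boto3

    iam = boto3.client("iam")
    KDS_ARN = "arn:aws:kinesis:us-east-1:RECIPIENT_ACCOUNT_ID:stream/YOUR_STREAM"

    # Trust policy: allow CloudWatch Logs to assume this role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "logs.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

    # Permission policy: allow the role to write into the Kinesis stream.
    permission_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "kinesis:PutRecord",
            "Resource": KDS_ARN,
        }],
    }

    iam.create_role(
        RoleName="CWLtoKinesisRole",
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )
    iam.put_role_policy(
        RoleName="CWLtoKinesisRole",
        PolicyName="CWLtoKinesisPolicy",
        PolicyDocument=json.dumps(permission_policy),
    )
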
    3. Create a CW Log destination:
      • Execute the following from the command line:
        • $ aws logs put-destination --destination-name "mylinuxsite-dea-c01" --target-arn "<kds arn>" --role-arn "<arn of role created in #2>"
        • Take note of the destination arn that will be displayed as part of the output of this command.
      • Create an access policy file containing the following texts.
        • The ‘Resource’ is the destination arn outputted from the previous command.
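
That access policy typically grants the sender account permission to create subscription filters against the destination, and it is attached with put-destination-policy. A boto3 sketch (account IDs are placeholders; the Resource is the destination arn from the previous command):

    import json
    import boto3

    logs = boto3.client("logs", region_name="us-east-1")

    # Allow the sender account to point subscription filters at this destination.
    access_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "SENDER_ACCOUNT_ID"},
            "Action": "logs:PutSubscriptionFilter",
            "Resource": "arn:aws:logs:us-east-1:RECIPIENT_ACCOUNT_ID:destination:mylinuxsite-dea-c01",
        }],
    }

    logs.put_destination_policy(
        destinationName="mylinuxsite-dea-c01",
        accessPolicy=json.dumps(access_policy),
    )
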
    4. Write a simple lambda function that the Kinesis Data Stream triggers. The lambda will read the data and print it out in the CW log.
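
A minimal sketch of that function: records arriving via a CloudWatch Logs subscription are base64-encoded and gzip-compressed, so the handler has to unwrap both layers before printing the log events:

    import base64
    import gzip
    import json

    def lambda_handler(event, context):
        for record in event["Records"]:
            # CloudWatch Logs subscription data: base64 -> gzip -> JSON
            compressed = base64.b64decode(record["kinesis"]["data"])
            payload = json.loads(gzip.decompress(compressed))
            for log_event in payload.get("logEvents", []):
                print(log_event["message"])
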
  • Perform the following in the sender account:
    • Select one of the log groups and add a subscription filter (see the sketch below). In this example, the log group is from a lambda function.
      • Use the destination arn that was created in the previous step in the recipient account.
    • Generate a log in your chosen log group.
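
The subscription filter can also be created from code. A boto3 sketch with placeholder names (no roleArn is needed here because the destination in the recipient account already carries the role):

    import boto3

    logs = boto3.client("logs", region_name="us-east-1")

    logs.put_subscription_filter(
        logGroupName="/aws/lambda/my-sender-function",  # assumed log group
        filterName="to-recipient-kds",
        filterPattern="",  # empty pattern forwards every log event
        destinationArn="arn:aws:logs:us-east-1:RECIPIENT_ACCOUNT_ID:destination:mylinuxsite-dea-c01",
    )
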
  • Check the log of the lambda function created in the previous step in the recipient account. It should output the same text as what was recorded in the sender account lambda log.
2. Dumping Cross-account CW Logs to S3 Using KDF.
  • We will use the same setup we used in Hands-on 1, but we will send the logs to KDF this time.
    1. Create a Kinesis Data Firehose (KDF) with the source as the KDS in Hands-on 1 and the destination as an S3 bucket.
    2. Generate a log in the log group used in Hands-on 1.
    3. Check the destination S3 bucket to see whether a file was created. If you did not modify the default KDF buffer interval, you may need to wait up to 5 minutes.
