It helps you consume and process data from a Kinesis data stream by taking care of many of the complex tasks associated with distributed computing.
It provides a layer of abstraction around the KDS API (AWS SDK).
Uses the concept of a Lease – data that defines the binding between a worker and a shard.
Uses a Lease table – a unique Amazon DynamoDB table used to keep track of the shards in a KDS data stream that are being leased and processed by the workers of the KCL.
Each KCL application must use its own DynamoDB table.
Shards
Uniquely identified sequence of data records in a stream.
Reads:
Each shard can support up to 5 read transactions per second.
Maximum data read rate of 2 MB per second per consumer in enhanced fan-out mode.
Maximum total data read rate of 2 MB per second, shared among all consumers in shared-throughput mode.
Writes:
Each shard can support up to 1,000 records per second for writes.
Maximum total data write rate of 1 MB per second (including partition keys).
Partition key:
Used to group data by shard within a stream.
It is associated with each data record to determine which shard a given data record belongs to.
An MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards using the hash key ranges of the shards.
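As an illustration (not the service's internal code, but the same MD5 mapping), a minimal Python sketch of how a partition key maps to a 128-bit hash value:

import hashlib

def partition_key_to_hash(partition_key):
    # MD5 yields 16 bytes = 128 bits; interpret the digest as an unsigned integer.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")

# The record lands on the shard whose hash key range
# (StartingHashKey..EndingHashKey) contains this value.
print(partition_key_to_hash("user-42"))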
Sequence Number:
Unique per partition-key within its shard
KDS assigns the sequence number after you write to the stream with client.putRecords or client.putRecord.
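For example, a minimal boto3 sketch (the stream name is hypothetical); the assigned sequence number comes back in the response:

import boto3

kinesis = boto3.client("kinesis")
response = kinesis.put_record(
    StreamName="my-stream",          # hypothetical stream name
    Data=b'{"event": "click"}',
    PartitionKey="user-42",
)
# KDS assigns the sequence number at write time.
print(response["ShardId"], response["SequenceNumber"])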
Capacity Mode
Determines how the capacity of a data stream is managed and how you are charged for the usage of your data stream.
Two (2) Capacity Modes:
On-demand
It requires no capacity planning and automatically scales to handle gigabytes of write and read throughput per minute.
Automatically manages the shards in order to provide the necessary throughput.
It is ideal for addressing the needs of highly variable and unpredictable application traffic.
You pay per GB of data written and read from your data streams.
Provisioned
You specify the number of shards needed for the data stream.
The total capacity of a data stream is the sum of the capacities of its shards.
You can increase or decrease the number of shards in a data stream as needed.
You pay per shard-hour and per PUT Payload Unit.
Resharding a Stream
It lets you adjust the number of shards in your stream to adapt to changes in the rate of data flow through the stream.
Two (2) types of operations:
Split:
Divide a single shard into two shards.
Increases the number of shards in your stream and therefore increases the data capacity of the stream.
Increases the cost of your stream.
Merge:
Combine two shards into a single shard.
Reduces the number of shards in your stream and therefore decreases the data capacity.
Reduces the cost of your stream.
Resharding is always pairwise: you cannot split into more than two shards in a single operation, and you cannot merge more than two shards in a single operation.
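A minimal boto3 sketch of both pairwise operations (stream and shard IDs are hypothetical):

import boto3

kinesis = boto3.client("kinesis")

# Split: choose a NewStartingHashKey inside the parent shard's hash key range;
# the value below is the midpoint (2**127) of the full 128-bit range.
kinesis.split_shard(
    StreamName="my-stream",
    ShardToSplit="shardId-000000000000",
    NewStartingHashKey="170141183460469231731687303715884105728",
)

# Merge: the two shards must be adjacent (their hash key ranges must be contiguous).
kinesis.merge_shards(
    StreamName="my-stream",
    ShardToMerge="shardId-000000000001",
    AdjacentShardToMerge="shardId-000000000002",
)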
Kinesis Data Firehose (KDF)
Overview
A managed service for delivering real-time streaming data to a number of destinations.
Sources:
Kinesis Agent
AWS SDK
CW Logs
CW Events
AWS IoT
KDS
More than one Firehose stream can read from the same Kinesis stream.
When KDS is the source, Firehose’s PutRecord and PutRecordBatch operations are disabled, so an agent cannot simultaneously write directly to the Firehose stream.
Destinations:
S3
Redshift
Amazon OpenSearch Service
Splunk
Snowflake
Custom HTTP Endpoint
Can convert data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3
Near real-time delivery.
In rare circumstances, such as a request timeout upon a data delivery attempt, a delivery retry by Firehose could introduce duplicates if the previous request eventually goes through.
Supports compression for the S3 destination.
Automatic Scaling
You pay for the volume of data you ingest into the service
It can invoke a Lambda function to transform incoming source data; several blueprints are available. Only an invocation time of up to 5 minutes is supported (see the sketch after this list).
IAM policies are used to grant access to KDF.
Buffers incoming streaming data to a certain size and for a certain period of time before delivering it to the specified destinations.
Buffering Hints:
buffer size – measured in MBs (1 MB to 128 MB)
buffer interval – measured in seconds (60 to 900 seconds)
When a Firehose delivery stream scales, it can affect the buffering hints: Firehose may adjust buffer sizes dynamically, so treat the hints as approximate rather than strict limits.
Data payload size limit is 1 MB per record (before base64 encoding).
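As referenced above, a minimal sketch of a transformation Lambda, following the record contract used by the Firehose blueprints (recordId / result / data); the transformation itself is a placeholder:

import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()  # placeholder transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}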
Data Delivery
S3
concatenates multiple incoming records based on the buffering configuration
delivers the records to Amazon S3 as an Amazon S3 object
any Lambda transformation failures will also be written to an S3 bucket
A single Firehose stream can currently only deliver data to one Amazon S3 bucket
Redshift
first, it delivers incoming data to your S3 bucket
Then, an Amazon Redshift COPY command will be issued to load the data from your S3 bucket.
Need to unblock the Firehose IP addresses in the Redshift cluster's security group if Redshift is in a VPC
OpenSearch
buffers incoming records based on the buffering configuration
Then, it generates an OpenSearch Service or OpenSearch Serverless bulk request to index multiple records to your OpenSearch service cluster.
Firehose stream and destination Amazon OpenSearch Service domain need to be in the same region
Splunk
concatenates the bytes that were sent
HTTP endpoint
can use the integrated AWS Lambda service to create a function that transforms the incoming record(s) into the format the service provider’s integration expects
Snowflake
internally buffers data for one second and uses Snowflake streaming API operations to insert data to Snowflake
can only deliver data to one Snowflake table
Kinesis Data Analytics (KDA)
Overview
Two (2) types of service offered:
KDA for SQL Application (EOL 2026)
Managed Service for Apache Flink
Managed Service for Apache Flink
Process data streams in real time with SQL or Apache Flink.
Supports Java, Python, and Scala.
Applications primarily use either the DataStream API or the Table API.
Connectors:
Software components that move data into and out of an Amazon Managed Service for Apache Flink application
Sample connectors:
KDS
KDF
Kafka
OpenSearch
DynamoDB
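To illustrate the Table API with the Kinesis connector above, a minimal PyFlink sketch (stream name, region, and schema are hypothetical; the Kinesis connector dependency must be on the application classpath):

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a KDS-backed source table via the Kinesis connector.
t_env.execute_sql("""
    CREATE TABLE input_stream (
        ticker VARCHAR(6),
        price DOUBLE
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'my-input-stream',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    )
""")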
KDA SQL
Supports standard SQL statements such as SELECT, INSERT, and CREATE.
It provides a number of functions, such as:
Aggregate
Analytic
Date & Time
Statistical Variance and Deviation e.g.
RANDOM_CUT_FOREST
Detects anomalies in your data stream. A record is an anomaly if it is distant from other records.
You pay mainly by KPU (Kinesis Processing Unit) per hour.
Hands-On
1. Cross-account streaming of CW Logs using KDS
For this hands-on, you need two (2) AWS Accounts: (1) a sending account and (2) a recipient account.
Perform the following in the recipient account:
Create a Kinesis Data Stream (KDS):
Create a role with the following policies:
Create the following trust policy:
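For reference, the trust policy for a CloudWatch Logs destination typically looks like this sketch, allowing the CW Logs service to assume the role:

{
  "Statement": {
    "Effect": "Allow",
    "Principal": { "Service": "logs.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }
}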
Create the following permission policy:
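And a permission policy sketch granting the role write access to the stream (use the ARN of the KDS created in #1):

{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "kinesis:PutRecord",
      "Resource": "<kds arn>"
    }
  ]
}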
Create a CW Log destination:
Execute the following from the command line:
$ aws logs put-destination --destination-name "mylinuxsite-dea-c01" --target-arn "<kds arn>" --role-arn "<arn of role created in #2>"
Take note of the destination arn that will be displayed as part of the output of this command.
Create an access policy file containing the following texts.
The ‘Resource’ is the destination arn outputted from the previous command.
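A sketch of such an access policy, allowing the sending account to attach a subscription filter to the destination, followed by the command to apply it (the file name is hypothetical):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<sender account id>" },
      "Action": "logs:PutSubscriptionFilter",
      "Resource": "<destination arn>"
    }
  ]
}

$ aws logs put-destination-policy --destination-name "mylinuxsite-dea-c01" --access-policy file://accesspolicy.json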
Write a simple Lambda function that the Kinesis Data Stream triggers. The Lambda will read the data and print it out in the CW log.
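A minimal sketch of such a function: CW Logs data arriving through a subscription is base64-encoded inside the Kinesis record and gzip-compressed, so it must be unpacked before printing:

import base64
import gzip
import json

def lambda_handler(event, context):
    for record in event["Records"]:
        compressed = base64.b64decode(record["kinesis"]["data"])
        payload = json.loads(gzip.decompress(compressed))
        # payload includes logGroup, logStream, and logEvents
        for log_event in payload.get("logEvents", []):
            print(log_event["message"])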
Perform the following in the sender account:
Select one of the log groups and add a subscription filter. In this example, the log group is from a lambda function.
Use the destination arn that was created in the previous step.
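For example, from the command line (the filter name is hypothetical; an empty filter pattern matches all log events):

$ aws logs put-subscription-filter --log-group-name "<log group name>" --filter-name "mylinuxsite-filter" --filter-pattern "" --destination-arn "<destination arn>"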
Generate a log in your chosen log group.
Check the Lambda function log created in the previous step in the recipient account. It should output the same text as what was recorded in the sender account Lambda log.
2. Dumping Cross-account CW Logs to S3 Using KDF.
We will use the same setup we used in Hands-on 1, but we will send the logs to KDF this time.
Create a Kinesis Data Firehose (KDF) with the source as the KDS in Hands-on 1 and the destination as an S3 bucket.
Generate a log in the log group used in Hands-on 1.
Check the destination S3 bucket to see whether a file has been created. If you did not modify the KDF buffer interval default value (300 seconds), you may need to wait up to 5 minutes.