AWS Kinesis optimal shards and cost estimation
AWS Kinesis optimal shard count and cost estimation, with a hands-on demo using long-running producer and consumer tasks
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data at any scale at the most optimal costs. It supports real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for various applications.
Calculating the optimal number of shards is important for improving the efficiency and lowering the cost of a data stream.
We are going to use a simple producer and consumer process, using the code base in the Kinesis shard estimation repo.
Producer
Generates random characters and puts them into the stream as records.
Consumer
Gets batches of records, searches each record for the search pattern, and prints the match locations to the terminal.
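The search step that the consumer performs can be illustrated without touching AWS at all. The following is a minimal sketch, not the repo's actual code: the function names make_record and find_pattern are hypothetical, and the pattern "egg" is an assumption inferred from the "egg location" lines in the consumer output.

```python
import random
import string


def make_record(length, rng):
    """Return a string of random lowercase letters (stand-in for a producer payload)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def find_pattern(data, pattern="egg"):
    """Return the start index of every match of pattern in data, overlaps included."""
    locations = []
    start = data.find(pattern)
    while start != -1:
        locations.append(start)
        # Resume one character later so overlapping matches are also found.
        start = data.find(pattern, start + 1)
    return locations


if __name__ == "__main__":
    rng = random.Random()
    record = make_record(2000, rng)
    print("egg location:", find_pattern(record))
```

In the real scripts the record would be fetched from the stream rather than generated locally; only the search logic is sketched here.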
Now install the boto Python package, which the scripts use to interact with AWS:
pip install boto
Then start the long-running tasks.
Long running tasks
nohup python producer.py test --shard_count 1 --poster_count 50 --poster_time 34560 --quiet &
nohup python worker.py test --sleep_interval 0.1 --worker_time 34560 > 01consumer.out 2> 01worker.err < /dev/null &
With this setup in place, the consumer starts reporting the patterns it finds in the data:
+-> shard_worker:0 Got 25 Worker Records
+--> egg location: [797, 1893] <--+
+--> egg location: [1113] <--+
With this basic understanding and hands-on experience, we will now look at a real-world example and perform shard and cost estimation.
Shard estimation
Question
Twenty stock exchange servers each generate 10 records of 250 KB per second. Three trading servers each consume this data at 50,000 KB per second. Estimate the number of shards required for these requirements in AWS Kinesis.
Solution
AWS defines the following formula to calculate the number of shards:
Number_of_shards = max(incoming_write_bandwidth_in_KiB/1024, outgoing_read_bandwidth_in_KiB/2048)
In our case,
incoming_write_bandwidth_in_KiB = average record size in KiB * records per second
= 250 * (20 * 10) = 50000
outgoing_read_bandwidth_in_KiB = incoming_write_bandwidth_in_KiB * number of consumers
= 50000 * 3 = 150000
So,
Number_of_shards = max(50000/1024, 150000/2048)
= max(48.8, 73.2)
= 73.2
Shards are provisioned in whole numbers, so we round up to 74 shards.
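The same arithmetic can be wrapped in a small helper so that other workloads are easy to try. This is a sketch of the formula above, assuming sizes are given in KiB; the function name estimate_shards and the constant names are hypothetical and not part of any AWS SDK.

```python
import math

WRITE_KIB_PER_SHARD = 1024  # write divisor from the formula above (KiB per second)
READ_KIB_PER_SHARD = 2048   # read divisor from the formula above (KiB per second)


def estimate_shards(record_size_kib, records_per_second, consumers):
    """Return the shard count from the bandwidth formula, rounded up to a whole shard."""
    incoming = record_size_kib * records_per_second
    outgoing = incoming * consumers
    raw = max(incoming / WRITE_KIB_PER_SHARD, outgoing / READ_KIB_PER_SHARD)
    return math.ceil(raw)


if __name__ == "__main__":
    # 20 servers * 10 records per second, 250 KiB each, read by 3 consumers.
    print(estimate_shards(250, 20 * 10, 3))  # prints 74
```

With a single consumer the write side dominates instead, and the same helper returns 49 shards.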
Cost estimation
Total number of shards = 74
Hours in a month = 730
74 shards x 730 hours in a month = 54,020 shard hours per month
54,020 shard hours per month x 0.015 USD per shard hour = 810.30 USD
Shard hours per month cost: 810.30 USD
Additional costs may apply if features such as extended data retention or enhanced fan-out are used.
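The shard-hour arithmetic can be sketched as a short function. The rate of 0.015 USD per shard hour and the 730 hours per month are taken from the calculation above and should be treated as assumptions, since pricing varies by region and over time; the function name monthly_shard_cost is hypothetical. Decimal is used to avoid floating-point rounding in the currency amount.

```python
from decimal import Decimal

HOURS_PER_MONTH = 730               # assumption, as used in the calculation above
SHARD_HOUR_USD = Decimal("0.015")   # assumption, check current pricing for your region


def monthly_shard_cost(shards, hours=HOURS_PER_MONTH, rate=SHARD_HOUR_USD):
    """Return (shard hours per month, shard-hour cost in USD) for a shard count."""
    shard_hours = shards * hours
    return shard_hours, shard_hours * rate


if __name__ == "__main__":
    shard_hours, cost = monthly_shard_cost(74)
    print(f"{shard_hours} shard hours -> {cost:.2f} USD")  # 54020 shard hours -> 810.30 USD
```

This covers shard hours only; any other charges would need to be added separately.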