AWS configuration for Apache Flink using EMR

Time: 2017-08-11 08:38:14

Tags: apache-flink emr amazon-emr amazon-kinesis flink-streaming

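I have a producer application that writes to a Kinesis stream at a rate of 600 records per second. I have written an Apache Flink application that reads, processes, and aggregates this stream data and writes the aggregated output to AWS Redshift.

The average size of each record is 2 KB. The application will run 24/7.

I would like to know what the configuration of my AWS EMR cluster should be. How many nodes do I need? Which EC2 instance type (R3 or C3) should I use?

Apart from performance, cost is also important to us.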

1 answer:

Answer 0: (score: 1)

Whether to choose r3 or c3 depends on which resources your application uses most.

I assume that you are using windowing or some other stateful operator to perform the aggregation. A stateful operator keeps its state in the configured state backend: https://ci.apache.org/projects/flink/flink-docs-release-1.3/ops/state_backends.html#state-backends

So you can first check whether the state fits in memory (if you intend to use FsStateBackend) by running your application on c3 instances. You can check memory utilization with JVisualVM. Check the CPU utilization during the same run as well.
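Before profiling, a rough estimate from the figures in the question (600 records/s, 2 KB each) can indicate the order of magnitude involved. The sketch below assumes the worst case, in which every raw record is buffered until the window fires. The 5-minute window length is a hypothetical value chosen for illustration and is not stated in the question; a window that aggregates incrementally would hold far less.

```python
# Back-of-the-envelope estimate of ingest rate and buffered window data.
# Figures from the question: 600 records/s, 2 KB per record.
# WINDOW_SECONDS is a hypothetical value chosen only for illustration.

RECORDS_PER_SECOND = 600
RECORD_SIZE_KB = 2
WINDOW_SECONDS = 5 * 60  # assumption: 5-minute window


def ingest_rate_kb_per_s(records_per_second, record_size_kb):
    """Raw ingest rate in KB/s."""
    return records_per_second * record_size_kb


def buffered_window_data_mb(records_per_second, record_size_kb, window_seconds):
    """Raw data volume held if every record is kept until the window fires.

    Ignores JVM object overhead, so the real heap footprint will be larger.
    """
    rate_kb = ingest_rate_kb_per_s(records_per_second, record_size_kb)
    return rate_kb * window_seconds / 1024


rate_kb = ingest_rate_kb_per_s(RECORDS_PER_SECOND, RECORD_SIZE_KB)
state_mb = buffered_window_data_mb(RECORDS_PER_SECOND, RECORD_SIZE_KB, WINDOW_SECONDS)
print(f"ingest rate: {rate_kb} KB/s")
print(f"raw data buffered per window: {state_mb:.1f} MB")
```

This only counts raw record bytes. Objects on the heap usually take more space than that, so the number is a starting point to confirm with measurements from the test run, not a sizing answer.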

With r3 instances, you get more memory for the same number of vCPUs that c3 provides. For example, a c3.4xlarge instance provides 16 vCPUs with 30 GB of memory per node, whereas an r3.4xlarge provides 16 vCPUs with 122 GB of memory per node.
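To make the difference concrete, the instance figures quoted above can be expressed as memory per vCPU. This is only an illustrative calculation; the dictionary layout and function name are made up for this sketch.

```python
# Memory-per-vCPU comparison using the instance figures quoted in this answer.
INSTANCES = {
    "c3.4xlarge": {"vcpu": 16, "memory_gb": 30},
    "r3.4xlarge": {"vcpu": 16, "memory_gb": 122},
}


def memory_per_vcpu(spec):
    """Return GB of memory available per vCPU for one instance spec."""
    return spec["memory_gb"] / spec["vcpu"]


for name, spec in INSTANCES.items():
    print(f"{name}: {memory_per_vcpu(spec):.3f} GB per vCPU")
```

If the test run shows the job is limited by CPU while memory stays mostly free, the lower ratio of c3 is sufficient. If state size is the limit, the higher ratio of r3 is what you are paying for.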

So the right instance type depends on whether your application is bound by memory or by CPU.

For a price comparison, you can refer to http://www.ec2instances.info/