Spark Streaming takes a long time to read from Kafka

Date: 2018-10-19 13:41:17

Tags: apache-spark hadoop apache-kafka bigdata apache-zookeeper

I built a cluster with CDH 5.14.2, consisting of 5 nodes, each with 130 GB of memory and 40 CPU cores. I built a Spark Streaming application that reads from multiple Kafka topics (about 10 of them), aggregates the messages from each topic separately, and finally saves the Kafka offsets to ZooKeeper. I found that the Spark tasks take a very long time to process the Kafka messages. The messages are not skewed across partitions, and the slow part is the stage where Spark reads from Kafka.
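For reference, the direct-stream integration (spark-streaming-kafka-0-10) exposes a few consumer-side settings that affect how long the read stage takes; the values below are illustrative placeholders, not something measured on this cluster:

```properties
# spark-defaults.conf (sketch; values are illustrative)
# Reuse cached Kafka consumers on executors instead of recreating them per batch
spark.streaming.kafka.consumer.cache.enabled=true
# How long an executor consumer waits on poll() before failing
spark.streaming.kafka.consumer.poll.ms=10000
# Cap records read per partition per second to keep batches bounded
spark.streaming.kafka.maxRatePerPartition=10000
# Let Spark adapt the ingestion rate to processing speed
spark.streaming.backpressure.enabled=true
```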

A simplified version of my code:

// build input streams from kafka topics
JavaInputDStream<ConsumerRecord<String, String>> stream1 = MyKafkaUtils.
    buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic1, ssc);
JavaInputDStream<ConsumerRecord<String, String>> stream2 = MyKafkaUtils.
    buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic2, ssc);
JavaInputDStream<ConsumerRecord<String, String>> stream3 = MyKafkaUtils.
    buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic3, ssc);
...

// aggregate kafka messages using spark sql
result1 = process(stream1);
result2 = process(stream2);
result3 = process(stream3);
...

// write results back to kafka
writeToKafka(result1);
writeToKafka(result2);
writeToKafka(result3);

// save offsets to zookeeper
saveOffset(stream1);
saveOffset(stream2);
saveOffset(stream3);
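For context, here is a minimal sketch of what the `MyKafkaUtils.buildInputStream` and `saveOffset` helpers might look like on top of the spark-streaming-kafka-0-10 direct API. Only the helper names come from the question; the bodies, the broker address, and the `writeToZookeeper` helper are my assumptions:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class MyKafkaUtils {

    // Build a direct stream for one topic (broker list is a placeholder)
    public static JavaInputDStream<ConsumerRecord<String, String>> buildInputStream(
            String groupId, String topic, JavaStreamingContext ssc) {
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", groupId);
        kafkaParams.put("auto.offset.reset", "latest");
        // offsets are committed manually to ZooKeeper, so disable auto-commit
        kafkaParams.put("enable.auto.commit", false);
        return KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(Collections.singleton(topic), kafkaParams));
    }

    // Capture the offset ranges of each batch so they can be persisted
    public static void saveOffset(JavaInputDStream<ConsumerRecord<String, String>> stream) {
        stream.foreachRDD(rdd -> {
            OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            for (OffsetRange r : ranges) {
                // writeToZookeeper(r.topic(), r.partition(), r.untilOffset()); // hypothetical helper
            }
        });
    }
}
```

One design note: each stream above uses the same consumer group id (`KafkaConfig.kafkaFlowGrouppId` in the question), so all ten topics share one group in Kafka.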

Spark web UI info: (screenshot of stage timings, not reproduced here)

0 Answers:

No answers yet