I am using Kafka with Spark Streaming and have a topic with 20 partitions. When the streaming job runs, only a single consumer reads from all the topic partitions, which slows down data ingestion. Is there a way to configure one consumer per partition in Spark Streaming?
// Obtain the streaming context and Kafka consumer configuration.
JavaStreamingContext jsc = AnalyticsContext.getInstance().getSparkStreamContext();
Map<String, Object> kafkaParams = MessageSessionFactory.getConsumerConfigParamsMap(
        MessageSessionFactory.DEFAULT_CLUSTER_IDENTITY, consumerGroup);

// The topic string may hold a comma-separated list of topics.
String[] topics = topic.split(",");
Collection<String> topicCollection = Arrays.asList(topics);

// Direct stream: each Kafka partition maps to one Spark partition.
metricStream = KafkaUtils.createDirectStream(
        jsc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.Subscribe(topicCollection, kafkaParams)
);
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
metric_data_spark 16 3379403197 3379436869 33672 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 7 3399030625 3399065857 35232 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 13 3389008901 3389044210 35309 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 17 3380638947 3380639928 981 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 1 3593201424 3593236844 35420 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 8 3394218406 3394252084 33678 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 19 3376897309 3376917998 20689 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 3 3447204634 3447240071 35437 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 18 3375082623 3375083663 1040 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 2 3433294129 3433327970 33841 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 9 3396324976 3396345705 20729 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 0 3582591157 3582624892 33735 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 14 3381779702 3381813477 33775 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 4 3412492002 3412525779 33777 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 11 3393158700 3393179419 20719 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 10 3392216079 3392235071 18992 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 15 3383001380 3383036803 35423 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 6 3398338540 3398372367 33827 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 12 3387738477 3387772279 33802 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 5 3408698217 3408733614 35397 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
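For reference, output like the table above can be produced with Kafka's consumer-group describe command; a minimal sketch, assuming the same group name as in the code (the bootstrap server address is a placeholder):

# Bootstrap server and group name below are placeholders.
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group <consumerGroup>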
What changes do we need to make so that data is read with one consumer per partition?
Answer 0 (score: 0)
Since you are using the PreferConsistent location strategy, the Kafka partitions should be distributed across the executors.

When you run spark-submit, you need to request up to 20 executors, i.e. --num-executors 20 (see the spark-submit sketch below).

However, if you request more than 20, some executors will sit idle and never consume Kafka data (although they can still process other stages).
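A minimal spark-submit sketch, assuming the job is packaged as streaming-job.jar with main class com.example.MetricStreamJob and runs on YARN (the class name, jar name, master and memory setting are all placeholders); the key point is that --num-executors matches the topic's 20 partitions:

# Class name, jar name, master and memory settings below are placeholders;
# --num-executors 20 matches the topic's 20 partitions.
spark-submit \
  --class com.example.MetricStreamJob \
  --master yarn \
  --num-executors 20 \
  --executor-cores 1 \
  --executor-memory 2g \
  streaming-job.jar

With 20 single-core executors and the PreferConsistent strategy, each executor should end up reading from roughly one Kafka partition in parallel.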