Spark Streaming with Kafka: only one consumer is reading data

Date: 2018-11-14 13:16:18

Tags: java apache-spark apache-kafka spark-streaming

I am using Spark Streaming with Kafka, and I have a topic with 20 partitions. When the streaming job runs, only one consumer reads from all of the partitions, which slows down the rate at which data is read. Is there a way to configure one consumer per partition in Spark Streaming?

// Obtain the streaming context and Kafka consumer properties from the application helpers
JavaStreamingContext jsc = AnalyticsContext.getInstance().getSparkStreamContext();
Map<String, Object> kafkaParams = MessageSessionFactory.getConsumerConfigParamsMap(
        MessageSessionFactory.DEFAULT_CLUSTER_IDENTITY, consumerGroup);

// Subscribe to every topic in the comma-separated list
String[] topics = topic.split(",");
Collection<String> topicCollection = Arrays.asList(topics);

// Create a direct stream; partitions are distributed consistently across executors
metricStream = KafkaUtils.createDirectStream(
        jsc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.Subscribe(topicCollection, kafkaParams)
);

TOPIC             PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                     HOST            CLIENT-ID
metric_data_spark 16         3379403197      3379436869      33672           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 7          3399030625      3399065857      35232           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 13         3389008901      3389044210      35309           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 17         3380638947      3380639928      981             consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 1          3593201424      3593236844      35420           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 8          3394218406      3394252084      33678           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 19         3376897309      3376917998      20689           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 3          3447204634      3447240071      35437           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 18         3375082623      3375083663      1040            consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 2          3433294129      3433327970      33841           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 9          3396324976      3396345705      20729           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 0          3582591157      3582624892      33735           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 14         3381779702      3381813477      33775           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 4          3412492002      3412525779      33777           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 11         3393158700      3393179419      20719           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 10         3392216079      3392235071      18992           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 15         3383001380      3383036803      35423           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 6          3398338540      3398372367      33827           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 12         3387738477      3387772279      33802           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
metric_data_spark 5          3408698217      3408733614      35397           consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx    consumer-2
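
For reference, a per-partition lag listing like the one above is typically produced with Kafka's standard consumer-group tool; the broker address and group name below are placeholders:

kafka-consumer-groups.sh --bootstrap-server <broker-host>:9092 --describe --group <consumerGroup>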

What changes do we need to make so that each partition is read by its own consumer?

1 answer:

Answer 0 (score: 0)

Since you are using the PreferConsistent location strategy, the partitions should be distributed across the executors.

When you run spark-submit, you need to specify that up to 20 executors should be started: --num-executors 20
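
A minimal spark-submit sketch of that suggestion; the main class, jar name, master, and resource sizes are placeholders and would need to match your deployment:

# class name, jar, master, and resource sizes below are placeholders
spark-submit \
  --class com.example.MetricStreamingJob \
  --master yarn \
  --num-executors 20 \
  --executor-cores 1 \
  --executor-memory 2g \
  metric-streaming-job.jar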

However, if you request more executors than that, some of them will sit idle and consume no Kafka data (although they can still process other stages).
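
As a quick sanity check of the resulting parallelism, you could count the Spark partitions of each micro-batch on the stream created in the question (a sketch, assuming metricStream is the JavaInputDStream from the code above):

metricStream.foreachRDD(rdd -> {
    // With the direct stream, each Kafka partition becomes one Spark partition,
    // so this should report 20 once all of the topic's partitions are assigned.
    System.out.println("Spark partitions in this batch: " + rdd.getNumPartitions());
});

Because of that one-to-one mapping, the read parallelism is bounded by the number of topic partitions; spreading those 20 partitions over up to 20 executors is what lets them be consumed in parallel.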