Spark read of a Kafka RDD is very slow

Date: 2018-06-07 19:33:12

Tags: scala performance apache-spark apache-kafka scale

We use KafkaUtils from Spark Streaming Kafka 0.10 to create a KafkaRDD and read from a single Kafka topic. However, we are only seeing a read rate of around 9,000 events/second, with each message around 10–20 KB. Is this normal?

My loadRdd function is shown below. I use coalesce() with shuffle = true to separate the stages, so the estimated time I see is essentially the cost of reading from Kafka. (I know this also includes the shuffle time, but our network is very fast and the shuffle usually takes only a few seconds.) The job reads about 250k logs with 16 cores in total and 32 GB of executor memory, and the estimated time is about 30 seconds.

private def loadRdd[T:ClassTag](maxMessages: Long = 0, messageFormatter: ((String, String)) => T)
                             (implicit inputConfig: Config): (RDD[T], Unit => Unit, Boolean) = {
    val brokersConnectionString = Try(inputConfig.getString("brokersConnectionString")).getOrElse(throw new RuntimeException("Fail to retrieve the broker connection string."))
    val topic                   = inputConfig.getString("topic")
    val groupId                 = inputConfig.getString("groupId")
    val retriesAttempts         = Try(inputConfig.getInt("retries.attempts")).getOrElse(SparkKafkaProviderUtilsFunctions.DEFAULT_RETRY_ATTEMPTS)
    val retriesDelay            = Try(inputConfig.getInt("retries.delay")).getOrElse(SparkKafkaProviderUtilsFunctions.DEFAULT_RETRY_DELAY) * 1000

    val topicOffsetRanges = KafkaClusterUtils.getTopicOffsetRanges(inputConfig, topic, SparkKafkaProviderUtilsFunctions.getDebugLogger(inputConfig)).toList
                                         .map { case (partitionId, (minOffset, maxOffset)) => OffsetRange(topic, partitionId, minOffset, maxOffset) }
                                         .toArray

    val (offsetRanges, readAllAvailableMessages) = restrictOffsetRanges(topicOffsetRanges, maxMessages)

    val rdd: RDD[ConsumerRecord[String, String]] = RetryUtils.retryOrDie(retriesAttempts, retryDelay = retriesDelay, loopFn = {SparkLogger.warn("Failed to create Spark RDD, retrying...")},
                                failureFn = { SparkLogger.warn("Failed to create Spark RDD, giving up...")})(
      KafkaUtils.createRDD(sc, KafkaClusterUtils.getKafkaConsumerParameters(brokersConnectionString, groupId), offsetRanges, LocationStrategies.PreferConsistent))

    (rdd.map(pair => messageFormatter((pair.key(), pair.value()))), Unit => commitOffsets(offsetRanges, inputConfig), readAllAvailableMessages)

}
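
For context on where the parallelism comes from: with the 0.10 createRDD API, each OffsetRange in the array becomes exactly one partition of the resulting KafkaRDD, so the read parallelism is bounded by the number of topic partitions rather than by the number of available cores. Below is a minimal, self-contained sketch of that usage; the broker, topic and group names are placeholders for illustration, not our real configuration.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}
import scala.collection.JavaConverters._

object CreateRddSketch {
  // Broker/topic/group names below are placeholders, not the real configuration.
  def readOnce(sc: SparkContext): Unit = {
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group"
    ).asJava

    // One OffsetRange per topic partition; each range becomes exactly one
    // partition of the resulting KafkaRDD, so read parallelism is capped at
    // offsetRanges.length no matter how many cores are available.
    val offsetRanges = Array(
      OffsetRange("example-topic", partition = 0, fromOffset = 0L, untilOffset = 125000L),
      OffsetRange("example-topic", partition = 1, fromOffset = 0L, untilOffset = 125000L)
    )

    val rdd = KafkaUtils.createRDD[String, String](
      sc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent)

    println(s"partitions = ${rdd.partitions.length}, records = ${rdd.count()}")
  }
}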

Here is the code that triggers the RDD load. Please ignore the parameter mismatch between the two functions; there is a wrapper in between.

private def loadRawClientLogsFromKafka(inputConfig: Config, logFilter: DataMap => Boolean = { b => true }, groupedCountAccumulator: Option[Accumulator[Long]] = None,
                                     flattenedCountAccumulator: Option[Accumulator[Long]] = None, invalidLogAccumulator: Option[Accumulator[Long]] = None):
                                     (RDD[DataMap], Unit => Unit, DateTime) = {
    val maxRecordPerRun = inputConfig.getLong("maxRecordPerRun")
    val startReadTime = DateTime.now
    val (kafkaRdd, commitFunction, readAllAvailableMessages) = sc.loadRdd(maxRecordPerRun)(inputConfig)
    val coalecedClientLogsRdd = if (kafkaRdd.partitions.length > 0) kafkaRdd.coalesce(kafkaRdd.partitions.length, shuffle = true) else kafkaRdd

}
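
For reference, the throughput number can also be checked independently of the Spark UI by timing the first action on the returned RDD. The helper below is a hypothetical sketch, not part of our actual code; it simply forces the read stage (plus the shuffle introduced by coalesce(shuffle = true)) and reports records per second.

import org.apache.spark.rdd.RDD

// Hypothetical helper: time an action that forces the Kafka read
// (plus the shuffle introduced by coalesce(shuffle = true)).
def timeRead[T](rdd: RDD[T]): (Long, Double) = {
  val start   = System.nanoTime()
  val count   = rdd.count()                          // forces the read stage
  val seconds = (System.nanoTime() - start) / 1e9
  println(f"$count%d records in $seconds%.1f s (~${count / seconds}%.0f records/s)")
  (count, seconds)
}

// e.g. timeRead(coalecedClientLogsRdd)  // roughly 250k records in ~30 s in our case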

It looks to me as if Spark does not get enough parallelism when reading from Kafka. Is there any way to optimize this?

0 Answers:

There are no answers yet.