We are using KafkaUtils from Spark Streaming Kafka 0.10 to create a KafkaRDD and read from a single Kafka topic. However, we are only seeing a read rate of around 9,000 events/sec, and each of my messages is roughly 10-20 KB. Is this normal?
My loadRdd function is shown below. I use coalesce() with shuffle = true to separate the stages, and I can see that the estimated time for reading from Kafka is usually where the cost goes. (I know this estimate includes the shuffle time, but our network is very fast and the shuffle usually takes only a few seconds.) The job reads about 250k logs with 16 cores in total and 32 GB of executor memory, and the estimated time is about 30 seconds.
private def loadRdd[T: ClassTag](maxMessages: Long = 0, messageFormatter: ((String, String)) => T)
                                (implicit inputConfig: Config): (RDD[T], Unit => Unit, Boolean) = {
  val brokersConnectionString = Try(inputConfig.getString("brokersConnectionString"))
    .getOrElse(throw new RuntimeException("Fail to retrieve the broker connection string."))
  val topic = inputConfig.getString("topic")
  val groupId = inputConfig.getString("groupId")
  val retriesAttempts = Try(inputConfig.getInt("retries.attempts"))
    .getOrElse(SparkKafkaProviderUtilsFunctions.DEFAULT_RETRY_ATTEMPTS)
  val retriesDelay = Try(inputConfig.getInt("retries.delay"))
    .getOrElse(SparkKafkaProviderUtilsFunctions.DEFAULT_RETRY_DELAY) * 1000

  // One OffsetRange per Kafka partition of the topic: (minOffset, maxOffset) per partition.
  val topicOffsetRanges = KafkaClusterUtils
    .getTopicOffsetRanges(inputConfig, topic, SparkKafkaProviderUtilsFunctions.getDebugLogger(inputConfig))
    .toList
    .map { case (partitionId, (minOffset, maxOffset)) => OffsetRange(topic, partitionId, minOffset, maxOffset) }
    .toArray

  // Optionally cap the number of messages read in this run.
  val (offsetRanges, readAllAvailableMessages) = restrictOffsetRanges(topicOffsetRanges, maxMessages)

  val rdd: RDD[ConsumerRecord[String, String]] = RetryUtils.retryOrDie(
    retriesAttempts,
    retryDelay = retriesDelay,
    loopFn = { SparkLogger.warn("Failed to create Spark RDD, retrying...") },
    failureFn = { SparkLogger.warn("Failed to create Spark RDD, giving up...") })(
    KafkaUtils.createRDD(sc, KafkaClusterUtils.getKafkaConsumerParameters(brokersConnectionString, groupId),
      offsetRanges, LocationStrategies.PreferConsistent))

  (rdd.map(pair => messageFormatter((pair.key(), pair.value()))),
    Unit => commitOffsets(offsetRanges, inputConfig),
    readAllAvailableMessages)
}
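For context on the parallelism concern at the end of this post: in spark-streaming-kafka-0-10, KafkaUtils.createRDD builds exactly one Spark partition per OffsetRange, so the read stage can run at most one task per Kafka topic partition, no matter how many cores the job has. A minimal diagnostic that could be dropped into loadRdd right after the RDD is created (the log wording is just illustrative, reusing SparkLogger from the snippet above):

// Confirms how many parallel read tasks the Kafka stage can actually use.
SparkLogger.warn(
  s"Kafka read parallelism: ${offsetRanges.length} offset ranges -> " +
  s"${rdd.getNumPartitions} RDD partitions, " +
  s"${offsetRanges.map(_.count()).sum} messages total")

If the topic has fewer than 16 partitions, some of the 16 cores sit idle during the read, which by itself caps throughput.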
My code that triggers the RDD load is below; please ignore the parameter mismatches, since there are wrapper layers in between.
private def loadRawClientLogsFromKafka(inputConfig: Config,
                                       logFilter: DataMap => Boolean = { b => true },
                                       groupedCountAccumulator: Option[Accumulator[Long]] = None,
                                       flattenedCountAccumulator: Option[Accumulator[Long]] = None,
                                       invalidLogAccumulator: Option[Accumulator[Long]] = None):
    (RDD[DataMap], Unit => Unit, DateTime) = {
  val maxRecordPerRun = inputConfig.getLong("maxRecordPerRun")
  val startReadTime = DateTime.now
  val (kafkaRdd, commitFunction, readAllAvailableMessages) = sc.loadRdd(maxRecordPerRun)(inputConfig)
  // coalesce with shuffle = true to the same partition count, to force a stage boundary
  // between the Kafka read and the downstream processing
  val coalescedClientLogsRdd =
    if (kafkaRdd.partitions.length > 0) kafkaRdd.coalesce(kafkaRdd.partitions.length, shuffle = true)
    else kafkaRdd
  // ... (the rest of the function is omitted here)
}
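One note on the coalesce call above (a sketch of an equivalent formulation, not a claim about what the wrapper intends): coalesce(n, shuffle = true) with n equal to the current partition count is exactly what repartition(n) does, so it introduces a shuffle and a stage boundary but does not add parallelism to the upstream Kafka read, which stays at one task per offset range.

// Equivalent to the coalesce(shuffle = true) line above: repartition(n) is defined in Spark
// as coalesce(n, shuffle = true). The Kafka read stage itself is unaffected by either call.
val repartitionedClientLogsRdd =
  if (kafkaRdd.getNumPartitions > 0) kafkaRdd.repartition(kafkaRdd.getNumPartitions)
  else kafkaRdd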
It feels like Spark is not getting enough parallelism when reading from Kafka. Is there any way to optimize this?
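Since the ~30 s estimate mixes the read and the shuffle, one way to separate them (a measurement sketch only, assuming it is acceptable to read the batch twice in a run) is to force an action on kafkaRdd before the coalesce and time it:

// Times the read in isolation; kafkaRdd here is the mapped RDD returned by loadRdd, so count()
// forces the actual fetch from Kafka (on a raw KafkaRDD, count() is answered from the offset
// ranges without reading). No shuffle is involved in this count().
val readStart = System.nanoTime()
val fetched = kafkaRdd.count()
val readSeconds = (System.nanoTime() - readStart) / 1e9
SparkLogger.warn(f"Read $fetched%d records from Kafka in $readSeconds%.1f s")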