Why do I see periodic pulses in the processing-time graph when using mapWithState / checkpoint in Spark Streaming?

Asked: 2016-12-05 05:22:07

Tags: scala apache-spark spark-streaming checkpoint

I wrote a stateful word-count Spark Streaming application that continuously receives data from Kafka. My code uses the mapWithState function and runs fine. But when I look at the streaming statistics on the Spark UI, I see periodic pulses in the processing-time chart. I suspect this is caused by checkpointing. I hope someone can explain it, thanks a lot!

[Screenshot: The Streaming Statistics]

And the table of completed batches:

[Screenshot: batches processing time]

I noticed that batches taking about 1 second occur periodically. I then drilled into one 1-second batch and one sub-second batch, and found that the 1-second batch does more work than the other.

Comparing the two kinds of batches: [Screenshots: 1-second-time-cost batch / subsecond-time-cost batch]

This seems to be caused by checkpointing, but I'm not sure.

Can anyone explain this in detail for me? Thanks!

Here is my code:

import kafka.serializer.StringDecoder 
import org.apache.spark.streaming._ 
import org.apache.spark.streaming.kafka._ 
import org.apache.spark.SparkConf 

object StateApp {

  def main(args: Array[String]) {

    if (args.length < 4) {
      System.err.println(
        s"""
           |Usage: KafkaSpark_008_test <brokers> <topics> <batchDuration> <checkpointPath>
           |  <brokers> is a list of one or more Kafka brokers
           |  <topics> is a list of one or more kafka topics to consume from
           |  <batchDuration> is the batch duration of spark streaming
           |  <checkpointPath> is the checkpoint directory
        """.stripMargin)
      System.exit(1)
    }

    val Array(brokers, topics, bd, cpp) = args

    // Create context with the batch interval given by <batchDuration>
    val sparkConf = new SparkConf().setAppName("KafkaSpark_080_test")
    val ssc = new StreamingContext(sparkConf, Seconds(bd.toInt))

    ssc.checkpoint(cpp)

    // Create direct kafka stream with brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    // test the messages' receiving speed
    messages.foreachRDD(rdd =>
      println(System.currentTimeMillis() + "\t" + System.currentTimeMillis() / 1000 + "\t" + (rdd.count() / bd.toInt).toString))

    // the messages' value type is "timestamp port word", eg. "1479700000000 10105 ABC"
    // wordDstream: (word, 1), eg. (ABC, 1)
    val wordDstream = messages.map(_._2).map(msg => (msg.split(" ")(2), 1))

    // this is from Spark Source Code example in Streaming/StatefulNetworkWordCount.scala
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      val output = (word, sum)
      state.update(sum)
      output
    }

    val stateDstream = wordDstream.mapWithState(StateSpec.function(mappingFunc))
    stateDstream.print()

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }

}

1 Answer:

Answer 0 (score: 1):

The small spikes you are seeing are caused by checkpointing the data to persistent storage. In order for Spark to perform stateful transformations, it needs to reliably save the state at every defined interval so that it can recover from it if a failure occurs.

Note that the spikes occur consistently every 50 seconds. The interval is computed as (batch time * default multiplier), where the current default multiplier is 10. In your case this is 5 * 10 = 50, which explains why a spike is visible every 50 seconds.
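
A minimal sketch (not from the original post, and assuming Spark 1.6+ where mapWithState was introduced): the checkpoint interval of the state stream can be overridden by calling checkpoint() on the DStream returned by mapWithState, instead of relying on the default of 10 batch intervals. The factor of 20 below is only an illustration.

    // Hypothetical tweak: checkpoint the state every 20 batches instead of the
    // default 10, so the spike appears half as often (but each checkpoint
    // covers more accumulated state).
    val stateDstream = wordDstream.mapWithState(StateSpec.function(mappingFunc))
    stateDstream.checkpoint(Seconds(bd.toInt * 20))
    stateDstream.print()

This does not remove the cost of writing the state to the checkpoint directory; it only changes how often that cost is paid, so it is a trade-off between the frequency and the size of the spikes.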