Spark Streaming example calls updateStateByKey with additional parameters

Time: 2015-03-11 22:14:47

Tags: scala streaming apache-spark

I am wondering why the StatefulNetworkWordCount.scala example calls the infamous updateStateByKey() function, which is supposed to take only a function as its parameter, like this:

val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)

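Why does it need to be passed a partitioner, a Boolean, and an RDD, and how are they handled, given that they are not in the signature of updateStateByKey()?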

Thanks, Matt

1 Answer:

Answer 0: (score: 4)

This is because:

  1. You are looking at a different Spark version branch: https://github.com/apache/spark/blob/branch-1.3/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala. In Spark 1.2, this example called updateStateByKey with only a single function as the argument, whereas in 1.3 they optimized it.
  2. Different overloads of updateStateByKey exist in both 1.2 and 1.3. However, 1.2 has no overload that takes 4 parameters; it was only introduced in 1.3: https://github.com/apache/spark/blob/branch-1.3/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala

    Here is the code:

    /**
    * Return a new "state" DStream where the state for each key is updated by applying
    * the given function on the previous state of the key and the new values of each key.
    * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
    * @param updateFunc State update function. Note, that this function may generate a different
    * tuple with a different key than the input key. Therefore keys may be removed
    * or added in this way. It is up to the developer to decide whether to
    * remember the partitioner despite the key being changed.
    * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
    * DStream
    * @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
    * @param initialRDD initial state value of each key.
    * @tparam S State type
    */
    def updateStateByKey[S: ClassTag](
        updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
        partitioner: Partitioner,
        rememberPartitioner: Boolean,
        initialRDD: RDD[(K, S)]
    ): DStream[(K, S)] = {
        new StateDStream(self, ssc.sc.clean(updateFunc), partitioner,
        rememberPartitioner, Some(initialRDD))
    }
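
As a rough illustration of how the function passed to this overload differs from the single-argument form, here is a minimal sketch in plain Scala, with no Spark dependency. It lifts a per-key update function into the iterator form that the `updateFunc` parameter above expects (with K = String, V = Int, S = Int). The object name `UpdateFuncSketch`, the helper `step`, and the sample data are assumptions made for illustration; plain Maps stand in for RDDs, and partitioning is not modeled at all.

```scala
object UpdateFuncSketch {
  // Per-key form, as accepted by the single-function overload:
  // (new values for the key in this batch, previous state) => new state.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => Some(values.sum + state.getOrElse(0))

  // Iterator form, matching the updateFunc parameter type of the
  // four-parameter overload. Returning None from updateFunc drops the key.
  val newUpdateFunc: Iterator[(String, Seq[Int], Option[Int])] => Iterator[(String, Int)] =
    iterator => iterator.flatMap { case (key, values, state) =>
      updateFunc(values, state).map(s => (key, s))
    }

  // Hypothetical stand-in for one batch step (not Spark code):
  // `state` plays the role of the previous state, `batch` the new records.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val keys = state.keySet ++ grouped.keySet
    val input = keys.iterator.map { k =>
      (k, grouped.getOrElse(k, Seq.empty[Int]), state.get(k))
    }
    newUpdateFunc(input).toMap
  }

  def main(args: Array[String]): Unit = {
    // Illustrative seed counts, playing the role of initialRDD.
    val initialState = Map("hello" -> 1, "world" -> 1)
    val afterBatch = step(initialState, Seq("hello" -> 1, "spark" -> 1, "hello" -> 1))
    println(afterBatch)
  }
}
```

In this sketch the seed map means the first batch starts from existing counts rather than from empty state, which is the role the Scaladoc above gives to `initialRDD`. The `partitioner` and `rememberPartitioner` arguments only affect how the resulting RDDs are partitioned, so they have no counterpart here.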