combineByKey on a DStream raises an error

Posted: 2016-04-01 14:53:35

Tags: scala spark-streaming rdd dstream

I have a DStream of (String, Int) tuples.

When I try combineByKey on it, the compiler tells me to specify the parameter: Partitioner.

my_dstream.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
)  // does not compile: a Partitioner argument is required

However, when I use the same code on the underlying RDDs via foreachRDD, it works fine:

my_dstream.foreachRDD( rdd =>
  rdd.combineByKey(
    (v) => (v, 1),
    (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
    (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
  ))
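
For reference, the two overloads I seem to be hitting look roughly like this (simplified sketches of PairRDDFunctions and PairDStreamFunctions; exact parameter names may differ slightly):

// RDD API (PairRDDFunctions): an overload without a Partitioner exists,
// which is why the foreachRDD version compiles.
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C): RDD[(K, C)]

// DStream API (PairDStreamFunctions): the Partitioner must be passed explicitly.
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiner: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true): DStream[(K, C)]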

Where can I get this partitioner?

1 answer:

Answer 0 (score: 1)


Where can I get this partitioner?

You can create one yourself. Out of the box, Spark ships with two partitioners: HashPartitioner and RangePartitioner; the default is the former. You instantiate one through its constructor, passing the desired number of partitions:

import org.apache.spark.HashPartitioner

val numOfPartitions = // specify the number of partitions you want
val hashPartitioner = new HashPartitioner(numOfPartitions)

my_dstream.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2),
  hashPartitioner
)
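
As a follow-up usage sketch (not from the original answer, and assuming my_dstream: DStream[(String, Int)] and hashPartitioner are defined as above): the (sum, count) combiner is typically turned into a per-key average afterwards, roughly like this:

// Hypothetical continuation: per-key averages from the (sum, count) pairs.
val averages = my_dstream
  .combineByKey(
    (v: Int) => (v, 1),
    (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
    (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2),
    hashPartitioner
  )
  .mapValues { case (sum, count) => sum.toDouble / count }

averages.print() // prints a few (key, average) pairs for each batch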