Using Spark combineByKey with a collection of values

Date: 2018-01-14 12:12:02

Tags: scala apache-spark

I have the following dataset:

val data = sc.parallelize(Array(
    ("group1","value1"),("group1","value2"),("group1", "value1"),
    ("group2","value1"),("group1","value1"),("group3", "value3")
))

I am using Spark and I want to get the following result:

Array(("group1","value1",3),("group1", "value2",1),("group2","value1",1),("group3","value3",1)

I tried to use combineByKey, but my combiner does not work.

I looked at the code described at http://codingjunkie.net/spark-combine-by-key/, but my combiner does not work there either, because I want to count the number of instances rather than sum a series of numbers.

Here is my code:

val reduced = data.combineByKey(
  (value) => {
    println(s"Create combiner -> ${value}")
    (value, 1)
  },
  (acc: (Array[String], Int), v) => {
    println(s"""Merge value : (${acc._1} :+ ${v}, ${acc._2} + 1)""")
    (acc._1 :+ v, acc._2 + 1)
  },
  (acc1: (Array[String], Int), acc2: (Array[String], Int)) => {
    println(s"""Merge Combiner : (${acc1._1} :+ ${acc2._1}, ${acc1._2} + ${acc2._2})""")
    (acc1._1 :+ acc2._1, acc1._2 + acc2._2)
  }
)

Do you have any suggestions?

1 answer:

Answer 0 (score: 0)

You don't need combineByKey here, reduceByKey will do just fine:

data.map((_, 1))                             // key by the full (group, value) pair
  .reduceByKey(_ + _)                        // sum the 1s: occurrences per pair
  .map { case ((k1, k2), v) => (k1, k2, v) } // flatten to (group, value, count)
  .collect

// Array[(String, String, Int)] = Array((group3,value3,1), (group1,value1,3), (group1,value2,1), (group2,value1,1))
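As a side note, when the number of distinct (group, value) pairs is small enough to fit in driver memory, RDD.countByValue can produce the same counts directly as a local map. A minimal sketch (localCounts is just an illustrative name):

// countByValue returns a local Map[(String, String), Long] on the driver,
// so this is only appropriate when the number of distinct pairs is small
val localCounts = data.countByValue()
  .map { case ((group, value), count) => (group, value, count) }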

Your code doesn't work because

(value) => {
  println(s"Create combiner -> ${value}")
  (value, 1)
}

has type String => (String, Int), while mergeValue expects the accumulator to be of type (Array[String], Int). Later on you also use an incorrect method (:+ instead of ++) to concatenate the Arrays.
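If you do want to stick with combineByKey and get the (group, value, count) triples directly, one option is to key by the whole (group, value) pair first, just like the reduceByKey version above. A minimal sketch (triples is just an illustrative name):

val triples = data
  .map(pair => (pair, 1))                   // key by the full (group, value) pair
  .combineByKey(
    (one: Int) => one,                      // createCombiner: first occurrence starts the count
    (acc: Int, one: Int) => acc + one,      // mergeValue: bump the count within a partition
    (acc1: Int, acc2: Int) => acc1 + acc2   // mergeCombiners: add partial counts across partitions
  )
  .map { case ((group, value), count) => (group, value, count) }

triples.collect
// e.g. Array((group3,value3,1), (group1,value1,3), (group1,value2,1), (group2,value1,1))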

If you change it as follows:

val reduced = data.combineByKey(
  (value) => {
    (Array(value), 1)                        // createCombiner: start with a one-element array
  },
  (acc: (Array[String], Int), v) => {
    (acc._1 :+ v, acc._2 + 1)                // mergeValue: append the value, bump the count
  },
  (acc1: (Array[String], Int), acc2: (Array[String], Int)) => {
    (acc1._1 ++ acc2._1, acc1._2 + acc2._2)  // mergeCombiners: concatenate arrays, add counts
  }
)

it will compile, but the result won't be the one you expect:

reduced.collect
// Array[(String, (Array[String], Int))] = Array((group3,(Array(value3),1)), (group1,(Array(value1, value2, value1, value1),4)), (group2,(Array(value1),1)))
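If you start from this per-group result, you can still recover the (group, value, count) triples by counting occurrences inside each array; a minimal sketch on top of the corrected reduced RDD (the groupBy here is a plain Scala collection operation on the local array):

reduced
  .flatMap { case (group, (values, _)) =>
    values.groupBy(identity).toSeq           // bucket identical values together
      .map { case (value, occurrences) => (group, value, occurrences.length) }
  }
  .collect
// e.g. Array((group3,value3,1), (group1,value1,3), (group1,value2,1), (group2,value1,1))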