I have the following dataset:
val data = sc.parallelize(Array(
  ("group1", "value1"), ("group1", "value2"), ("group1", "value1"),
  ("group2", "value1"), ("group1", "value1"), ("group3", "value3")
))
I am using Spark, and I would like to get the following result:
Array(("group1","value1",3), ("group1","value2",1), ("group2","value1",1), ("group3","value3",1))
I tried to use combineByKey, but my combiner does not work. I looked at the code described at http://codingjunkie.net/spark-combine-by-key/, but my combiner still does not work, because I want to count the number of occurrences rather than sum numeric values.
Here is my code:
val reduced = data.combineByKey(
  (value) => {
    println(s"Create combiner -> ${value}")
    (value, 1)
  },
  (acc: (Array[String], Int), v) => {
    println(s"""Merge value : (${acc._1} :+ ${v}, ${acc._2} + 1)""")
    (acc._1 :+ v, acc._2 + 1)
  },
  (acc1: (Array[String], Int), acc2: (Array[String], Int)) => {
    println(s"""Merge Combiner : (${acc1._1} :+ ${acc2._1}, ${acc1._2} + ${acc2._2})""")
    (acc1._1 :+ acc2._1, acc1._2 + acc2._2)
  }
)
Do you have any suggestions?
Answer 0 (score: 0)
You don't need combineByKey here; reduceByKey will do just fine:
data.map((_, 1))
  .reduceByKey(_ + _)
  .map { case ((k1, k2), v) => (k1, k2, v) }
  .collect
// Array[(String, String, Int)] = Array((group3,value3,1), (group1,value1,3), (group1,value2,1), (group2,value1,1))
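Since a SparkContext isn't available in a plain REPL, the counting logic of the pipeline above can be checked locally with ordinary Scala collections (a sketch: localData is a hypothetical Seq standing in for the RDD, and groupBy plays the role of the shuffle):

```scala
// Plain-Scala sketch of the reduceByKey pipeline above, driven on a
// local Seq instead of an RDD, to check the counting semantics.
val localData = Seq(
  ("group1", "value1"), ("group1", "value2"), ("group1", "value1"),
  ("group2", "value1"), ("group1", "value1"), ("group3", "value3")
)

val counted = localData
  .map((_, 1))                  // ((group, value), 1)
  .groupBy(_._1)                // group by the full (group, value) pair
  .map { case ((k1, k2), ones) => (k1, k2, ones.map(_._2).sum) }
  .toSeq
  .sorted
// sorted contents: (group1,value1,3), (group1,value2,1),
//                  (group2,value1,1), (group3,value3,1)
```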
Your code doesn't work because
(value) => {
  println(s"Create combiner -> ${value}")
  (value, 1)
}
has type String => (String, Int), while mergeValue expects an accumulator of type (Array[String], Int). Later on you also use the wrong method to concatenate Arrays: :+ appends a single element, while ++ concatenates two collections.
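The difference between the two Array operators is worth spelling out; a minimal sketch (a and b are illustrative names):

```scala
// `:+` appends a single element; `++` concatenates two collections.
// Appending a whole Array with `:+` still compiles because the element
// type widens, but the appended Array becomes one nested element.
val a = Array("x", "y")
val b = Array("z", "w")

val flat   = a ++ b   // Array(x, y, z, w) -- what mergeCombiners needs
val nested = a :+ b   // 3 elements; the last one is the Array b itself

println(flat.length)    // 4
println(nested.length)  // 3
```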
If you change it as follows:
val reduced = data.combineByKey(
  (value) => {
    (Array(value), 1)
  },
  (acc: (Array[String], Int), v) => {
    (acc._1 :+ v, acc._2 + 1)
  },
  (acc1: (Array[String], Int), acc2: (Array[String], Int)) => {
    (acc1._1 ++ acc2._1, acc1._2 + acc2._2)
  }
)
it will compile, but the result won't be the one you expect:
reduced.collect
// Array[(String, (Array[String], Int))] = Array((group3,(Array(value3),1)), (group1,(Array(value1, value2, value1, value1),4)), (group2,(Array(value1),1)))
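If combineByKey really is what you want, one option (a local sketch, not run on a cluster) is to accumulate a Map[String, Int] of per-value counts for each key and flatten it afterwards. In Spark you would pass the three functions below to data.combineByKey(createCombiner, mergeValue, mergeCombiners); here plain folds over a hypothetical local Seq stand in for Spark's partition-level aggregation:

```scala
// Three combineByKey-style functions that count occurrences per value.
val createCombiner: String => Map[String, Int] =
  v => Map(v -> 1)

val mergeValue: (Map[String, Int], String) => Map[String, Int] =
  (acc, v) => acc + (v -> (acc.getOrElse(v, 0) + 1))

val mergeCombiners: (Map[String, Int], Map[String, Int]) => Map[String, Int] =
  (a, b) => b.foldLeft(a) { case (m, (v, c)) => m + (v -> (m.getOrElse(v, 0) + c)) }

val localData = Seq(
  ("group1", "value1"), ("group1", "value2"), ("group1", "value1"),
  ("group2", "value1"), ("group1", "value1"), ("group3", "value3")
)

// Fold a run of values into one combiner, as a single partition would.
def combine(vs: Seq[String]): Map[String, Int] = vs match {
  case head +: tail => tail.foldLeft(createCombiner(head))(mergeValue)
  case _            => Map.empty[String, Int]
}

// Split each key's values in two to exercise mergeCombiners as well
// (as if they arrived from two partitions), then flatten to triples.
val result = localData
  .groupBy(_._1)
  .map { case (k, pairs) =>
    val (left, right) = pairs.map(_._2).splitAt(pairs.size / 2)
    k -> mergeCombiners(combine(left), combine(right))
  }
  .flatMap { case (k, m) => m.map { case (v, c) => (k, v, c) } }
  .toSet
// result: Set((group1,value1,3), (group1,value2,1),
//             (group2,value1,1), (group3,value3,1))
```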