The following code does a word count on a collection of type org.apache.spark.rdd.RDD[(String, List[(String, Int)])]:
import org.apache.spark.rdd.RDD

val words: RDD[(String, List[(String, Int)])] =
  sc.parallelize(List(("a", List(("test", 1), ("test", 1)))))
// flatMap keeps only the inner (word, count) pairs, so the outer label "a" is dropped
val missingLabels: RDD[(String, Int)] = words.flatMap(m => m._2).reduceByKey((a, b) => a + b)
println("Labels Missing")
missingLabels.collect().foreach(println)   // prints (test,2)
How can I hold on to the label, so that instead of the value ("test", 2) I extract ("a", ("test", 2))? In other words, keeping the type RDD[(String, List[(String, Int)])]?
Answer 0 (score: 3)
If I understand you correctly, you just need to play around with the tuples a little:
import org.apache.spark.rdd.RDD

val words: RDD[(String, List[(String, Int)])] =
  sc.parallelize(List(("a", List(("test", 1), ("test", 1)))))

// Re-key by word, carrying the label along inside the value
val wordsWithLabels = words
  .flatMap {
    case (label, listOfValues) => listOfValues.map {
      case (word, count) => (word, (label, count))
    }
  }

val result = wordsWithLabels
  // Sum the counts per word, keeping one of the labels
  .reduceByKey {
    case ((label1, count1), (label2, count2)) =>
      (label1, count1 + count2)
  }
  // Put the label back into key position: (label, (word, count))
  .map {
    case (word, (label, count)) =>
      (label, (word, count))
  }

result.foreach(println)
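For the sample words RDD above this should print (a,(test,2)). As a rough sanity check, the same re-key / reduce / re-label steps can be traced with plain Scala collections (no Spark needed), assuming the same sample data:

// Plain-Scala trace of the same three steps, using the sample data from the question.
val local = List(("a", List(("test", 1), ("test", 1))))

// 1. Re-key by word, keeping the label inside the value
val rekeyed = local.flatMap { case (label, vs) => vs.map { case (w, c) => (w, (label, c)) } }
// rekeyed == List((test,(a,1)), (test,(a,1)))

// 2. Group by word and sum the counts (stand-in for reduceByKey), then restore the label as key
val traced = rekeyed
  .groupBy(_._1)
  .map { case (w, grp) => (grp.head._2._1, (w, grp.map(_._2._2).sum)) }
// traced == Map(a -> (test,2))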
Answer 1 (score: 1)
If the keys can repeat, then I assume you want to reduce them down to a single pair per word? If so:
// Collapse duplicate words within one list: List(("test",1),("test",1)) -> Map(test -> 2)
def reduceList(list: List[(String, Int)]) =
  list.groupBy(_._1).mapValues(_.aggregate(0)(_ + _._2, _ + _))

val words: org.apache.spark.rdd.RDD[(String, List[(String, Int)])] =
  sc.parallelize(List(("a", List(("test", 1), ("test", 1)))))

// First merge duplicates inside each record's own list
val mergedList = words.mapValues((list: List[(String, Int)]) => reduceList(list).toList)

// Then merge the lists of records that share the same label
val missingLabels = mergedList.reduceByKey((accum: List[(String, Int)], value: List[(String, Int)]) => {
  val valueMap = value.toMap
  val accumMap = accum.toMap
  val mergedMap = accumMap ++ valueMap.map { case (k, v) => k -> (v + accumMap.getOrElse(k, 0)) }
  mergedMap.toList
})

missingLabels.foreach(println)
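Note that, unlike the first answer, this keeps the RDD[(String, List[(String, Int)])] shape, so the sample input should print (a,List((test,2))). A quick local check of reduceList on its own (plain Scala, written against the 2.11/2.12 collections API, where mapValues and aggregate are not yet deprecated; the second input list is made up for illustration):

reduceList(List(("test", 1), ("test", 1)))               // Map(test -> 2)
reduceList(List(("test", 1), ("foo", 3), ("test", 1)))   // Map(test -> 2, foo -> 3)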