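I need a function that takes two files as input and outputs the word that is shared the most between the two files.
For example: File1 = thank thank thank you you, File2 = thank you you you
The output would be 'you', because it is shared 2 times. My current function: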
def sharedWord(a:String,b:String) : String ={
val aFile = sc.textFile(a);
val bFile = sc.textFile(b);
val flattenMapa = aFile.flatMap(line => line.split(" "));
val flattenMapb = bFile.flatMap(line => line.split(" "));
val mapreduceA = flattenMapa.map(word => (word.toLowerCase, 1)).reduceByKey((key,value) => key+value);
val mapreduceB = flattenMapb.map(word => (word.toLowerCase, 1)).reduceByKey((key,value) => key+value);
//not sure how to compare the two mapreduce collections of words
//val common = most shared word.
return common
}
I'm stuck on how to correctly compare the two map-reduce key-value collections.
Answer 0 (score: 0)
Here is the missing part.
// mapreduceA: RDD of [(thank, 3), (you, 2)]
// mapreduceB: RDD of [(thank, 1), (you, 3)]
val joinedRDD = mapreduceA.join(mapreduceB) // (1)
// RDD of ((thank,(3,1)), (you,(2,3)))
val minimumWordFrequencies = joinedRDD.mapValues { case (countA, countB) => math.min(countA, countB) } // (2)
// RDD of ((thank,1), (you,2))
val mostFrequentWord =
minimumWordFrequencies.reduce((a, b) => if (a._2 > b._2) a else b) // (3)
// mostFrequentWord: (you,2)
val common = mostFrequentWord._1
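
To check the logic without a Spark cluster, the same three steps (join on the word, take the minimum of the two counts, pick the maximum) can be sketched with plain Scala collections. This is only an illustration under assumptions: `SharedWord`, `wordCounts` and `mostSharedWord` are hypothetical names that do not appear in the question, the inputs are strings rather than file paths, and the result is an `Option` so that the "no shared word" case is explicit.

```scala
object SharedWord {
  // Count lower-cased words in a text, splitting on whitespace.
  def wordCounts(text: String): Map[String, Int] =
    text.split("\\s+")
      .filter(_.nonEmpty)
      .map(_.toLowerCase)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.length) }

  // Returns the word with the largest min(countA, countB) together with
  // that count, or None when the two texts have no word in common.
  def mostSharedWord(a: String, b: String): Option[(String, Int)] = {
    val countsA = wordCounts(a)
    val countsB = wordCounts(b)
    // Keep only words present in both texts (the equivalent of the join).
    val shared = countsA.toSeq.collect {
      case (word, countA) if countsB.contains(word) =>
        (word, math.min(countA, countsB(word)))
    }
    if (shared.isEmpty) None else Some(shared.maxBy(_._2))
  }
}
```

With the example from the question, `SharedWord.mostSharedWord("thank thank thank you you", "thank you you you")` yields `Some(("you", 2))`. The empty case matters in the Spark version too: `reduce` on an empty RDD throws, so if the two files may have no word in common it is worth checking `minimumWordFrequencies.isEmpty()` first. When several words tie for the highest count, which one is returned is not specified in either version.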