Writing a Scala function to find the most shared word between two files

Time: 2018-04-29 23:23:53

Tags: scala apache-spark

I need a function that takes two files as input and outputs the word that is shared the most between them.

For example: File1 = thank thank thank you you, File2 = thank you you you

The output would be 'you', because it is shared 2 times. My current function:

def sharedWord(a: String, b: String): String = {
  val aFile = sc.textFile(a)
  val bFile = sc.textFile(b)
  val flattenMapa = aFile.flatMap(line => line.split(" "))
  val flattenMapb = bFile.flatMap(line => line.split(" "))
  // reduceByKey receives two counts for the same word (not a key and a value)
  val mapreduceA = flattenMapa.map(word => (word.toLowerCase, 1)).reduceByKey((x, y) => x + y)
  val mapreduceB = flattenMapb.map(word => (word.toLowerCase, 1)).reduceByKey((x, y) => x + y)
  // not sure how to compare the two mapreduce collections of words
  // val common = most shared word.
  return common
}

I'm stuck on how to correctly compare the two map-reduce key-value collections.

1 answer:

Answer 0: (score: 0)

Here is the missing part.

// mapreduceA: RDD of [(thank, 3), (you, 2)]
// mapreduceB: RDD of [(thank, 1), (you, 3)]

val joinedRDD = mapreduceA.join(mapreduceB)                                 // (1)
// RDD of ((thank,(3,1)), (you,(2,3)))

val minimumWordFrequencies = joinedRDD.mapValues(x => List(x._1, x._2).min) // (2)
// RDD of ((thank,1), (you,2))

val mostFrequentWord =
    minimumWordFrequencies.reduce((a, b) => if (a._2 > b._2) a else b)      // (3)
// mostFrequentWord: (you,2)

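val common = mostFrequentWord._1

  1. Join the two PairRDDs by key (the word).
  2. Apply a function that selects the minimum of the two word frequencies.
  3. Select the word with the highest resulting frequency.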
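
To check the logic without a cluster, the three steps above can be sketched on plain in-memory collections. This is only an illustration under some assumptions: the object `SharedWordSketch`, the helper `wordCounts`, and the `Option` return type are hypothetical names and choices that do not appear in the question or the answer, the inputs are strings rather than file paths, and words are split on any whitespace rather than on a single space.

```scala
// Spark-free sketch of the join / min / max steps, using in-memory Maps.
// All names here (SharedWordSketch, wordCounts) are hypothetical.
object SharedWordSketch {

  // Count lower-cased, whitespace-separated words
  // (plays the role of flatMap + map + reduceByKey in the question).
  def wordCounts(text: String): Map[String, Int] =
    text.split("\\s+")
      .filter(_.nonEmpty)
      .map(_.toLowerCase)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.length) }

  // Returns None when the two texts have no word in common.
  // Ties between equally shared words are broken arbitrarily.
  def sharedWord(a: String, b: String): Option[String] = {
    val countsA = wordCounts(a)
    val countsB = wordCounts(b)

    // (1) Inner join on the word: keep only words present in both texts.
    val joined: Map[String, (Int, Int)] = countsA.collect {
      case (word, ca) if countsB.contains(word) => (word, (ca, countsB(word)))
    }

    // (2) A word is "shared" as many times as its smaller count.
    val minimums: Map[String, Int] =
      joined.map { case (word, (ca, cb)) => (word, math.min(ca, cb)) }

    // (3) Pick the word with the largest shared count, if there is one.
    if (minimums.isEmpty) None else Some(minimums.maxBy(_._2)._1)
  }
}
```

With the example from the question, `SharedWordSketch.sharedWord("thank thank thank you you", "thank you you you")` evaluates to `Some("you")`. The explicit empty check matters in the Spark version too: `reduce` on an empty RDD throws, so a similar guard may be worth adding when the two files might share no words.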