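I'm learning Scala and trying to figure out how to write a MapReduce program in Scala that finds, for each word in a file, the word that most frequently follows it. This is what I have so far. It works, but I would like to actually use map/reduce, and I'm trying to cut down on loops as much as possible.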
// initialize the list with the first two words
val list = scala.collection.mutable.MutableList((words.collect()(0), words.collect()(1)))
for (x <- 1 to (words.collect().length - 2)) {
  // add each pair of consecutive words to the list
  list += ((words.collect()(x), words.collect()(x + 1)))
}
val rdd1 = spark.parallelize(list)
val rdd2 = rdd1.map(word => (word, 1)) // e.g. key is (basketball, is), value is 1
val counter = rdd2.reduceByKey((x, y) => x + y).sortBy(_._2, false) // sort by count, descending
val result2 = counter.collect()
print("the most frequent follower for basketball, the, and competitive \n")
println(" ")
// call the function for each word
findFreq("basketball", result2)
findFreq("the", result2)
findFreq("competitive", result2)
}
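// method to find the most frequent follower of a specific word
def findFreq(str: String, RDD: Array[((String, String), Int)]): Unit = {
  var max = -1
  // first pass: find the highest count among pairs that start with str
  for (x <- RDD) {
    if (x._1._1.equals(str) && x._2 > max) {
      max = x._2
    }
  }
  // second pass: display every follower that reaches that count
  for (x <- RDD) {
    if (x._1._1.equals(str) && x._2 == max) {
      println("\"" + x._1._1 + "\"" + " is followed by " + "\"" + x._1._2 + "\"" + " " + x._2 + " times.\n")
    }
  }
}
}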
Answer 0 (score: 0)
Given an RDD of words, you can get the word that most frequently follows a given word in a few transformations:
Step 1: Build an RDD of consecutive word pairs using sliding(2)
.sliding(2)
Step 2: Build a pair RDD keyed by (word, w2), then use reduceByKey to count how often each pair occurs for the given word
.collect{ case Array(`word`, w2) => ((word, w2), 1) }
.reduceByKey( _ + _ )
Step 3: Re-key the pair RDD by word, then use reduceByKey to keep the word pair with the maximum count
.map{ case ((`word`, w2), c) => (word, (w2, c)) }
.reduceByKey( (acc, x) => if (x._2 > acc._2) (x._1, x._2) else acc )
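To make the intermediate values concrete, the three steps above can be mimicked on a small in-memory collection. This is only a local sketch with plain Scala collections rather than Spark: the object name SlidingSketch, the sample sentence, and the target word are made up for illustration, and groupBy plus a sum stands in for reduceByKey.

```scala
// Local analogue of the three steps using plain Scala collections (no Spark).
// SlidingSketch, the sample sentence, and the target word are hypothetical.
object SlidingSketch {
  val words: Vector[String] =
    "the cat saw the dog and the cat ran".split(" ").toVector
  val word = "the"

  // Step 1: every pair of consecutive words
  val pairs: Vector[Vector[String]] = words.sliding(2).toVector

  // Step 2: keep pairs that start with `word` and count each (word, w2) key;
  // groupBy followed by a sum stands in for reduceByKey(_ + _)
  val counts: Map[(String, String), Int] =
    pairs
      .collect { case Vector(`word`, w2) => ((word, w2), 1) }
      .groupBy(_._1)
      .map { case (key, ones) => key -> ones.map(_._2).sum }
  // counts contains (the,cat) -> 2 and (the,dog) -> 1

  // Step 3: keep the follower with the highest count
  val best: (String, Int) =
    counts.toVector
      .map { case ((_, w2), c) => (w2, c) }
      .reduce((acc, x) => if (x._2 > acc._2) x else acc)

  def main(args: Array[String]): Unit =
    println(s"'$word' is followed most frequently by '${best._1}' (${best._2} times)")
}
```

On an RDD the same shape applies, but reduceByKey combines counts within each partition before shuffling, which a local groupBy does not model.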
Putting it all together, with the same transformations wrapped in a method:
import org.apache.spark.sql.functions._
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.rdd.RDDFunctions._
// load an RDD of lowercase words from the text file
val rdd = sc.textFile("/path/to/basketball.txt")
  .flatMap( _.split("""[\s,.;:!?]+""") )
  .map( _.toLowerCase )

def mostFreq(word: String, rdd: RDD[String]): RDD[(String, (String, Int))] =
  rdd
    .sliding(2)
    .collect{ case Array(`word`, w2) => ((word, w2), 1) }
    .reduceByKey( _ + _ )
    .map{ case ((`word`, w2), c) => (word, (w2, c)) }
    .reduceByKey( (acc, x) => if (x._2 > acc._2) (x._1, x._2) else acc )
Display the word that most frequently follows the given word:
mostFreq("basketball", rdd).foreach{ case (word, (w2, c)) =>
println(s"'$word' is followed most frequently by '$w2' for $c times. ")
}
// 'basketball' is followed most frequently by 'leagues' for 2 times.
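The question asks about several words (basketball, the, competitive), and calling mostFreq once per word reruns the whole pipeline each time. One possible alternative, offered here as an assumption rather than as part of the original answer, is to drop the filter on a single word, compute the most frequent follower of every word in one pass, and then look up the words of interest. The sketch below uses plain Scala collections so it can run without a cluster; FollowerSketch, mostFreqAll, and the sample sentence are hypothetical names and data. On an RDD, sliding(2) and reduceByKey would play the roles of the local sliding and groupBy.

```scala
// Local sketch (plain Scala collections, not Spark): most frequent follower
// of every word in one pass. FollowerSketch and mostFreqAll are hypothetical.
object FollowerSketch {
  def mostFreqAll(words: Seq[String]): Map[String, (String, Int)] =
    words
      .sliding(2)
      .collect { case Seq(w1, w2) => ((w1, w2), 1) }
      .toVector
      // count each (w1, w2) pair; stands in for reduceByKey(_ + _)
      .groupBy(_._1)
      .map { case (pair, ones) => (pair, ones.size) }
      .toVector
      // re-key by the first word and keep the follower with the largest count
      .map { case ((w1, w2), c) => (w1, (w2, c)) }
      .groupBy(_._1)
      .map { case (w1, candidates) => w1 -> candidates.map(_._2).maxBy(_._2) }

  def main(args: Array[String]): Unit = {
    val words = "the cat saw the dog and the cat ran".split(" ").toSeq
    val best = mostFreqAll(words)
    // look up several words against the single precomputed result
    for (w <- Seq("the", "dog", "bird")) {
      best.get(w) match {
        case Some((w2, c)) => println(s"'$w' is followed most frequently by '$w2' ($c times)")
        case None          => println(s"'$w' is never followed by another word")
      }
    }
  }
}
```

When two followers tie for the highest count, this sketch keeps only one of them, whereas the findFreq loop in the question prints every tied follower.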
Sample text file: /path/to/basketball.txt (content from Wikipedia):
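Basketball is one of the world's most popular and widely viewed sports. The National Basketball Association (NBA) is one of the most significant professional basketball leagues in the world in terms of popularity, salaries, talent, and level of competition. Outside North America, the top clubs from national basketball leagues qualify to continental championships such as the Euroleague and the FIBA Americas League. The FIBA Basketball World Cup and Men's Olympic Basketball Tournament are the major international events of the sport and attract top national teams from around the world.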