I'm trying to iterate over an RDD built from a text file, count each unique word in the file, and then accumulate, for each unique word, all the words that follow it along with their counts. This is what I have so far:
// Connect to the Spark driver
val conf = new SparkConf().setAppName("WordStats").setMaster("local")
val spark = new SparkContext(conf) // creates a new SparkContext

// Load the specified file into an RDD
val lines = spark.textFile(System.getProperty("user.dir") + "/" + "basketball_words_only.txt")

// Split each line into (word, nextWord, 1) triples
val words = lines.flatMap(line => {
  val wordList = line.split(" ")
  for (i <- 0 until wordList.length - 1)
    yield (wordList(i), wordList(i + 1), 1)
})
In case that isn't clear: what I'm trying to do is accumulate, for each word in the file, the set of words that follow it, together with the number of times each of those words follows it.
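The body of the flatMap can be checked without Spark, since it only uses plain Scala on one line of text. A minimal sketch (the sample line is illustrative):

```scala
// Spark-free run of the flatMap body on a single sample line.
val line = "pass the ball"
val wordList = line.split(" ")

// Emit one (word, nextWord, 1) triple per adjacent pair.
val pairs = for (i <- 0 until wordList.length - 1)
  yield (wordList(i), wordList(i + 1), 1)
// pairs contains ("pass", "the", 1) and ("the", "ball", 1)
```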
Answer 0 (score: 0)
If I understand you correctly, we have something like this:
val lines: Seq[String] = ...
val words: Seq[(String, String, Int)] = ...
and we want something like this:
val frequencies: Map[String, Seq[(String, Int)]] =
  words
    .groupBy(_._1)                          // word -> [(word, next, cc), ...]
    .mapValues { values =>
      values
        .map { case (_, next, cc) => (next, cc) }
        .groupBy(_._1)                      // next -> [(next, cc), ...]
        .mapValues(_.map(_._2).sum)         // next -> total count
        .toSeq
    }
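To see what this produces, here is the same aggregation run end-to-end on a small hand-written `words` sequence (the sample triples are illustrative, not from the original file):

```scala
// Sample input: "ball" follows "the" twice, "court" follows it once.
val words: Seq[(String, String, Int)] = Seq(
  ("the", "ball", 1), ("the", "court", 1), ("the", "ball", 1)
)

val frequencies: Map[String, Seq[(String, Int)]] =
  words
    .groupBy(_._1)                          // word -> [(word, next, cc), ...]
    .mapValues { values =>
      values
        .map { case (_, next, cc) => (next, cc) }
        .groupBy(_._1)                      // next -> [(next, cc), ...]
        .mapValues(_.map(_._2).sum)         // next -> total count
        .toSeq
    }
    .toMap
// frequencies("the") contains ("ball", 2) and ("court", 1)
```

Note this operates on an in-memory `Seq`; to run it on the RDD from the question you would express the same steps with Spark's pair-RDD operations instead.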