How do I use reduceByKey on a key-value pair with the tuple set up as (String, (String, Int))?

Asked: 2017-04-19 22:47:19

Tags: scala apache-spark

I'm trying to iterate over an RDD built from a text file, count every unique word in the file, and then accumulate, for each unique word, all of the words that follow it together with their counts. This is what I have so far:

import org.apache.spark.{SparkConf, SparkContext}

// Connect to the Spark driver
val conf = new SparkConf().setAppName("WordStats").setMaster("local")
val spark = new SparkContext(conf) // creates a new SparkContext object

// Load the specified file into an RDD
val lines = spark.textFile(System.getProperty("user.dir") + "/" + "basketball_words_only.txt")

// Split each line into words and emit a (word, nextWord, 1) triple
// for every adjacent pair of words
val words = lines.flatMap { line =>
  val wordList = line.split(" ")
  for (i <- 0 until wordList.length - 1)
    yield (wordList(i), wordList(i + 1), 1)
}

[Image: output generated by my current MapReduce program]

In case that wasn't clear: what I'm trying to do is accumulate, for every word in the file, the set of words that follow it, together with the number of times each of those following words appears.
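For example (an illustrative line I made up, not from the actual basketball file), the desired result would look like this:

// Hypothetical input line, for illustration only
val sample = Seq("the ball hit the rim")
// Desired result for this input:
// Map(
//   "the"  -> Seq(("ball", 1), ("rim", 1)),
//   "ball" -> Seq(("hit", 1)),
//   "hit"  -> Seq(("the", 1))
// )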

1 Answer:

Answer 0 (score: 0)

If I understand correctly, we have something like this:

val lines: Seq[String] = ...
val words: Seq[(String, String, Int)] = ...

and we want something like this:

val frequencies: Map[String, Seq[(String, Int)]] = {
  words
    .groupBy(_._1)                        // word -> [(w, next, cc), ...]
    .mapValues { values =>
      values
        .map { case (w, n, cc) => (n, cc) }
        .groupBy(_._1)                    // next -> [(next, cc), ...]
        .mapValues(_.map(_._2).sum)       // next -> summed count
        .toSeq
    }
}
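
Since the title asks about reduceByKey on an actual RDD, here is a minimal sketch of the same aggregation done distributed, assuming the words RDD from the question (my own adaptation, not part of the original answer):

// Key by the (word, next) pair, sum the 1s with reduceByKey, then
// regroup so that each word maps to its (next, totalCount) pairs.
val frequenciesRdd = words
  .map { case (w, next, cc) => ((w, next), cc) }   // ((word, next), 1)
  .reduceByKey(_ + _)                              // ((word, next), total)
  .map { case ((w, next), cc) => (w, (next, cc)) } // (word, (next, total))
  .groupByKey()                                    // (word, Iterable[(next, total)])

Because reduceByKey has already collapsed duplicate (word, next) pairs before the groupByKey, each group only holds a word's distinct followers, which keeps the final shuffle small.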