I'm trying to iterate over an RDD built from a text file, count each unique word in the file, and then accumulate, for each unique word, all the words that follow it along with their counts. This is what I have so far:
// Connect to the Spark driver
val conf = new SparkConf().setAppName("WordStats").setMaster("local")
val spark = new SparkContext(conf) // creates a new SparkContext

// Load the specified file into an RDD
val lines = spark.textFile(System.getProperty("user.dir") + "/" + "basketball_words_only.txt")

// Split each line into (word, nextWord, 1) triples
val words = lines.flatMap(line => {
  val wordList = line.split(" ")
  for (i <- 0 until wordList.length - 1)
    yield (wordList(i), wordList(i + 1), 1)
})
In case that isn't clear: what I'm trying to do is accumulate, for each word in the file, the set of words that follow it, together with the number of times each of those words follows it.
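The body of the flatMap can be checked without Spark, since it only uses plain Scala on one line of text. A minimal sketch (the sample line is illustrative):

```scala
// Spark-free run of the flatMap body on a single sample line.
val line = "pass the ball"
val wordList = line.split(" ")

// Emit one (word, nextWord, 1) triple per adjacent pair.
val pairs = for (i <- 0 until wordList.length - 1)
  yield (wordList(i), wordList(i + 1), 1)
// pairs contains ("pass", "the", 1) and ("the", "ball", 1)
```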
Answer 0 (score: 0)
If I understand you correctly, we have something like this:
val lines: Seq[String] = ...
val words: Seq[(String, String, Int)] = ...
and we want something like this:
val frequencies: Map[String, Seq[(String, Int)]] =
  words
    .groupBy(_._1)                          // word -> [(word, next, cc), ...]
    .mapValues { values =>
      values
        .map { case (_, next, cc) => (next, cc) }
        .groupBy(_._1)                      // next -> [(next, cc), ...]
        .mapValues(_.map(_._2).sum)         // next -> total count
        .toSeq
    }
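To see what this produces, here is the same aggregation run end-to-end on a small hand-written `words` sequence (the sample triples are illustrative, not from the original file):

```scala
// Sample input: "ball" follows "the" twice, "court" follows it once.
val words: Seq[(String, String, Int)] = Seq(
  ("the", "ball", 1), ("the", "court", 1), ("the", "ball", 1)
)

val frequencies: Map[String, Seq[(String, Int)]] =
  words
    .groupBy(_._1)                          // word -> [(word, next, cc), ...]
    .mapValues { values =>
      values
        .map { case (_, next, cc) => (next, cc) }
        .groupBy(_._1)                      // next -> [(next, cc), ...]
        .mapValues(_.map(_._2).sum)         // next -> total count
        .toSeq
    }
    .toMap
// frequencies("the") contains ("ball", 2) and ("court", 1)
```

Note this operates on an in-memory `Seq`; to run it on the RDD from the question you would express the same steps with Spark's pair-RDD operations instead.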