Question

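I have a text file, and I need to print the most frequent words (with their counts) in descending order, until the words I have printed account for n percent of the total number of words in the document.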

So far I have written the following code:

// Break the file into words
val lines = sc.textFile("somefile.txt")
val words = lines.flatMap(line => line.split(" "))
words.persist()

val wordCount = words.count()
val wordCounts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
// This is how many occurrences are needed to make up 3%
val occurencesNeeded = (0.03 * wordCount).ceil

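My idea is to then use top() to find the most frequent word, remove/pop it from the list, and repeat until I have reached 3% in total. However, I don't know how to turn this into code, or whether this is the right way to approach the problem at all.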

Answer 1

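The idea is to find the least popular word that can still be kept: the word after which including one more, less popular word would exceed the number of occurrences needed to reach 3%.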

val words = sc.textFile("somefile.txt").flatMap(_.split(" "))
words.persist()

val nbrOfWords = words.count()
val occurencesNeeded = (0.03 * nbrOfWords).ceil

val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
// Used twice below (threshold computation, then final filter), so keep it cached.
wordCounts.persist()

val countThreshold =
  wordCounts.values.collect.sorted
    // sorted is ascending, so foldRight visits the largest counts first.
    // (accumulator, threshold, reachedThreshold)
    .foldRight((0, Int.MaxValue, false)) {
      case (count, (accumulator, threshold, false)) => {
        if (accumulator + count <= occurencesNeeded)
          (accumulator + count, count, false)
        else (accumulator, threshold, true)
      }
      // If threshold has already been found, we skip the rest:
      case (count, (accumulator, threshold, true)) =>
        (accumulator, threshold, true)
    }
    ._2

// wordCounts is now materialized and cached, so the raw words can be released.
words.unpersist()

val result =
  wordCounts.filter { case (word, count) => count >= countThreshold }

wordCounts.unpersist()

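// Sort by count, most frequent first, since the question asks for descending order.
result.sortBy(_._2, ascending = false).collect.foreach(println)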

Let's say we have this set of word counts:

(("a", 34), ("b", 12), ("c", 9), ("d", 8), ...)

and 3% of the total number of words in the original text is 49.

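We then go through this list from the most popular word to the least popular. For each word, if its number of occurrences added to those of the more popular words is at most 49, we lower the threshold (the count below which words are not kept) to that word's count.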

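("a", 34): since 34 is lower than 49, the new threshold for keeping words is 34.
("b", 12): since 34 + 12 = 46 is lower than 49, the threshold becomes 12, and the two most popular words account for 46 occurrences.
("c", 9): since 46 + 9 = 55 is higher than 49, this word and all less popular words are discarded. The final count threshold for keeping words is therefore 12.
The remaining, less popular words are not considered.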
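
The walk-through above can be reproduced without Spark on a plain Scala collection. This is only a minimal sketch: the object name ThresholdExample and the helper countThreshold are made up for illustration, and the elided tail of the sample list is assumed to contain just ("s", 2).

```scala
object ThresholdExample {
  // Returns the smallest count that is still kept.
  // counts: occurrences per unique word; occurrencesNeeded: e.g. 3% of all words.
  def countThreshold(counts: Seq[Int], occurrencesNeeded: Double): Int =
    counts.sorted
      // sorted is ascending, so foldRight visits the largest counts first.
      // State: (accumulator, threshold, reachedThreshold)
      .foldRight((0, Int.MaxValue, false)) {
        // Still below the target: take this word and lower the threshold.
        case (count, (acc, _, false)) if acc + count <= occurrencesNeeded =>
          (acc + count, count, false)
        // Target would be exceeded (or was already): freeze the state.
        case (_, (acc, threshold, _)) =>
          (acc, threshold, true)
      }
      ._2

  def main(args: Array[String]): Unit = {
    // Sample data from the example; the tail is an assumption.
    val wordCounts = Seq(("a", 34), ("b", 12), ("c", 9), ("d", 8), ("s", 2))
    val threshold = countThreshold(wordCounts.map(_._2), 49.0)
    println(threshold) // 12
    // Keeps ("a", 34) and ("b", 12)
    wordCounts.filter { case (_, count) => count >= threshold }.foreach(println)
  }
}
```

Because the final filter compares counts against the threshold, every word whose count equals the threshold is kept, even if that pushes the total slightly past the target.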

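Note that during the foldRight phase we use a Boolean to "stop" considering less popular words once the desired threshold has been found. This is needed because ("c", 9) would push the cumulative count above 49, whereas ("s", 2) would keep it below 49 (46 + 2 = 48), so without the flag the threshold would become 2!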
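
To see why the flag matters, here is a hedged sketch of the same fold without it, again on a plain Scala collection. The names NoStopFlagExample and naiveThreshold are hypothetical, and the sample counts assume the list ends with ("s", 2).

```scala
object NoStopFlagExample {
  // Same fold as in the answer, but without the reachedThreshold flag.
  // Deliberately flawed: it keeps looking at smaller counts after one was rejected.
  def naiveThreshold(counts: Seq[Int], occurrencesNeeded: Double): Int =
    counts.sorted
      .foldRight((0, Int.MaxValue)) {
        case (count, (acc, threshold)) =>
          if (acc + count <= occurrencesNeeded) (acc + count, count)
          else (acc, threshold)
      }
      ._2

  def main(args: Array[String]): Unit = {
    val counts = Seq(34, 12, 9, 8, 2)
    // 34 and 12 are accepted (46), 9 and 8 are rejected, then 2 fits (48).
    val threshold = naiveThreshold(counts, 49.0)
    println(threshold) // 2
    // Filtering on count >= 2 now keeps every word: 65 occurrences, well above 49.
    println(counts.filter(_ >= threshold).sum) // 65
  }
}
```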

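Note: this solution collects the counts of all unique words on the driver, which could be a problem if the driver has very limited memory. That would be surprising, though, since the number of unique words in a file will probably not exceed ~20K.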

Spark - build a sorted result list by summing until a threshold is reached
