Question

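My task is to write code that reads a large file (one that does not fit in memory), reverses it, and outputs the five most frequently used words.
I have written the code below, and it does the job.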

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

 object ReverseFile {
  def main(args: Array[String]) {


    val conf = new SparkConf().setAppName("Reverse File")
    conf.set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)
    val txtFile = "path/README_mid.md"
    val txtData = sc.textFile(txtFile)
    txtData.cache()

    val tmp = txtData.map(l => l.reverse).zipWithIndex().map{ case(x,y) => (y,x)}.sortByKey(ascending = false).map{ case(u,v) => v}

    tmp.coalesce(1,true).saveAsTextFile("path/out.md")

    val txtOut = "path/out.md"
    val txtOutData = sc.textFile(txtOut)
    txtOutData.cache()

    val wcData = txtOutData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(ascending = false)
    wcData.collect().take(5).foreach(println)


  }
}

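The problem is that I am new to Spark and Scala. As you can see in the code, I first read the file, reverse it, and save it; then I read the reversed file back in and output the five most frequently used words.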

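Is there a way to tell Spark to save tmp and process wcData at the same time (without saving and then reopening the file)? Otherwise it is like reading the file twice.
From now on I am going to tackle a lot of problems with Spark, so if there is any part of the code that you think could be written better (not things like the absolute path names, but Spark-specific things), I would appreciate the feedback.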

Answer 1

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object ReverseFile {
  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("Reverse File")
    conf.set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)
    val txtFile = "path/README_mid.md"
    val txtData = sc.textFile(txtFile)
    txtData.cache()

    val reversed = txtData
      .zipWithIndex()
      .map(_.swap)
      .sortByKey(ascending = false)
      .map(_._2) // No need to deconstruct the tuple.

    // No need for the coalesce, spark should do that by itself.
    reversed.saveAsTextFile("path/reversed.md")

    // Reuse txtData here.
    val wcData = txtData
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .map(_.swap)
      .sortByKey(ascending = false)

    wcData
      .take(5) // Take already collects.
      .foreach(println)
  }
}

Always do the collect() last, so that Spark can evaluate as much as possible on the cluster.

Answer 2

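The most expensive part of your code is the sorting, so the obvious improvement is to remove it. This is relatively simple in the second case (the word count), where a full sort is entirely unnecessary: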

val wcData = txtData
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _) // No need to swap or sort

// Use the top method with an explicit ordering in place of swap / sortByKey
val top5 = wcData.top(5)(scala.math.Ordering.by[(String, Int), Int](_._2))
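
To illustrate why taking the top five does not need a full sort, here is a minimal plain-Scala sketch that keeps only the k largest counts in a bounded priority queue. It is an illustration of the general idea, not Spark's implementation; `TopKSketch` and `topK` are hypothetical names introduced for this example.

```scala
import scala.collection.mutable

object TopKSketch {
  // Returns the k pairs with the highest counts, largest first.
  // Only k elements are held at any time, so the input is never fully sorted.
  def topK(counts: Iterator[(String, Int)], k: Int): List[(String, Int)] = {
    // Reversed ordering: the head of the queue is the smallest retained count.
    val minFirst = Ordering.by[(String, Int), Int](_._2).reverse
    val heap = mutable.PriorityQueue.empty[(String, Int)](minFirst)
    counts.foreach { pair =>
      heap.enqueue(pair)
      if (heap.size > k) heap.dequeue() // evict the current smallest count
    }
    // Sort only the k survivors, highest count first.
    heap.toList.sortBy(p => -p._2)
  }
}
```

For example, `TopKSketch.topK(Iterator("a" -> 3, "b" -> 1, "c" -> 5, "d" -> 2), 2)` keeps `("c", 5)` and `("a", 3)`.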

Reversing the order of the lines is a little trickier. First, let's reverse the order of the elements within each partition:

val reversedPartitions = txtData.mapPartitions(_.toList.reverse.toIterator)

Now you have two options.

Use a custom partitioner:

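import org.apache.spark.Partitioner

class ReversePartitioner(n: Int) extends Partitioner {
  def numPartitions: Int = n
  // Keys are the original partition indices; partition k is sent to slot n - 1 - k.
  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    numPartitions - 1 - k
  }
}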

val partitioner = new ReversePartitioner(reversedPartitions.partitions.size)

val reversed = reversedPartitions
  // Add current partition number
  .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.toList)))
  // Repartition to get reversed order
  .partitionBy(partitioner)
  // Drop partition numbers
  .values
  // Reshape
  .flatMap(identity)

It still requires a shuffle, but it is relatively portable, and the data is still accessible in memory.
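
The two steps (reverse inside each partition, then reverse the order of the partitions) can be checked on ordinary collections. The following is a minimal sketch without Spark, in which each inner `Vector` stands in for one partition; `ReverseSketch` and `reverseAll` are hypothetical names used only for this illustration.

```scala
object ReverseSketch {
  // Each inner Vector plays the role of one RDD partition (an assumed layout).
  def reverseAll[A](partitions: Vector[Vector[A]]): Vector[Vector[A]] = {
    val n = partitions.size
    // Step 1: reverse the elements inside every partition.
    val reversedInside = partitions.map(_.reverse)
    // Step 2: send partition k to slot n - 1 - k, the same mapping
    // that ReversePartitioner.getPartition computes.
    val slots = reversedInside.zipWithIndex.map { case (part, k) => (n - 1 - k, part) }
    slots.sortBy(_._1).map(_._2)
  }
}
```

Flattening the result of `reverseAll` gives the same sequence as reversing the flattened input, which is the property the partitioner-based approach relies on.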

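If you only want to save the reversed data, you can call saveAsTextFile on reversedPartitions and reorder the output files logically. Since the part-n name format identifies the source partition, all you have to do is rename part-n to part-(number-of-partitions - 1 - n). This requires saving the data, so it is not exactly optimal, but if you use, for example, an in-memory file system, it can be a pretty good solution.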
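
The renaming can be planned with a small helper. This is a sketch under the assumption that the output files use zero-padded five-digit names such as `part-00000`; check the actual names in your output directory before relying on it. `PartRename`, `partName`, and `renamePlan` are hypothetical names introduced here.

```scala
object PartRename {
  // Assumed naming scheme: "part-" followed by a zero-padded five-digit index.
  def partName(n: Int): String = f"part-$n%05d"

  // Pairs of (current name, new name) that reverse the logical file order.
  def renamePlan(numPartitions: Int): Seq[(String, String)] =
    (0 until numPartitions).map { n =>
      partName(n) -> partName(numPartitions - 1 - n)
    }
}
```

Because the old and new names overlap (for three partitions, `part-00000` and `part-00002` swap), it is safer to move the files into a fresh directory under their new names than to rename them in place.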

Spark code optimization
