Iterating over a huge list causes "GC overhead limit exceeded"

Date: 2017-07-09 10:53:44

Tags: scala apache-spark garbage-collection

I have a very large file. For every bigram (pair of words) on every line of the file, I have to check the entire file. What I am doing in Scala is clearly wrong, but I don't know how to fix it.

This function returns all the lines of the file (about 3 million of them!):

    import java.io.{BufferedReader, FileInputStream, InputStreamReader}
    import java.util.ArrayList

    // Reads every line of the file into an in-memory ArrayList.
    def allSentences(): ArrayList[String] = {
      val res = new ArrayList[String]()
      val filename = "/path/test.txt"
      val fstream: FileInputStream = new FileInputStream(filename)
      val br: BufferedReader = new BufferedReader(new InputStreamReader(fstream))
      var strLine: String = null
      while ({ strLine = br.readLine(); strLine != null })
        res.add(strLine)
      br.close()
      res
    }

This is how I use it:

    val p = sc.textFile("file:///path/test.txt")

    // (bigram, count) for every bigram in the file
    val result11 = p
      .flatMap(line => biTuple(line))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // for every bigram, score it against every sentence in the file
    val result10 = result11
      .flatMap { tuple => allSentences().map(tuple._1 -> _) }
      .map(tuple => (tuple._1, count10(tuple._1, tuple._2)))
      .reduceByKey(_ + _)

I am almost certain the problem is in .flatMap { tuple => allSentences().map(tuple._1 -> _) }, but is there another way to do this?

P.S.: biTuple() returns an ArrayList of all the bigrams of a line. count10() returns 1 if the first word of the bigram appears in the sentence but the second word does not. result11 is an RDD of all bigrams with their counts, in the form ("word1 word2", count).
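For illustration, minimal sketches of what biTuple() and count10() do (simplified versions, not the exact implementations; the flatMap/map calls above also assume a Java-to-Scala collection conversion such as scala.collection.JavaConversions is in scope):

    import java.util.ArrayList

    // Sketch: all adjacent word pairs of a line, e.g. "a b c" -> ["a b", "b c"]
    def biTuple(line: String): ArrayList[String] = {
      val res = new ArrayList[String]()
      val words = line.split("\\s+")
      for (i <- 0 until words.length - 1)
        res.add(words(i) + " " + words(i + 1))
      res
    }

    // Sketch: 1 if the bigram's first word appears in the sentence but the second does not
    def count10(bigram: String, sentence: String): Int = {
      val Array(w1, w2) = bigram.split(" ", 2)
      val words = sentence.split("\\s+").toSet
      if (words.contains(w1) && !words.contains(w2)) 1 else 0
    }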

Here is the error output:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
        at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
        at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Note that I have SPARK_WORKER_MEMORY=90G and SPARK_DRIVER_MEMORY=90G.

1 Answer:

Answer 0 (score: 2)

It looks like what you are doing is a cartesian product of result11 and p (your original list of sentences), but you are doing it by opening and reading the entire sentence file into memory for every single entry in result11. That is certainly going to put pressure on the garbage collector, although I can't say for sure it is what is causing the GC problem. Spark has a cartesian method on RDDs that would probably work better, if my interpretation of what you are trying to do is correct. (It will, however, copy a lot of data across the network.)
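For example (just a sketch, reusing result11 and count10 from your question, and loading the sentences once as an RDD instead of re-reading the file for every bigram; untested):

    // Pair every bigram with every sentence via cartesian, then score and sum per bigram.
    val sentences = sc.textFile("file:///path/test.txt")

    val result10 = result11
      .cartesian(sentences)                     // ((bigram, count), sentence) pairs
      .map { case ((bigram, _), sentence) => (bigram, count10(bigram, sentence)) }
      .reduceByKey(_ + _)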

You could also look into whether the count10 logic should be applied in a filter operation instead, which would reduce the number of entries the final reduceByKey has to process.
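Roughly, that would look something like this (again only a sketch, using the same assumed names as above):

    // Keep only the (bigram, sentence) pairs that count10 would score as 1,
    // so reduceByKey sees far fewer entries. Note that bigrams whose total
    // would be 0 will simply be absent from the result instead of mapping to 0.
    val result10 = result11
      .cartesian(sentences)
      .filter { case ((bigram, _), sentence) => count10(bigram, sentence) == 1 }
      .map { case ((bigram, _), _) => (bigram, 1) }
      .reduceByKey(_ + _)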