Question

A simple word count program in Spark does not spill to disk and fails with an OOM error. In short:

Environment:

Spark: 2.3.0, Scala 2.11.8
3 x Executor, each: 1 core + 512 MB RAM
Text file: 341 MB
Other configurations are default (spark.memory.fraction = 0.6)

Code:

import org.apache.spark.SparkContext

object WordCount {

    def main(args: Array[String]): Unit = {

        val inPath = args(0)

        val sc = new SparkContext("spark://master:7077", "Word Count ver3")
        val words = sc.textFile(inPath, minPartitions = 20)
                      .map(line => line.toLowerCase())
                      .flatMap(text => text.split(' '))
        val wc = words.groupBy(word => word)
                      .map({ case (groupName, groupList) => (groupName, groupList.size) })
                      .count()
    }
}

Heap dump:

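The questions are:

The execution memory should be (512 - 300) * 0.6 = 127 MB (I don't use caching). Why, then, is the ExternalAppendOnlyMap larger than 380 MB? The object has to live on the heap, so it cannot be larger than the heap.
ExternalAppendOnlyMap is a spillable class, so when memory runs short, as it does here, it should spill its data to disk. It did not, and the result was the OOM error.
The program's heap is divided into Spark execution memory and user memory. Looking at the heap dump, which objects are stored in which part of the heap?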
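
For reference, the arithmetic behind the first point can be sketched as below. This assumes the Spark 2.x unified memory model, in which 300 MB of heap is reserved and spark.memory.fraction of the remainder is shared by execution and storage. The object and method names are hypothetical helpers, not Spark APIs, and the real figure depends on the heap size the JVM actually reports, which can be somewhat below the configured 512 MB.

```scala
object MemoryEstimate {
  // Assumption: Spark 2.x reserves a fixed 300 MB of heap before applying the fraction.
  val ReservedMb: Double = 300.0

  // Unified region (execution + storage) = (heap - reserved) * spark.memory.fraction
  def unifiedMemoryMb(heapMb: Double, memoryFraction: Double): Double =
    (heapMb - ReservedMb) * memoryFraction

  // Whatever is left of the non-reserved heap is user memory.
  def userMemoryMb(heapMb: Double, memoryFraction: Double): Double =
    (heapMb - ReservedMb) - unifiedMemoryMb(heapMb, memoryFraction)

  def main(args: Array[String]): Unit = {
    val heapMb = 512.0   // executor memory from the environment above
    val fraction = 0.6   // default spark.memory.fraction
    println(f"unified (execution + storage): ${unifiedMemoryMb(heapMb, fraction)}%.1f MB")
    println(f"user memory: ${userMemoryMb(heapMb, fraction)}%.1f MB")
  }
}
```

With no cached data, execution can borrow the whole unified region, which is where the figure of roughly 127 MB comes from.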

Thank you very much for your time.

Why does Spark shuffle not spill to disk?

0 answers: