Question

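I tried to broadcast a not-so-large map (about 70 MB when saved to HDFS as a text file), and I got an out-of-memory error. I tried increasing the driver memory to 11G and the executor memory to 11G, but I still get the same error. memory.fraction is set to 0.3, and there is not much cached data either (less than 1G).

When the map is only about 2 MB, there is no problem. I wonder whether there is a size limit when broadcasting objects. How can I solve this problem with the larger map? Thanks!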

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
    at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
    at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:159)
    at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:229)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:194)
    at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:186)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:54)
    at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
    at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
    at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:165)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:143)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:648)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1006)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1327)

EDIT: adding more information per the comments:

I use spark-submit to submit the compiled jar file in client mode. Spark 1.5.0
spark.yarn.executor.memoryOverhead 600
set("spark.kryoserializer.buffer.max", "256m")
set("spark.speculation", "true")
set("spark.storage.memoryFraction", "0.3")
set("spark.driver.memory", "15G")
set("spark.executor.memory", "11G")
I tried set("spark.sql.tungsten.enabled", "false") and it didn't help.
The host has 60G of memory. Around 30G is for Spark/Yarn. I'm not sure what the heap size for my job is, but no other processes are running at the same time. In particular, the map is only around 70MB.

Some code related to the broadcast:

val mappingAllLocal: Map[String, Int] = mappingAll.rdd.map(r => (r.getAs[String](0), r.getAs[Int](1))).collectAsMap().toMap
// I can save the above mappingAll to HDFS, and it's around 70MB
val mappingAllBrd = sc.broadcast(mappingAllLocal) // <-- this is where the out of memory happens

Answer 1

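Using set("spark.driver.memory", "15G") has no effect in client mode, because the driver JVM is already running by the time the application builds its configuration. To increase the driver's heap size, you have to pass the setting on the command line when submitting the application, for example --driver-memory 15G or --conf spark.driver.memory=15G.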
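
To confirm whether the driver really received the larger heap, one option is to print the JVM's maximum heap from inside the application before calling sc.broadcast. The following is only a minimal sketch that uses the standard Runtime API and no Spark classes; the names HeapCheck, maxHeapBytes, and toGiB are hypothetical and chosen for illustration.

```scala
object HeapCheck {
  // Maximum heap the current JVM will attempt to use, in bytes.
  // This roughly reflects the -Xmx value the JVM was started with.
  def maxHeapBytes: Long = Runtime.getRuntime.maxMemory

  // Convert a byte count to GiB.
  def toGiB(bytes: Long): Double = bytes.toDouble / (1L << 30)

  def main(args: Array[String]): Unit =
    println(s"Driver max heap: ${toGiB(maxHeapBytes)} GiB")
}
```

If the printed value stays near the default even though spark.driver.memory is set in code, that suggests the setting was applied too late and should be moved to the spark-submit command line.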

Answer 2

You can try increasing the JVM heap size:

-Xmx2g : max size of 2 GB
-Xms2g : initial size of 2 GB (default size is 256 MB)
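
A quick way to see which heap flags the JVM was actually started with is to list its startup arguments. The sketch below uses the standard java.lang.management API; the object name JvmFlags and the helper heapFlags are hypothetical. With spark-submit, the heap size is usually set through --driver-memory and --executor-memory, which end up as -Xmx on the launched JVMs, so this check should apply there too.

```scala
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

object JvmFlags {
  // Keep only the heap-size flags (-Xmx / -Xms) from a list of JVM arguments.
  def heapFlags(jvmArgs: Seq[String]): Seq[String] =
    jvmArgs.filter(a => a.startsWith("-Xmx") || a.startsWith("-Xms"))

  def main(args: Array[String]): Unit = {
    // Arguments passed to the JVM itself, not to the application's main method.
    val jvmArgs = ManagementFactory.getRuntimeMXBean.getInputArguments.asScala.toList
    println(s"Heap flags: ${heapFlags(jvmArgs).mkString(" ")}")
  }
}
```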

Spark: out of memory when broadcasting objects
