I ran the following Python script with spark-submit,
r = rdd.map(list).groupBy(lambda x: x[0]).map(lambda x: x[1]).map(list)  # group rows by their first element; keep each group as a list
r_labeled = r.map(f_0).flatMap(f_1)  # f_0 and f_1 are transformations defined elsewhere in the script
r_labeled.map(lambda x: x[3]).collect()  # pull the fourth field of every record back to the driver
and it failed with java.lang.OutOfMemoryError, specifically on the collect() action in the last line:
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
17/11/08 08:27:31 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 6,5,main]
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
17/11/08 08:27:31 INFO SparkContext: Invoking stop() from shutdown hook
17/11/08 08:27:31 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 6, localhost, executor driver): java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The message says OutOfMemoryError and nothing else. Is it about the heap, garbage collection, or something else? I have no idea.
Anyway, I tried configuring everything memory-related to huge values:
spark.driver.maxResultSize = 0 # no limit
spark.driver.memory = 150g
spark.executor.memory = 150g
spark.worker.memory = 150g
(And the server has 157g of physical memory available.)
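For reference, a sketch of one way such settings can be passed to spark-submit (the script name is assumed; note that worker memory is normally set via SPARK_WORKER_MEMORY in spark-env.sh rather than through --conf):

spark-submit \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.driver.memory=150g \
  --conf spark.executor.memory=150g \
  my_script.py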
Still the same error.
Then I reduced the input data a bit, and the code ran through perfectly every time. In fact, the data obtained by collect() was only about 1.8g, far smaller than the 157g of physical memory.
Now I'm sure the error has nothing to do with the code itself, and that physical memory is not the limit. It is as if there were a threshold on the input data size, and crossing it triggers the out-of-memory error.
So how can I lift this threshold, so that I can process larger input data without memory errors? Any settings?
Thanks.
========== Follow-up ==========
According to this, the error is related to the Java serializer and large objects in map transformations. I do use large objects in my code. Wondering how to make the Java serializer accommodate them.
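One commonly suggested workaround (my own assumption, not something from the original post) is to switch from the default Java serializer to Kryo and raise its per-record buffer cap:

# Hedged sketch: enable Kryo instead of the default Java serializer.
# spark.kryoserializer.buffer.max limits one serialized record; it must stay below 2048m.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryoserializer.buffer.max", "1g"))
sc = SparkContext(conf=conf)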
Answer 0: (score: 0)
First of all, you only run into the problem when you call the collect method, because Spark is lazy: it does absolutely nothing until data has to be sent to the driver (collect, reduce, count, ...) or to disk (write, save, ...).
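A minimal sketch of that laziness (illustrative only):

# Transformations only build a plan; nothing runs until an action is called.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
nums = sc.parallelize(range(10))
doubled = nums.map(lambda x: x * 2)  # no job yet: map is a lazy transformation
print(doubled.collect())             # the job actually executes here, at the action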
Second, the out-of-memory exception seems to happen on an executor. From the stack trace I gather that your groupBy is building an array larger than the maximum allowed capacity (Integer.MAX_VALUE - 5, according to this). Is it possible that some key occurs more than 2 billion times in your dataset?
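One quick way to check, reusing the rdd variable from the question:

# Count occurrences per key and show the ten heaviest keys.
key_counts = rdd.map(list).map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)
print(key_counts.top(10, key=lambda kv: kv[1]))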
In any case, I'm not sure what you are trying to do, but if you can, try replacing the groupBy with a reduce operation, which puts far less pressure on memory.
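For instance, if whatever your f_0 computes per group can be expressed as an associative merge of two rows, something along these lines avoids materializing whole groups (merge_rows is hypothetical, standing in for your per-group logic):

pairs = rdd.map(list).map(lambda x: (x[0], x))              # key each row by its first field
reduced = pairs.reduceByKey(lambda a, b: merge_rows(a, b))  # combine rows incrementally, never building giant lists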
Finally, you gave 150g to the driver and 150g to each executor, even though only 157g is available in total; I don't know how many of each you are running. Try splitting your memory sensibly between them and tell us what happens.
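For example (illustrative numbers only, assuming the driver and executors share the one 157g machine):

spark-submit \
  --driver-memory 16g \
  --executor-memory 120g \
  my_script.py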
Hope this helps.