I'm having a lot of trouble getting a simple count to work on about 55 files on HDFS totaling roughly 1B records. Both spark-shell and PySpark fail with OOM errors. I'm using YARN, MapR, Spark 1.3.1, and HDFS 2.4.1. (It fails in local mode as well.) I've tried following the tuning and configuration advice, throwing more and more memory at the executors. My configuration is:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("pyspark-testing")
        .set("spark.executor.memory", "6g")
        .set("spark.driver.memory", "6g")
        .set("spark.executor.instances", "20")
        .set("spark.yarn.executor.memoryOverhead", "1024")
        .set("spark.yarn.driver.memoryOverhead", "1024")
        .set("spark.yarn.am.memoryOverhead", "1024")
        )
sc = SparkContext(conf=conf)
sc.textFile('/data/on/hdfs/*.csv').count()  # fails every time
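As a sanity check, I also dump the conf to verify that the values above are what the context actually sees (SparkConf.getAll() just returns the key/value pairs that were set; this is only a sanity check on my end, nothing exotic):

# Sanity check: print every setting on the SparkConf built above.
for key, value in sorted(conf.getAll()):
    print("%s=%s" % (key, value))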
The job gets split into 893 tasks, and after about 50 of them complete successfully, many start failing. I see ExecutorLostFailure in the application's stderr. Digging through the executor logs, I see errors like the following:
15/06/24 16:54:07 ERROR util.Utils: Uncaught exception in thread stdout writer for /work/analytics2/analytics/python/envs/santon/bin/python
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:331)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
at org.apache.hadoop.io.Text.decode(Text.java:406)
at org.apache.hadoop.io.Text.decode(Text.java:383)
at org.apache.hadoop.io.Text.toString(Text.java:281)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:379)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
15/06/24 16:54:07 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout writer for /work/analytics2/analytics/python/envs/santon/bin/python,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:331)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
at org.apache.hadoop.io.Text.decode(Text.java:406)
at org.apache.hadoop.io.Text.decode(Text.java:383)
at org.apache.hadoop.io.Text.toString(Text.java:281)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:379)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
15/06/24 16:54:07 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
In the stdout:
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p"
# Executing /bin/sh -c "kill 16490"...
In general, I think I understand OOM errors and how to troubleshoot them, but I'm stuck conceptually here. This is just a simple count; I don't understand how the Java heap could possibly be overflowing when the executors have ~3G heaps. Has anyone run into this before, or have any pointers? Is there something going on under the hood that would shed light on the issue?
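To make the mismatch concrete, here's the rough arithmetic I keep coming back to (the bytes-per-line figure is purely an assumption on my part, not something I've measured):

# Back-of-envelope: raw text per task, assuming ~200 bytes per line.
records = 1e9           # total records across the ~55 files
tasks = 893             # tasks the count gets split into
bytes_per_line = 200.0  # assumed average line length (a guess)
gb_per_task = records / tasks * bytes_per_line / 1e9
print(gb_per_task)      # ~0.22 GB of raw text per task, versus a ~3G executor heap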
Update:
I've also noticed that if I specify the parallelism explicitly (for example sc.textFile(..., 1000)), even setting it to the same number as the original 893 tasks, the created job has 920 tasks, all but the last of which complete without error. Then the very last task hangs indefinitely. This seems exceedingly strange!
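Concretely, the variant described in this update looks roughly like the following (the getNumPartitions() call is just something I added to illustrate; I'm assuming the 920 tasks correspond one-to-one to the partitions):

rdd = sc.textFile('/data/on/hdfs/*.csv', 893)  # explicit minPartitions, same as the original task count
print(rdd.getNumPartitions())                  # presumably 920, matching the task count I see
rdd.count()                                    # every task finishes except the last, which hangs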