I'm having a lot of trouble getting a simple count to work on about 55 files on HDFS totaling roughly 1B records. Both spark-shell and PySpark fail with OOM errors. I'm using YARN, MapR, Spark 1.3.1, and HDFS 2.4.1. (It fails in local mode as well.) I've tried following the tuning and configuration advice, throwing more and more memory at the executors. My configuration is:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("pyspark-testing")
        .set("spark.executor.memory", "6g")
        .set("spark.driver.memory", "6g")
        .set("spark.executor.instances", "20")
        .set("spark.yarn.executor.memoryOverhead", "1024")
        .set("spark.yarn.driver.memoryOverhead", "1024")
        .set("spark.yarn.am.memoryOverhead", "1024")
        )
sc = SparkContext(conf=conf)
sc.textFile('/data/on/hdfs/*.csv').count()  # fails every time
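As a sanity check, I also dump the conf to verify that the values above are what the context actually sees (SparkConf.getAll() just returns the key/value pairs that were set; this is only a sanity check on my end, nothing exotic):

# Sanity check: print every setting on the SparkConf built above.
for key, value in sorted(conf.getAll()):
    print("%s=%s" % (key, value))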
The job gets split into 893 tasks, and after about 50 of them complete successfully, many start failing. I see ExecutorLostFailure in the application's stderr. Digging through the executor logs, I see errors like the following:
15/06/24 16:54:07 ERROR util.Utils: Uncaught exception in thread stdout writer for /work/analytics2/analytics/python/envs/santon/bin/python
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:331)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
at org.apache.hadoop.io.Text.decode(Text.java:406)
at org.apache.hadoop.io.Text.decode(Text.java:383)
at org.apache.hadoop.io.Text.toString(Text.java:281)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:379)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
15/06/24 16:54:07 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout writer for /work/analytics2/analytics/python/envs/santon/bin/python,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:331)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
at org.apache.hadoop.io.Text.decode(Text.java:406)
at org.apache.hadoop.io.Text.decode(Text.java:383)
at org.apache.hadoop.io.Text.toString(Text.java:281)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:379)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
15/06/24 16:54:07 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
In the stdout:
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p"
# Executing /bin/sh -c "kill 16490"...
In general, I think I understand OOM errors and how to troubleshoot them, but I'm stuck conceptually here. This is just a simple count; I don't understand how the Java heap could possibly be overflowing when the executors have ~3G heaps. Has anyone run into this before, or have any pointers? Is there something going on under the hood that would shed light on the issue?
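To make the mismatch concrete, here's the rough arithmetic I keep coming back to (the bytes-per-line figure is purely an assumption on my part, not something I've measured):

# Back-of-envelope: raw text per task, assuming ~200 bytes per line.
records = 1e9           # total records across the ~55 files
tasks = 893             # tasks the count gets split into
bytes_per_line = 200.0  # assumed average line length (a guess)
gb_per_task = records / tasks * bytes_per_line / 1e9
print(gb_per_task)      # ~0.22 GB of raw text per task, versus a ~3G executor heap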
Update:
I've also noticed that if I specify the parallelism explicitly (for example sc.textFile(..., 1000)), even setting it to the same number as the original 893 tasks, the created job has 920 tasks, all but the last of which complete without error. Then the very last task hangs indefinitely. This seems exceedingly strange!
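Concretely, the variant described in this update looks roughly like the following (the getNumPartitions() call is just something I added to illustrate; I'm assuming the 920 tasks correspond one-to-one to the partitions):

rdd = sc.textFile('/data/on/hdfs/*.csv', 893)  # explicit minPartitions, same as the original task count
print(rdd.getNumPartitions())                  # presumably 920, matching the task count I see
rdd.count()                                    # every task finishes except the last, which hangs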