I have a Spark job that I stripped down completely to:
spark.read
  .option("delimiter", delimiter)
  .schema(Encoders.product[MyData].schema)
  .csv("s3://bucket/data/*/*.gz")
  .as[MyData]
in order to isolate the error, and it still throws a java.lang.OutOfMemoryError when running on AWS EMR on YARN. The total input is approximately 4.7 GB gzipped (each file is roughly 1 to 20 kB), for a total of 373,063,082 rows.
The (obfuscated) schema of MyData:
case class MyData(field1: Long, field2: String, field3: Int, field4: Float, field5: Float, field6: Option[Int] = None, field7: Option[Int])
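For completeness, here is roughly what the stripped-down application looks like as a self-contained skeleton (the delimiter and bucket path are placeholders, and the trailing count() is just a stand-in action to force the read; the real job does more):

import org.apache.spark.sql.{Encoders, SparkSession}

object MyApp {
  def main(args: Array[String]): Unit = {
    // MyData is the case class shown above, declared at top level (outside the object)
    // On EMR the master comes from YARN; locally I add .master("local[*]")
    val spark = SparkSession.builder().appName("MyApp").getOrCreate()
    import spark.implicits._

    val delimiter = "\t" // placeholder

    val ds = spark.read
      .option("delimiter", delimiter)
      .schema(Encoders.product[MyData].schema)
      .csv("s3://bucket/data/*/*.gz")
      .as[MyData]

    // Stand-in action so that Stage 0 actually reads the files
    println(ds.count())
  }
}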
The strange thing is that the job works in all of the following cases:
reading only s3://bucket/data/2017*/*.gz
reading only s3://bucket/data/2018*/*.gz
running the full dataset locally with .master("local[*]")
The only difference on the cluster is that it runs on YARN. I tested with MASTER: 1 x m4.2xlarge, CORE: 25 x m4.2xlarge, TASK: 25 x m4.2xlarge, as well as with smaller configurations, and they all failed. In the stderr logs I get:
[Stage 0:===============================================> (9536 + 411) / 10000]
[Stage 0:=================================================>(9825 + 175) / 10000]
[Stage 0:==================================================>(9964 + 36) / 10000]
[Stage 0:===================================================>(9992 + 8) / 10000]
[Stage 0:===================================================>(9997 + 3) / 10000]
[Stage 0:===================================================>(9998 + 2) / 10000]
[Stage 0:===================================================>(9999 + 1) / 10000]
Command exiting with ret '137'
And then the Spark UI freezes at around 9999/10000 tasks.
I also ran the job on s3://bucket/data/201[7-8]*/*.gz
to check whether the glob pattern was capturing more files than it was supposed to. It ended up giving the same errors.
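For what it's worth, this is the kind of check I could run to see exactly which files a given pattern resolves to and how big they are (a rough sketch using the Hadoop FileSystem API, not part of the job itself):

import org.apache.hadoop.fs.Path

// Hypothetical sanity check: list what the glob actually matches
val pattern = new Path("s3://bucket/data/201[7-8]*/*.gz")
val fs = pattern.getFileSystem(spark.sparkContext.hadoopConfiguration)
val matched = fs.globStatus(pattern)
println(s"matched ${matched.length} files, ${matched.map(_.getLen).sum} bytes gzipped in total")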
Finally, I checked Ganglia to try to figure out what was going on, but nothing really stood out.
My cluster deployment command (information starred out):
aws emr create-cluster --name $NAME --release-label emr-5.12.0 \
--log-uri s3://bucket/logs/ \
--instance-fleets InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m4.xlarge}'] \
InstanceFleetType=CORE,TargetSpotCapacity=25,InstanceTypeConfigs=['{InstanceType=m4.xlarge,BidPrice=0.2,WeightedCapacity=1}'],LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'} \
InstanceFleetType=TASK,TargetSpotCapacity=25,InstanceTypeConfigs=['{InstanceType=m4.xlarge,BidPrice=0.2,WeightedCapacity=1}'],LaunchSpecifications={SpotSpecification='{TimeoutDurationMinutes=120,TimeoutAction=SWITCH_TO_ON_DEMAND}'} \
--ec2-attributes KeyName="*****",SubnetId=subnet-d******* --use-default-roles \
--applications Name=Spark Name=Ganglia \
--steps Type=CUSTOM_JAR,Name=CopyAppFromS3,ActionOnFailure=CONTINUE,Jar="command-runner.jar",Args=[aws,s3,cp,s3://bucket/assembly-0.1.0.jar,/home/hadoop] \
Type=Spark,Name=MyApp,ActionOnFailure=CONTINUE,Args=[/home/hadoop/assembly-0.1.0.jar] --configurations file://$CONFIG_FILE --auto-terminate
I'd like to understand why Spark can't read the smaller dataset when it can read one that's 15x larger with the same cluster configuration; why it runs on my local machine but not on AWS; and why it runs on each half separately but not on both together. What kind of data could cause this? What can I do to solve this problem, or to avoid it in the future?
EDIT: Local machine is a MacBook Pro Retina 15-inch 2015 with 2.8 GHz Intel Core i7, 16 GB RAM, and 1 TB SSD.
EDIT2: I also got this in stderr once:
18/05/28 16:15:49 ERROR SparkContext: Exception getting thread dump from executor 1
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.SparkContext.getExecutorThreadDump(SparkContext.scala:607)
at org.apache.spark.ui.exec.ExecutorThreadDumpPage.render(ExecutorThreadDumpPage.scala:40)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:524)
at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)