org.apache.spark.SparkException: Job aborted due to stage failure

Date: 2019-03-25 12:08:55

Tags: python-3.x pyspark amazon-emr

I get the error below when running my Spark job. It processes roughly 10 million records: I am comparing two files, each with 5 million records, both read from S3. I am running on a 5-node AWS EMR cluster with m4.2xlarge instances.

I tried the following command:

spark-submit test-script.py --driver-memory 2G --num-executors 4 --executor-cores 4 --executor-memory 4G
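
Note: spark-submit treats everything after the application file as arguments to the application itself, so the options above are most likely being ignored and the defaults used instead. The options have to come before the script, e.g.:

spark-submit --driver-memory 2G --num-executors 4 --executor-cores 4 --executor-memory 4G test-script.py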

Error log:

py4j.protocol.Py4JJavaError: An error occurred while calling o210.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 780 tasks (1031.6 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1753)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1741)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1740)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1740)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:871)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1974)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1923)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1912)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:682)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
        at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
        at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3195)
        at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3192)
        at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
        at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3192)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
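
For context, the failing call is o210.collectToPython, i.e. a collect()/toPandas() that ships every partition's result to the driver; the serialized results of 780 tasks (1031.6 MB) exceed the default spark.driver.maxResultSize of 1024 MB. Below is a minimal sketch of the two usual workarounds, raising the limit and keeping the result distributed; the S3 paths, app name, and subtract-based comparison are hypothetical stand-ins, not the script from the question:

from pyspark.sql import SparkSession

# Raising the cap works around the error ("0" removes the cap entirely,
# at the risk of exhausting driver memory instead).
spark = (SparkSession.builder
         .appName("compare-files")           # hypothetical app name
         .config("spark.driver.maxResultSize", "4g")
         .getOrCreate())

# Hypothetical inputs: two 5M-record files read from S3.
df1 = spark.read.csv("s3://my-bucket/file1.csv", header=True)
df2 = spark.read.csv("s3://my-bucket/file2.csv", header=True)

# Writing the comparison result straight back to S3 keeps it on the
# executors, so no collect()/toPandas() runs and the maxResultSize
# limit never applies.
diff = df1.subtract(df2)
diff.write.mode("overwrite").csv("s3://my-bucket/diff-output/")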

0 Answers:

There are no answers yet.