Exit status: -100. Diagnostics: Container released on a *lost* node

Time: 2019-02-12 10:19:36

Tags: scala apache-spark hadoop apache-spark-sql

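I have 2 input files (one JSON, the other Parquet). I am trying to join these 2 large DataFrames and write the joined DataFrame to S3 as JSON. The job gets stuck indefinitely at the step that writes the joined JSON to S3. I am using 70 r3.4xlarge instances (slaves).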

df1.rdd.partitions.size = 34234 (size ~4 TB)

df2.rdd.partitions.size = 1200 (size ~58 GB)

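Things I have tried, with no improvement:

Dynamic allocation with maximize resource allocation set to true.

Static allocation: spark.executor.cores = 5, spark.executor.memory = 40G, spark.executor.instances = 209.

Changing partitions: I experimented with spark.default.parallelism and spark.sql.shuffle.partitions set to 2000, 4000, 8000, 10000, 20000 and 35000, but it did not help.

Intermediate persistence: persisting the joined df (with both the memory-and-disk and disk-only storage levels); persisting both inputs before the join, performing some action on both dfs, then joining and writing to S3.

Tuning mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize (to 750000000).

I also tried using 30 r3.8xlarge instances. No improvement ☹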
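As a rough sanity check on the figures above, the average partition sizes they imply can be worked out with a few lines of arithmetic. This is only a sketch: it assumes binary units (1 TB = 1024^4 bytes), an even distribution of data across partitions, and a shuffle that moves roughly as much data as the inputs contain. None of these is confirmed by the logs, and a skewed join key would make individual partitions far larger than the averages shown.

```python
# Rough per-partition sizes implied by the figures in the question.
# Assumptions: binary units, evenly distributed data, and shuffled data
# about as large as the combined inputs (none of this is measured).

TB = 1024 ** 4
GB = 1024 ** 3
MB = 1024 ** 2


def avg_partition_mb(total_bytes, num_partitions):
    """Average partition size in MB for an evenly distributed dataset."""
    return total_bytes / num_partitions / MB


df1_bytes = 4 * TB   # df1: ~4 TB in 34234 partitions
df2_bytes = 58 * GB  # df2: ~58 GB in 1200 partitions

input_sizes = {
    "df1": avg_partition_mb(df1_bytes, 34234),
    "df2": avg_partition_mb(df2_bytes, 1200),
}

# Values reported as tried for spark.sql.shuffle.partitions.
tried = [2000, 4000, 8000, 10000, 20000, 35000]
shuffle_sizes = {n: avg_partition_mb(df1_bytes + df2_bytes, n) for n in tried}

for name, mb in input_sizes.items():
    print(f"{name}: ~{mb:.0f} MB per input partition")
for n, mb in shuffle_sizes.items():
    print(f"{n} shuffle partitions: ~{mb:.0f} MB per partition on average")
```

Under these assumptions, only the larger shuffle partition counts bring the average partition down to the order of a hundred megabytes. The smaller counts imply averages in the gigabyte range.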

I keep getting one of these two errors:

zeppelin-interpreter-spark-zeppelin-ip-10-0-1-213.log: WARN [2019-02-12 04:54:43,437] ({dispatcher-event-loop-8} Logging.scala[logWarning]:66) - Lost task 24117.0 in stage 3.0 (TID 32666, ip-10-0-1-242.ec2.internal, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1549914591854_0018_01_000010 on host: ip-10-0-1-242.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node

org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:213)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  at org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)
  ... 48 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2234 in stage 15.0 failed 4 times, most recent failure: Lost task 2234.3 in stage 15.0 (TID 136390, ip-10-0-1-56.ec2.internal, executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Slave lost
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1708)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1696)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1695)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1695)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1923)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1878)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1867)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:186)
  ... 82 more

Can someone tell me what I am doing wrong?

1 answer:

Answer 0: (score: 0)

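It looks like the executors are being lost because of memory problems. Try tuning the Spark settings in the spark-defaults.conf file, or try increasing the compute resources.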
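One way to check whether memory is plausibly oversubscribed is to compare the size of each executor container with the memory YARN can hand out on a node. The sketch below assumes the Spark 2.x on YARN default for executor memory overhead, max(384 MiB, 10% of spark.executor.memory). The per-node YARN limit of 114 GiB is a hypothetical placeholder and does not come from the post. Read the real value from yarn.nodemanager.resource.memory-mb on the cluster. The sketch also ignores YARN's rounding of container requests to the scheduler's allocation increment.

```python
# Sketch: does the requested executor layout fit in a node's YARN memory?
# Assumptions (not from the post): default overhead of max(384 MiB, 10% of
# executor memory), and a hypothetical per-node YARN memory limit.


def container_mb(executor_memory_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Memory YARN must grant per executor: heap plus off-heap overhead."""
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead


def executors_per_node(node_yarn_mb, executor_memory_mb):
    """How many whole executor containers fit in one node's YARN memory."""
    return node_yarn_mb // container_mb(executor_memory_mb)


executor_memory_mb = 40 * 1024   # spark.executor.memory = 40G (from the question)
requested_executors = 209        # spark.executor.instances (from the question)
nodes = 70                       # worker count (from the question)
node_yarn_mb = 114 * 1024        # hypothetical; check yarn.nodemanager.resource.memory-mb

per_container = container_mb(executor_memory_mb)
fit = executors_per_node(node_yarn_mb, executor_memory_mb)
cluster_capacity = fit * nodes

print(f"container size: {per_container} MB")
print(f"executors per node: {fit}")
print(f"cluster capacity: {cluster_capacity} of {requested_executors} requested")
```

If the real per-node limit is close to the placeholder, fewer executors fit than were requested. In that case, lowering spark.executor.memory or raising the overhead setting explicitly may be worth trying before adding nodes.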