Spark job fails after running the same map 3 times

Time: 2016-10-25 05:59:17

Tags: scala hadoop apache-spark dataframe yarn

My job has a step where I convert a DataFrame to an RDD[(key, value)], but the step runs three times, gets stuck on the third attempt, and fails.

The Spark UI shows:

Active Jobs (1)

  Job Id (Job Group)      Description    Submitted  Duration    Stages: Succeeded/Total Tasks (for all stages): Succeeded/Total

    3 (zeppelin-20161017-005442_839671900)   Zeppelin map at <console>:69      2016/10/25 05:50:02  1.6 min     0/1      210/45623

Completed Jobs (2)

  2 (zeppelin-20161017-005442_839671900)    Zeppelin map at <console>:69    2016/10/25 05:16:28     23 min  1/1       46742/46075 (21 failed)
  1 (zeppelin-20161017-005442_839671900)    Zeppelin map at <console>:69    2016/10/25 04:47:58     17 min  1/1        47369/46795 (20 failed) 

Here is the code:

    val eventsRDD = eventsDF.map { r =>
      val customerId = r.getAs[String]("customerId")
      val itemId = r.getAs[String]("itemId")
      val countryId = r.getAs[Long]("countryId").toInt
      val timeStamp = r.getAs[String]("eventTimestamp")

      val totalRent = r.getAs[Int]("totalRent")
      val totalPurchase = r.getAs[Int]("totalPurchase")
      val totalProfit = r.getAs[Int]("totalProfit")

      val store = r.getAs[String]("store")
      val rawItemName = r.getAs[String]("itemName")

      // The null check must come first: calling .nonEmpty on a null String
      // throws a NullPointerException before `itemName != null` is ever
      // evaluated (and redeclaring `val itemName` would not compile).
      val itemName = if (rawItemName != null && rawItemName.nonEmpty) rawItemName else "NA"

      (itemId, (customerId, countryId, timeStamp, totalRent, totalProfit, totalPurchase, store, itemName))
    }
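As a side note, a null-safe way to derive itemName is to wrap the nullable column in an Option; a minimal sketch, assuming the same "itemName" column:

    // Option(null) evaluates to None, so no null is ever dereferenced;
    // filter(_.nonEmpty) also maps empty strings to the "NA" default.
    val itemName = Option(r.getAs[String]("itemName")).filter(_.nonEmpty).getOrElse("NA")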

Can anyone tell what is wrong here? And if I want to persist/cache, which one should I use?

Error:

16/10/25 23:28:55 INFO YarnClientSchedulerBackend: Asked to remove non-existent executor 181
16/10/25 23:28:55 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1477415847345_0005_02_031011 on host: ip-172-31-14-104.ec2.internal. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_1477415847345_0005_02_031011
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
                at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
                at org.apache.hadoop.util.Shell.run(Shell.java:456)
                at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
                at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
                at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
                at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)

1 Answer:

Answer 0 (score: 0)

Your map operation is raising an error, which causes the driver to mark the tasks as failed.

By default, spark.task.maxFailures has a value of 4, which controls the:

  Number of failures of any particular task before giving up on the job. The total number of failures spread across different tasks will not cause the job to fail; a particular task has to fail this number of attempts. Should be greater than or equal to 1. Number of allowed retries = this value - 1.

So when a task fails, Spark tries to recompute the map operation until it has failed 4 times in total.
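If the failures are transient, you can raise this limit when creating the context; a minimal sketch, where the value 8 and the app name are only illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Allow up to 8 attempts per task, i.e. 7 retries (the default is 4).
    val conf = new SparkConf()
      .setAppName("events")
      .set("spark.task.maxFailures", "8")
    val sc = new SparkContext(conf)

Raising the limit only buys more retries, though; it does not fix whatever is killing the executors in the first place.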

If I want to persist/cache, which one should I use? cache is just a specific case of persist, in which the RDD is persisted with the default storage level (MEMORY_ONLY).
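For example, against the eventsRDD above (a sketch; MEMORY_AND_DISK is just one possible level):

    import org.apache.spark.storage.StorageLevel

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
    eventsRDD.cache()

    // ...or pick an explicit level; MEMORY_AND_DISK spills partitions to disk
    // instead of recomputing them under memory pressure. An RDD can only be
    // assigned one storage level, so use one call or the other:
    // eventsRDD.persist(StorageLevel.MEMORY_AND_DISK)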