Spark: how to retrieve the original data from a failed stage?

Date: 2016-03-25 08:22:35

Tags: java apache-spark spark-streaming

When Spark encounters an exception during processing, it retries the failing task three times, as shown in the log below, and then marks the stage as failed. I want to retrieve all of the data the failed stage was processing, so that I can analyze it later or do something else with it. How can this be done? I have been exploring SparkListener for this, but it appears to be a developer API.

Thanks.

16/03/23 18:33:00 WARN TaskSetManager: Lost task 1.0 in stage 11.0 (TID 88, 192.168.213.53): java.lang.RuntimeException: Amit baby its exception time
    at com.yourcompany.custom.identifier.JavaRecoverableNetworkWordCount$1.call(JavaRecoverableNetworkWordCount.java:141)
    at com.yourcompany.custom.identifier.JavaRecoverableNetworkWordCount$1.call(JavaRecoverableNetworkWordCount.java:131)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:203)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

16/03/23 18:33:00 INFO TaskSetManager: Starting task 1.1 in stage 11.0 (TID 89, 192.168.213.53, NODE_LOCAL, 2535 bytes)
16/03/23 18:33:00 INFO TaskSetManager: Lost task 1.1 in stage 11.0 (TID 89) on executor 192.168.213.53: java.lang.RuntimeException (Amit baby its exception time) [duplicate 1]
16/03/23 18:33:00 INFO TaskSetManager: Starting task 1.2 in stage 11.0 (TID 90, 192.168.213.53, NODE_LOCAL, 2535 bytes)
16/03/23 18:33:00 INFO TaskSetManager: Lost task 1.2 in stage 11.0 (TID 90) on executor 192.168.213.53: java.lang.RuntimeException (Amit baby its exception time) [duplicate 2]
16/03/23 18:33:00 INFO TaskSetManager: Starting task 1.3 in stage 11.0 (TID 91, 192.168.213.53, NODE_LOCAL, 2535 bytes)
16/03/23 18:33:00 INFO TaskSetManager: Lost task 1.3 in stage 11.0 (TID 91) on executor 192.168.213.53: java.lang.RuntimeException (Amit baby its exception time) [duplicate 3]
16/03/23 18:33:00 ERROR TaskSetManager: Task 1 in stage 11.0 failed 4 times; aborting job
16/03/23 18:33:00 INFO TaskSchedulerImpl: Removed TaskSet 11.0, whose tasks have all completed, from pool 
16/03/23 18:33:00 INFO TaskSchedulerImpl: Cancelling stage 11

1 Answer:

Answer 0 (score: 3)

This cannot be done. The data being processed by a task does not, in general, outlive the job it belongs to. By the time the stage has failed, the job no longer exists and the data is eligible for garbage collection. Nothing holds a reference to it, so there is no way to get at it.
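
What you can do is capture the offending record at the moment it is processed, before the job dies. Here is a minimal sketch of that workaround; the lines stream and the process() helper are hypothetical stand-ins for the user code from the question, not anything in the original post:

    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // Hypothetical wrapper around the user logic: rethrow with the failing
    // record attached, so it shows up in the executor log and in the
    // driver's "Lost task" lines like the ones above.
    JavaDStream<String> guarded = lines.map(new Function<String, String>() {
        @Override
        public String call(String line) {
            try {
                return process(line);  // process() stands in for the code that throws
            } catch (RuntimeException e) {
                throw new RuntimeException("Failed on record: " + line, e);
            }
        }
    });

This only preserves the record that triggered the exception, not the whole stage input, but in practice that is usually what you need for later analysis.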

SparkListener is indeed a DeveloperApi, but that does not mean you cannot use it. It is still a public API; it just means it is not guaranteed to remain stable across Spark versions. We have been using SparkListener for about a year now, and it has in fact been very stable. Feel free to give it a shot. But I don't think it will solve your problem.
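
To show what a listener can and cannot observe, here is a minimal sketch, assuming a Spark version where org.apache.spark.scheduler.SparkListener is an abstract class that Java can extend directly (on the 1.x line in your logs, I believe org.apache.spark.JavaSparkListener serves the same purpose). Note that it only sees stage metadata, never the records the stage was processing:

    import org.apache.spark.scheduler.SparkListener;
    import org.apache.spark.scheduler.SparkListenerStageCompleted;

    // Logs the failure reason of every failed stage. StageInfo carries
    // metadata only (stage id, name, failure reason), not the input data.
    public class FailedStageLogger extends SparkListener {
        @Override
        public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
            // failureReason() is empty for stages that completed successfully.
            if (stageCompleted.stageInfo().failureReason().isDefined()) {
                System.err.println("Stage " + stageCompleted.stageInfo().stageId()
                        + " failed: " + stageCompleted.stageInfo().failureReason().get());
            }
        }
    }

You would register it on the underlying SparkContext, e.g. jssc.sparkContext().sc().addSparkListener(new FailedStageLogger()).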

Still, it is a valid and interesting idea. Having access to the data would help with debugging. You could file a feature request in the Spark JIRA. It would not be a trivial thing to implement, though. A Spark task is far more complicated than just the user code you hand to it, so even if a task's input were made available for debugging, making good use of it would not be straightforward. In any case, I think it is worth a conversation!