Strange org.apache.spark.SparkException: Job aborted due to stage failure, again

Posted: 2014-12-02 09:13:13

Tags: apache-spark apache-spark-mllib tf-idf

I am trying to deploy a Spark application in standalone mode. In this application I train a naive Bayes classifier on TF-IDF vectors.

I wrote it in a similar way to (Spark MLLib TFIDF implementation for LogisticRegression); the difference is that I tokenize each document and normalize it myself.

// Build one Document per input file; parsingFunction reads, tokenizes and normalizes the file.
JavaRDD<Document> termDocsRdd = sc.parallelize(fileNameList).flatMap(
        new FlatMapFunction<String, Document>() {
            @Override
            public Iterable<Document> call(String fileName)
            {
                return Arrays.asList(parsingFunction(fileName));
            }
        });

So each instance of Document has a textField, which contains the normalized text of the document as a list of strings (a list of words), and a labelField, which contains the document's label as a double. parsingFunction does not use any Spark operations such as map or flatMap, so it does not do any data distribution itself.
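The Document class itself is not shown here; based on the description above it presumably looks something like this minimal sketch (the field and accessor names are assumptions). Since Document instances are created inside a closure and shipped between driver and executors, the class has to be serializable:

import java.io.Serializable;
import java.util.List;

// Minimal sketch of the Document holder described above (field and accessor names are assumed).
public class Document implements Serializable {
    private final List<String> textField;   // normalized document text as a list of words
    private final double labelField;        // document label as a double

    public Document(List<String> textField, double labelField) {
        this.textField = textField;
        this.labelField = labelField;
    }

    public List<String> getTextField() { return textField; }

    public double getLabelField() { return labelField; }
}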

When I run the application in local mode it works fine, and at prediction time the classifier classifies the test documents correctly. But when I try to run it in standalone mode I run into trouble:

When I start the master and the worker on the same machine, the application works, but the prediction results are worse than in local mode. When I start the master on one machine and the worker on another machine, the application crashes with the following error:

14/12/02 11:19:17 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, fujitsu10.inevm.ru, PROCESS_LOCAL, 1298 bytes)
14/12/02 11:19:17 INFO scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on executor fujitsu10.inevm.ru: java.lang.NullPointerException (null) [duplicate 1]
14/12/02 11:19:17 INFO scheduler.TaskSetManager: Starting task 1.1 in stage 0.0 (TID 4, fujitsu10.inevm.ru, PROCESS_LOCAL, 1298 bytes)
14/12/02 11:19:17 INFO scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) on executor fujitsu10.inevm.ru: java.lang.NullPointerException (null) [duplicate 2]
14/12/02 11:19:17 INFO scheduler.TaskSetManager: Starting task 2.1 in stage 0.0 (TID 5, fujitsu10.inevm.ru, PROCESS_LOCAL, 1298 bytes)
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on executor fujitsu10.inevm.ru: java.lang.NullPointerException (null) [duplicate 3]
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Starting task 0.2 in stage 0.0 (TID 6, fujitsu10.inevm.ru, PROCESS_LOCAL, 1298 bytes)
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Lost task 2.1 in stage 0.0 (TID 5) on executor fujitsu10.inevm.ru: java.lang.NullPointerException (null) [duplicate 4]
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Starting task 2.2 in stage 0.0 (TID 7, fujitsu10.inevm.ru, PROCESS_LOCAL, 1298 bytes)
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Lost task 1.1 in stage 0.0 (TID 4) on executor fujitsu10.inevm.ru: java.lang.NullPointerException (null) [duplicate 5]
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Starting task 1.2 in stage 0.0 (TID 8, fujitsu10.inevm.ru, PROCESS_LOCAL, 1298 bytes)
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 0.0 (TID 6) on executor fujitsu10.inevm.ru: java.lang.NullPointerException (null) [duplicate 6]
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Starting task 0.3 in stage 0.0 (TID 9, fujitsu10.inevm.ru, PROCESS_LOCAL, 1298 bytes)
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Lost task 1.2 in stage 0.0 (TID 8) on executor fujitsu10.inevm.ru: java.lang.NullPointerException (null) [duplicate 7]
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Starting task 1.3 in stage 0.0 (TID 10, fujitsu10.inevm.ru, PROCESS_LOCAL, 1298 bytes)
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Lost task 2.2 in stage 0.0 (TID 7) on executor fujitsu10.inevm.ru: java.lang.NullPointerException (null) [duplicate 8]
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Starting task 2.3 in stage 0.0 (TID 11, fujitsu10.inevm.ru, PROCESS_LOCAL, 1298 bytes)
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Lost task 2.3 in stage 0.0 (TID 11) on executor fujitsu10.inevm.ru: java.lang.NullPointerException (null) [duplicate 9]
14/12/02 11:19:18 ERROR scheduler.TaskSetManager: Task 2 in stage 0.0 failed 4 times; aborting job
14/12/02 11:19:18 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
14/12/02 11:19:18 INFO scheduler.TaskSchedulerImpl: Stage 0 was cancelled
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Lost task 1.3 in stage 0.0 (TID 10) on executor fujitsu10.inevm.ru: java.lang.NullPointerException (null) [duplicate 10]
14/12/02 11:19:18 INFO scheduler.DAGScheduler: Failed to run reduce at RDDFunctions.scala:111
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 11, fujitsu10.inevm.ru): java.lang.NullPointerException: 
    maven.maven1.App$3.call(App.java:178)
    maven.maven1.App$3.call(App.java:1)
    org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:923)
    scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)
    org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
    org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/12/02 11:19:18 INFO scheduler.TaskSetManager: Lost task 0.3 in stage 0.0 (TID 9) on executor fujitsu10.inevm.ru: java.lang.NullPointerException (null) [duplicate 11]
14/12/02 11:19:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 

In the logs I also found:

14/12/02 11:19:20 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@fujitsu11:7077] -> [akka.tcp://sparkDriver@fujitsu11.inevm.ru:54481]: Error [Association failed with         [akka.tcp://sparkDriver@fujitsu11.inevm.ru:54481]] [
akka.remote.EndpointAssociationException: Association failed with     [akka.tcp://sparkDriver@fujitsu11.inevm.ru:54481]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection     refused: no further information: fujitsu11.inevm.ru/192.168.3.5:54481
]

I debugged the application and found that it crashes right after this line:

IDFModel idfModel = new IDF().fit(hashedData);
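For context, the code around that line presumably follows the approach from the linked question, roughly like the sketch below (Spark 1.1.0 MLlib API; the variable names, the accessors on Document and the way labels are paired back with the TF-IDF vectors are assumptions, since only the IDF().fit call is shown above):

import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.feature.IDF;
import org.apache.spark.mllib.feature.IDFModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;

// Term frequencies: one hashed vector per document, built from the token lists.
JavaRDD<List<String>> tokens = termDocsRdd.map(new Function<Document, List<String>>() {
    @Override
    public List<String> call(Document doc) {
        return doc.getTextField();          // assumed accessor for the word list
    }
});
HashingTF hashingTF = new HashingTF();
JavaRDD<Vector> hashedData = hashingTF.transform(tokens);
hashedData.cache();                         // hashedData is used twice (fit and transform)

// This is the call after which the job dies in standalone mode.
IDFModel idfModel = new IDF().fit(hashedData);
JavaRDD<Vector> tfidf = idfModel.transform(hashedData);

// Pair the TF-IDF vectors back with the labels (assumes partition order is preserved,
// as in the linked LogisticRegression example).
JavaRDD<Double> labels = termDocsRdd.map(new Function<Document, Double>() {
    @Override
    public Double call(Document doc) {
        return doc.getLabelField();         // assumed accessor for the label
    }
});
JavaRDD<LabeledPoint> training = labels.zip(tfidf).map(
    new Function<Tuple2<Double, Vector>, LabeledPoint>() {
        @Override
        public LabeledPoint call(Tuple2<Double, Vector> t) {
            return new LabeledPoint(t._1(), t._2());
        }
    });

NaiveBayesModel model = NaiveBayes.train(training.rdd());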

Maybe somebody knows what is going on here?

Thanks.

P.S. I am using Spark 1.1.0 on Windows 7 x64. Both machines have an 8-core CPU and 16 GB of RAM.

0 Answers:

There are no answers yet.