I have deployed Apache Spark 2.0.1 with spark-jobserver 0.7.0 in standalone mode.
I have a small job to test whether a context is operational, because sometimes a context gets killed while the Java process on my server is still alive. So I double-check: first that the context exists as a system process, and then that a job can be invoked on it. The job returns some Spark configuration values and the JVM status as a JSON-formatted string:
import java.lang.management.ManagementFactory;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkContext;

import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import com.typesafe.config.Config;

// VIQ_SparkJob is our own base class; getSparkSession() populates the
// inherited sparkSession field from the given SparkContext.
public class TestJob extends VIQ_SparkJob {
    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        getSparkSession(jsc);
        String result = "{";
        result += "\"AppName\":\"" + jsc.appName() + "\",";
        result += "\"ApplicationID\":\"" + jsc.applicationId() + "\",";
        result += "\"DeployMode\":\"" + jsc.deployMode() + "\",";
        result += "\"ExecutorID\":\"" + jsc.env().executorId() + "\",";
        // All Spark configuration entries, as seen by the session.
        scala.collection.immutable.Map<String, String> all = sparkSession.conf().getAll();
        scala.collection.immutable.Set<String> keys = all.keySet();
        for (scala.collection.Iterator<String> iterator = keys.iterator(); iterator.hasNext();) {
            String next = iterator.next();
            result += "\"" + next + "\":\"" + all.get(next).get() + "\",";
        }
        // JVM-level status of the driver process.
        result += "\"JavaAvailableProcessors\":\"" + Runtime.getRuntime().availableProcessors() + "\",";
        result += "\"JavaMaxMemory\":\"" + Runtime.getRuntime().maxMemory() + "\",";
        result += "\"JavaTotalMemory\":\"" + Runtime.getRuntime().totalMemory() + "\",";
        result += "\"JavaFreeMemory\":\"" + Runtime.getRuntime().freeMemory() + "\"";
        // HotSpot diagnostic VM options, when running on a HotSpot JVM.
        final HotSpotDiagnosticMXBean hsdiag = ManagementFactory
                .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (hsdiag != null) {
            List<VMOption> vmOptions = hsdiag.getDiagnosticOptions();
            for (Iterator<VMOption> iterator = vmOptions.iterator(); iterator.hasNext();) {
                VMOption next = iterator.next();
                result += ",\"Java" + next.getName() + "\":\"" + next.getValue() + "\"";
            }
        }
        result += "}";
        return result;
    }
}
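As an aside on the job itself: concatenating the JSON by hand breaks if any config value contains a quote or backslash. A minimal escaping helper, sketched here as a self-contained class with made-up field values (independent of Spark and of the job-server API), avoids that:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the same kind of status JSON from a map,
// escaping backslashes and quotes so odd config values cannot break it.
public class StatusJson {
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(",");
            sb.append("\"").append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append("\"");
            first = false;
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        // Made-up example values; in the job these would come from jsc and Runtime.
        Map<String, String> status = new LinkedHashMap<>();
        status.put("AppName", "analytics");
        status.put("JavaMaxMemory", String.valueOf(Runtime.getRuntime().maxMemory()));
        System.out.println(toJson(status));
    }
}
```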
I run this check every 60 seconds, and it works fine until the context gets killed, at which point the following error appears in my spark-job-server.log:
[2017-02-19 06:37:33,639] ERROR ka.actor.OneForOneStrategy [] [akka://JobServer/user/context-supervisor/application_analytics] - Futures timed out after [3 seconds]
java.util.concurrent.TimeoutException: Futures timed out after [3 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread$$anon$3.block(ThreadPoolBuilder.scala:169)
at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.blockOn(ThreadPoolBuilder.scala:167)
at scala.concurrent.Await$.result(package.scala:190)
at spark.jobserver.JobManagerActor.startJobInternal(JobManagerActor.scala:219)
at spark.jobserver.JobManagerActor$$anonfun$wrappedReceive$1.applyOrElse(JobManagerActor.scala:157)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.ActorStack$$anonfun$receive$1.applyOrElse(ActorStack.scala:33)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.Slf4jLogging$$anonfun$receive$1$$anonfun$applyOrElse$1.apply$mcV$sp(Slf4jLogging.scala:25)
at spark.jobserver.common.akka.Slf4jLogging$class.spark$jobserver$common$akka$Slf4jLogging$$withAkkaSourceLogging(Slf4jLogging.scala:34)
at spark.jobserver.common.akka.Slf4jLogging$$anonfun$receive$1.applyOrElse(Slf4jLogging.scala:24)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.ActorMetrics$$anonfun$receive$1.applyOrElse(ActorMetrics.scala:23)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at spark.jobserver.common.akka.InstrumentedActor.aroundReceive(InstrumentedActor.scala:8)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[2017-02-19 06:37:33,639] ERROR .jobserver.JobManagerActor [] [] - About to restart actor due to exception:
java.util.concurrent.TimeoutException: Futures timed out after [3 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread$$anon$3.block(ThreadPoolBuilder.scala:169)
at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.blockOn(ThreadPoolBuilder.scala:167)
at scala.concurrent.Await$.result(package.scala:190)
at spark.jobserver.JobManagerActor.startJobInternal(JobManagerActor.scala:219)
at spark.jobserver.JobManagerActor$$anonfun$wrappedReceive$1.applyOrElse(JobManagerActor.scala:157)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.ActorStack$$anonfun$receive$1.applyOrElse(ActorStack.scala:33)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.Slf4jLogging$$anonfun$receive$1$$anonfun$applyOrElse$1.apply$mcV$sp(Slf4jLogging.scala:25)
at spark.jobserver.common.akka.Slf4jLogging$class.spark$jobserver$common$akka$Slf4jLogging$$withAkkaSourceLogging(Slf4jLogging.scala:34)
In the Spark worker log I can see that the worker killed the executor:
17/02/20 00:09:17 INFO Worker: Asked to kill executor app-20170218095729-0000/0
17/02/20 00:09:17 INFO ExecutorRunner: Runner thread for executor app-20170218095729-0000/0 interrupted
17/02/20 00:09:17 INFO ExecutorRunner: Killing process!
17/02/20 00:09:18 INFO Worker: Executor app-20170218095729-0000/0 finished with state KILLED exitStatus 0
17/02/20 00:09:18 INFO Worker: Cleaning up local directories for application app-20170218095729-0000
17/02/20 00:09:18 INFO ExternalShuffleBlockResolver: Application app-20170218095729-0000 removed, cleanupLocalDirs = true
17/02/20 00:09:18 INFO ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=app-20170218095729-0000, execId=0}'s 1 local dirs
I have another application running on the same server, so while I don't believe this is a memory problem, at times the CPUs may be heavily used by that other application. Usually that is not an issue, because the job server is mostly used during the day and the other application runs at night, so the load stays balanced.
My first thought was that problems like this are usually memory-related, so I allocated to each process the memory it needs. But my understanding is that if the CPUs are busy with the other application, that should only slow down job execution, not crash it. Or am I wrong?
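For reference, this is roughly how I pin resources per context through the job server's REST API. The `num-cpu-cores` and `memory-per-node` context-creation parameters and the `/jobs` endpoint come from the spark-jobserver documentation; the host, context name, and values below are placeholders:

```shell
# Create a context with explicit executor resources (values are placeholders).
curl -d "" 'localhost:8090/contexts/application_analytics?num-cpu-cores=2&memory-per-node=1g'

# The periodic health check then runs the test job synchronously against it.
curl -d "" 'localhost:8090/jobs?appName=test&classPath=TestJob&context=application_analytics&sync=true'
```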
What exactly does "Executor app-20170218095729-0000/0 finished with state KILLED exitStatus 0" mean?