Question

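I am testing my jobs with the Marathon scheduler and have observed that Marathon relaunches completed jobs when the Marathon service is restarted (systemctl restart marathon.service). I am not sure whether I am missing some configuration that would prevent this behavior. I want the job to run once and finish.

Test setup: I am running jobs on a Mesos cluster with Marathon as the scheduler. The job configuration is posted to Marathon through its REST API with the force=true flag.

The job should run once and finish.

Job JSON: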

{
  "id": "/test-job",
  "cmd": "/bin/ls",
  "cpus": 0.25,
  "mem": 100,
  "disk": 100,
  "instances": 1,
  "acceptedResourceRoles": [
    "mesos-workers"
  ],
  "labels": {
    "MARATHON_SINGLE_INSTANCE_APP": "true"
  },
  "portDefinitions": [],
  "user": "nobody",
  "backoffSeconds": 2147483647,
  "maxLaunchDelaySeconds": 2147483647
}
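
For context, here is a minimal sketch of how a definition like this might be submitted with force=true. It assumes the PUT /v2/apps/<id> endpoint of Marathon's REST API; the host name `marathon.example.com:8080` and the helper `build_put_request` are hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical Marathon endpoint; replace with the real host and port.
MARATHON_URL = "http://marathon.example.com:8080"


def build_put_request(base_url, app):
    """Build (but do not send) a PUT /v2/apps/<id>?force=true request."""
    app_id = app["id"].lstrip("/")
    url = f"{base_url}/v2/apps/{app_id}?force=true"
    body = json.dumps(app).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Abbreviated version of the job definition from the question.
job = {"id": "/test-job", "cmd": "/bin/ls", "cpus": 0.25, "mem": 100, "instances": 1}
request = build_put_request(MARATHON_URL, job)

# To actually submit the definition:
# urllib.request.urlopen(request)
```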

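I also tested restarting Marathon after adding the following parameters to the job definition, to see whether the upgrade strategy made a difference: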

  "upgradeStrategy": {
    "maximumOverCapacity": 0,
    "minimumHealthCapacity": 0
  }

Any help identifying what might be going wrong would be much appreciated.

Thanks!

Marathon logs after restarting the Marathon service:

Sep 25 20:45:04 10.162.217.171 marathon[2801]: [2018-09-25 20:45:04,878] INFO  removing matcher ActorOfferMatcher(Actor[akka://marathon/user/launchQueue/1/0-test-job#203351593]) (mesosphere.marathon.core
Sep 25 20:45:04 10.162.217.171 marathon[2801]: [2018-09-25 20:45:04,891] INFO  Processing LaunchEphemeral(Instance(instance [test-job.marathon-e18878ba-c103-11e8-a594-12d685c81d52],AgentInfo(10.162.147.2
Sep 25 20:45:04 10.162.217.171 marathon[2801]: [2018-09-25 20:45:04,905] INFO  Finished processing 1bf99832-7f87-4609-b591-8261ed4739eb-O630667 from 10.162.147.203. Matched 1 ops after 2 passes. First 10: cpus(
Sep 25 20:45:04 10.162.217.171 marathon[2801]: [2018-09-25 20:45:04,948] WARN  The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead. (org.apache.curato
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,113] INFO  Received status update for task test-job.e18878ba-c103-11e8-a594-12d685c81d52: TASK_STARTING () (mesosphere.marathon.Maratho
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,145] INFO  Acknowledge status update for task test-job.e18878ba-c103-11e8-a594-12d685c81d52: TASK_STARTING () (mesosphere.marathon.core
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,149] INFO  Received status update for task test-job.e18878ba-c103-11e8-a594-12d685c81d52: TASK_RUNNING () (mesosphere.marathon.Marathon
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,163] INFO  Acknowledge status update for task test-job.e18878ba-c103-11e8-a594-12d685c81d52: TASK_RUNNING () (mesosphere.marathon.core.
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,405] INFO  Received status update for task test-job.e18878ba-c103-11e8-a594-12d685c81d52: TASK_FINISHED (Command exited with status 0)
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,409] INFO  all tasks of instance [test-job.marathon-e18878ba-c103-11e8-a594-12d685c81d52] are terminal, requesting to expunge (mesosphe
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,426] INFO  Removed app [/test-job] from tracker (mesosphere.marathon.core.task.tracker.InstanceTracker$InstancesBySpec:marathon-akka.ac
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,430] INFO  receiveInstanceUpdate: instance [test-job.marathon-e18878ba-c103-11e8-a594-12d685c81d52] was deleted (Finished) (mesosphere.
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,432] INFO  initiating a scale check for runSpec [/test-job] due to [instance [test-job.marathon-e18878ba-c103-11e8-a594-12d685c8
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,433] INFO  Acknowledge status update for task test-job.e18878ba-c103-11e8-a594-12d685c81d52: TASK_FINISHED (Command exited with status
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,436] INFO  Increasing delay. Task launch delay for [/test-job - 2018-09-24T21:51:25.894Z] is set to 24855 days 3 hours 14 minutes 7 sec
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,437] INFO  Need to scale /test-job from 0 up to 1 instances (mesosphere.marathon.SchedulerActions:scheduler-actions-thread-0)
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,446] INFO  Stopped InstanceLauncherActor for /test-job version 2018-09-24T21:51:25.894Z (mesosphere.marathon.core.launchqueue.impl.Task
Sep 25 20:45:05 10.162.217.171 marathon[2801]: [2018-09-25 20:45:05,450] WARN  Got unexpected terminated for runSpec /test-job: Actor[akka://marathon/user/launchQueue/1/0-test-job#203351593] (meso

Answer 1

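Marathon is meant to be a framework for long-running applications/processes (hence the name Marathon). In other words, it is not intended for scheduled or one-off jobs/processes. To simplify, Marathon essentially does the following in an infinite loop for each application: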

IF    number of instances running != number of instances desired
THEN  launch/kill instances to make sure number of instances running == number of instances desired
ELSE  do nothing

So, regardless of whether Marathon is restarted, once the previous task finishes it will start a new one.
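
The loop above can be sketched in Python. This is only an illustration of the reconciliation idea, not Marathon's actual implementation; the function name `reconcile` and its return format are made up for the example. It mirrors the log line in the question, "Need to scale /test-job from 0 up to 1 instances".

```python
def reconcile(running, desired):
    """Return the action that moves `running` toward `desired`."""
    if running < desired:
        return ("launch", desired - running)
    if running > desired:
        return ("kill", running - desired)
    return ("noop", 0)


# A one-off task that exits leaves 0 running instances,
# while the app definition still asks for "instances": 1.
running = 0
desired = 1
print(reconcile(running, desired))  # ('launch', 1)
```

Because a finished task counts as a missing instance, the only way to stop the relaunch under this model is to change the desired count, for example by scaling the app to 0 instances or removing it.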

I suggest using a framework designed for launching jobs instead:

Chronos: https://mesos.github.io/chronos/

Cook: https://github.com/twosigma/Cook

Metronome: https://github.com/dcos/metronome

Why does the Marathon scheduler relaunch a deployment after the Marathon service is restarted?
