Question

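I have a situation where I want to execute a system process on each worker in Spark. I want this process to run once on each machine. Specifically, the process starts a daemon that needs to be running before the rest of my program executes. Ideally, this should happen before I read any data.

I am on Spark 2.0.2 and using dynamic allocation.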

Answer 1

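You can achieve this with a combination of a lazy val and a Spark broadcast. It would look something like the following. (I have not compiled the code below, so you may need to change a few things.)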

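object ProcessManager extends Serializable { // Serializable so the object can be broadcast
  // A lazy val is evaluated at most once per JVM, on first access.
  lazy val start: Unit = {
    // start your process here, e.g. with scala.sys.process
  }
}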

You can broadcast this object at the start of your application, before doing any transformations.

val pm = sc.broadcast(ProcessManager)

Now you can access this object inside a transformation, like any other broadcast variable, and call the lazy val.

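rdd.mapPartitions(itr => {
  pm.value.start // forces the lazy val; its body runs at most once per executor JVM
  // Other stuff here.
  itr // mapPartitions must return an iterator
})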
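To make the "at most once" behaviour concrete, here is a minimal sketch in plain Scala, without Spark, of what the lazy val does when several tasks in the same JVM touch it at the same time. The names `LazyStartDemo`, `launches` and `touchFromThreads` are hypothetical, and a counter stands in for launching the real process. Keep in mind that in Spark each executor is its own JVM, so the initializer would run once per executor rather than strictly once per machine.

```scala
import java.util.concurrent.atomic.AtomicInteger

object LazyStartDemo {
  // Counts how many times the initializer body actually runs.
  val launches = new AtomicInteger(0)

  // Stand-in for "start your process here". A lazy val body runs at most
  // once per JVM; concurrent first accesses are synchronized by Scala.
  lazy val start: Int = launches.incrementAndGet()

  // Simulates n tasks in the same JVM touching the lazy val concurrently
  // and returns how many times the initializer ran in total.
  def touchFromThreads(n: Int): Int = {
    val threads = (1 to n).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = { LazyStartDemo.start; () }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    launches.get()
  }
}
```

Calling `LazyStartDemo.touchFromThreads(8)` should report a single launch, no matter how many threads are used.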

Answer 2

An object with static initialization can invoke your system process.

object SparkStandIn extends App {
  object invokeSystemProcess {
    import sys.process._
    val errorCode = "echo Whatever you put in this object should be executed once per jvm".!

    def doIt(): Unit = {
      // This object is constructed once per JVM, but Scala objects are initialized lazily,
      // so something has to reference it; calling doIt() is enough to trigger that.
      // Another way to make sure instantiation happened is to check that errorCode does not represent an error.
    }
  }
  invokeSystemProcess.doIt()
  invokeSystemProcess.doIt() // even if doIt is invoked multiple times, the static initialization happens once
}
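Note that "once per JVM" means once per executor in Spark, so if several executors share a machine the process would be started once per executor. The following is only a sketch of one possible guard, assuming a marker file on the machine's local disk is acceptable. `OncePerMachine`, `runOnce` and the marker path are hypothetical names, not part of Spark. It relies on `java.nio.file.Files.createFile`, which checks for existence and creates the file as a single atomic operation.

```scala
import java.nio.file.{FileAlreadyExistsException, Files, Path}

object OncePerMachine {
  // Runs `action` only if this call is the one that creates `marker`.
  // Files.createFile is atomic, so at most one JVM on the machine wins.
  // Returns true if `action` ran in this call, false otherwise.
  def runOnce(marker: Path)(action: => Unit): Boolean = {
    val created =
      try {
        Files.createFile(marker)
        true
      } catch {
        case _: FileAlreadyExistsException => false
      }
    if (created) action
    created
  }
}
```

A stale marker left behind by an earlier run or a crashed daemon would suppress the start, so a real setup would need some cleanup or a liveness check on top of this.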

Answer 3

A specific answer for a specific use case: I have a cluster with 50 nodes and I wanted to know which nodes have the CET time zone set:

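import spark.implicits._ // needed for .toDS; assumes a SparkSession named `spark`

(1 to 100).toSeq.toDS.
  mapPartitions(itr => {
    // Runs once per partition. `lines` is deprecated since Scala 2.11 in favor of `lineStream`.
    sys.process.Process(
      Seq("bash", "-c", "echo $(hostname && date)")
    ).
    lines.
    toIterator
  }).
  collect().
  filter(_.contains(" CET ")).
  distinct.
  sorted.
  foreach(println)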

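Note that I cannot guarantee 100% that every node will get a partition, so even with a 100-element dataset on a 50-node cluster (as in the example above), the command may not run on every node.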
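Because of that caveat, it can help to check which nodes actually ran the command. Below is a small sketch of such a check, assuming you already know the set of hostnames you expect and that each collected line has the form "hostname followed by the date output", as produced by `echo $(hostname && date)`. The `Coverage` object and its method names are hypothetical helpers, not Spark API.

```scala
object Coverage {
  // Extracts the hostname (first whitespace-separated token) from each
  // collected output line and returns the distinct set of hosts seen.
  def hostsSeen(outputs: Seq[String]): Set[String] =
    outputs
      .flatMap(_.trim.split("\\s+").headOption)
      .filter(_.nonEmpty)
      .toSet

  // Hosts that were expected but produced no output line at all.
  def missing(expected: Set[String], outputs: Seq[String]): Set[String] =
    expected -- hostsSeen(outputs)
}
```

You would pass the array returned by `collect()` (before the CET filter) as `outputs`. If `missing` is non-empty, increasing the number of elements or partitions and rerunning is one option, though that still offers no hard guarantee.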

Is it possible to execute a command on all workers in Apache Spark?
