Question

Is it possible to run the following spark-submit script from code and then get the application ID assigned by YARN?

    bin/spark-submit \
      --class com.my.application.XApp \
      --master yarn-cluster --executor-memory 100m \
      --num-executors 50 \
      hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar \
      1000

This is so that users can start and stop jobs through a REST API.

I found this:

https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html

    import org.apache.spark.launcher.SparkLauncher;

    public class MyLauncher {
      public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
          .setAppResource("/my/app.jar")
          .setMainClass("my.spark.app.Main")
          .setMaster("local")
          .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
          .launch();
        spark.waitFor();
      }
    }

But I can't find a way to get the application ID, and it looks like app.jar has to be built before the code above is run?

Answer 1

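Yes, your application jar does need to be built beforehand in these cases. Something like Spark Job Server or IBM Spark Kernel may be closer to what you want (although they reuse the Spark context).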

Answer 2

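SparkLauncher only submits an application that you have already built. To get the application ID, you need to access the SparkContext inside the application jar.

In your example, you can read the application ID from inside "/my/app.jar" (probably in "my.spark.app.Main") with the following: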

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
...
val sc = new SparkContext(new SparkConf())
sc.applicationId

When the application is built and submitted in yarn-cluster mode, this application ID is the YARN application ID.

See the Spark Scala API docs.
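
If the launching code needs that ID without any change to the application itself, one workaround is to scan what spark-submit prints. The following is a minimal Java sketch, not an official Spark mechanism. It assumes the YARN client logs a line containing the ID (something like "Submitted application application_...") at the configured log level. The class and method names are hypothetical, and the sample text in `main` is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts a YARN application ID from spark-submit output.
final class YarnAppIdScraper {
    // YARN application IDs look like application_<clusterTimestamp>_<sequence>.
    private static final Pattern YARN_APP_ID = Pattern.compile("application_\\d+_\\d+");

    /** Returns the first YARN application ID found in a single log line, if any. */
    static Optional<String> findInLine(String line) {
        Matcher m = YARN_APP_ID.matcher(line);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    /** Reads lines until one contains a YARN application ID or the stream ends. */
    static Optional<String> findInOutput(Reader output) throws IOException {
        BufferedReader reader = new BufferedReader(output);
        String line;
        while ((line = reader.readLine()) != null) {
            Optional<String> id = findInLine(line);
            if (id.isPresent()) {
                return id;
            }
        }
        return Optional.empty();
    }

    /**
     * Assumed wiring: scan the stderr of a child process, for example the
     * java.lang.Process returned by SparkLauncher.launch().
     */
    static Optional<String> findInProcess(Process sparkSubmit) throws IOException {
        return findInOutput(
            new InputStreamReader(sparkSubmit.getErrorStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample that resembles YARN client log output.
        String sample = "INFO Client: Requesting a new application\n"
            + "INFO YarnClientImpl: Submitted application application_1433865536131_34483\n";
        // Prints application_1433865536131_34483
        System.out.println(findInOutput(new StringReader(sample)).orElse("not found"));
    }
}
```

In real use the caller should keep draining the process output after the ID is found, otherwise the child process can block once its output buffer fills up.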

Spark 1.6 (SPARK-8673) appears to support accessing the launched application. A Scala example derived from this test suite is shown below.

val handle = new SparkLauncher()
  ... // application configuration
  .setMaster("yarn-client")
  .startApplication()
try {
  handle.getAppId() should startWith ("application_")
  handle.stop()
} finally {
  handle.kill()
}

The handle for the launched application also exposes a listener API, which is the recommended way to monitor the launched application. See this pull request for details.

Answer 3

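The Scala API has SparkContext.applicationId, which is a unique identifier for the Spark application. Its format depends on the scheduler implementation (for example, 'local-1433865536131' for a local Spark application and 'application_1433865536131_34483' on YARN).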

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
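
Since the format differs by scheduler, code that receives an application ID may want to check which kind it got before treating it as a YARN ID. A minimal Java sketch, assuming only the two formats quoted above. The class name and the returned labels are hypothetical, and other schedulers may use formats this does not recognize.

```java
import java.util.regex.Pattern;

// Hypothetical helper: guesses the scheduler from the shape of an application ID.
final class AppIdFormat {
    // Matches IDs like local-1433865536131.
    private static final Pattern LOCAL = Pattern.compile("local-\\d+");
    // Matches IDs like application_1433865536131_34483.
    private static final Pattern YARN = Pattern.compile("application_\\d+_\\d+");

    /** Returns "yarn", "local", or "unknown" based on the ID's format alone. */
    static String describe(String applicationId) {
        if (YARN.matcher(applicationId).matches()) {
            return "yarn";
        }
        if (LOCAL.matcher(applicationId).matches()) {
            return "local";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(describe("local-1433865536131"));             // prints local
        System.out.println(describe("application_1433865536131_34483")); // prints yarn
    }
}
```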

Running spark-submit from Scala code
