Spark Launcher waiting for job completion infinitely

Asked: 2015-07-31 20:04:22

Tags: java apache-spark yarn spark-launcher

I am trying to submit a JAR with Spark job into the YARN cluster from Java code. I am using SparkLauncher to submit SparkPi example:

Process spark = new SparkLauncher()
    .setAppResource("C:\\spark-1.4.1-bin-hadoop2.6\\lib\\spark-examples-1.4.1-hadoop2.6.0.jar")
    .setMainClass("org.apache.spark.examples.SparkPi")
    .setMaster("yarn-cluster")
    .launch();
System.out.println("Waiting for finish...");
int exitCode = spark.waitFor();
System.out.println("Finished! Exit code:" + exitCode);

There are two problems:

  1. When submitting in "yarn-cluster" mode, the application is successfully submitted to YARN and executes successfully (it is visible in the YARN UI, reported as SUCCESS, and pi is printed in the output). However, the submitting application is never notified that processing has finished: it hangs indefinitely after printing "Waiting for finish...". The log of the container can be found here
  2. When submitting in "yarn-client" mode, the application does not appear in the YARN UI and the submitting application hangs at "Waiting for finish...". When the hanging process is killed, the application shows up in the YARN UI and is reported as SUCCESS, but the output is empty (pi is not printed). The log of the container can be found here

I tried executing the submitting application with both Oracle Java 7 and Java 8.

3 Answers:

Answer 0 (score: 17)

I got help on the Spark mailing list. The key is to read and drain getInputStream() and getErrorStream() on the Process object: the child process can fill its output pipe buffers and deadlock (see the Oracle docs regarding Process). The streams should be read in separate threads:

Process spark = new SparkLauncher()
    .setSparkHome("C:\\spark-1.4.1-bin-hadoop2.6")
    .setAppResource("C:\\spark-1.4.1-bin-hadoop2.6\\lib\\spark-examples-1.4.1-hadoop2.6.0.jar")
    .setMainClass("org.apache.spark.examples.SparkPi").setMaster("yarn-cluster").launch();

InputStreamReaderRunnable inputStreamReaderRunnable = new InputStreamReaderRunnable(spark.getInputStream(), "input");
Thread inputThread = new Thread(inputStreamReaderRunnable, "LogStreamReader input");
inputThread.start();

InputStreamReaderRunnable errorStreamReaderRunnable = new InputStreamReaderRunnable(spark.getErrorStream(), "error");
Thread errorThread = new Thread(errorStreamReaderRunnable, "LogStreamReader error");
errorThread.start();

System.out.println("Waiting for finish...");
int exitCode = spark.waitFor();
System.out.println("Finished! Exit code:" + exitCode);

where the InputStreamReaderRunnable class is:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class InputStreamReaderRunnable implements Runnable {

    private final BufferedReader reader;

    private final String name;

    public InputStreamReaderRunnable(InputStream is, String name) {
        this.reader = new BufferedReader(new InputStreamReader(is));
        this.name = name;
    }

    public void run() {
        System.out.println("InputStream " + name + ":");
        try {
            String line = reader.readLine();
            while (line != null) {
                System.out.println(line);
                line = reader.readLine();
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
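This applies to processes you start yourself rather than the one SparkLauncher.launch() hands back (that Process is already configured and started), but as a side note: for child processes launched directly via ProcessBuilder, the same pipe-buffer deadlock can be sidestepped entirely with inheritIO() (JDK 7+), which wires the child's stdout/stderr to the parent's console so no buffer can fill up. A minimal sketch, using java -version purely as a hypothetical stand-in for the spark-submit child process:

```java
import java.io.IOException;

public class InheritIoDemo {

    // Starts a child process whose stdout/stderr go straight to this JVM's
    // console, so no pipe buffer can fill up and block the child.
    static int runInherited(String... command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).inheritIO().start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // "java -version" stands in for the spark-submit process here;
        // its version banner appears on the parent's stderr.
        int exitCode = runInherited("java", "-version");
        System.out.println("Exit code: " + exitCode);
    }
}
```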

Answer 1 (score: 8)

Since this is an old post, I would like to add an update that may help anyone who reads it later. In Spark 1.6.0 some functions were added to the SparkLauncher class, notably:

def startApplication(listeners: SparkAppHandle.Listener*): SparkAppHandle

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.launcher.SparkLauncher

You can run the application without needing extra threads for stdout and stderr handling, plus you get good status reporting for the running application. Use this code:

import scala.collection.JavaConverters._

val env = Map(
  "HADOOP_CONF_DIR" -> hadoopConfDir,
  "YARN_CONF_DIR" -> yarnConfDir
)
val handle = new SparkLauncher(env.asJava)
  .setSparkHome(sparkHome)
  .setAppResource("Jar/location/.jar")
  .setMainClass("path.to.the.main.class")
  .setMaster("yarn-client")
  .setConf("spark.app.id", "AppID if you have one")
  .setConf("spark.driver.memory", "8g")
  .setConf("spark.akka.frameSize", "200")
  .setConf("spark.executor.memory", "2g")
  .setConf("spark.executor.instances", "32")
  .setConf("spark.executor.cores", "32")
  .setConf("spark.default.parallelism", "100")
  .setConf("spark.driver.allowMultipleContexts", "true")
  .setVerbose(true)
  .startApplication()

println(handle.getAppId)
println(handle.getState)

You can keep checking the state as the Spark application runs. For information on how the Spark launcher server works in 1.6.0, see this link: https://github.com/apache/spark/blob/v1.6.0/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
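The handle-based API reports state transitions through listeners; the other common pattern is to poll the handle until its state is final. A Spark-free sketch of that polling loop, where the nested AppState enum is a hypothetical stand-in for SparkAppHandle.State (a live cluster would be needed for the real thing):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class StatePollDemo {

    // Hypothetical stand-in for SparkAppHandle.State; the real enum also
    // exposes an isFinal() method.
    enum AppState {
        CONNECTED(false), RUNNING(false), FINISHED(true), FAILED(true);

        private final boolean finalState;

        AppState(boolean finalState) { this.finalState = finalState; }

        boolean isFinal() { return finalState; }
    }

    // Polls the shared state until it becomes final or the timeout elapses,
    // mirroring a loop around handle.getState().isFinal().
    static AppState waitForFinalState(AtomicReference<AppState> state, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline && !state.get().isFinal()) {
            TimeUnit.MILLISECONDS.sleep(10);
        }
        return state.get();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicReference<AppState> state = new AtomicReference<>(AppState.RUNNING);
        // Simulate the cluster flipping the job to FINISHED after 100 ms.
        new Thread(() -> {
            try { TimeUnit.MILLISECONDS.sleep(100); } catch (InterruptedException ignored) {}
            state.set(AppState.FINISHED);
        }).start();
        System.out.println("Final state: " + waitForFinalState(state, 5000));
    }
}
```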

Answer 2 (score: 3)

I used a CountDownLatch-based implementation, and it works as expected. This is for SparkLauncher version 2.0.1, and it also works in yarn-cluster mode.

...
final CountDownLatch countDownLatch = new CountDownLatch(1);
SparkAppListener sparkAppListener = new SparkAppListener(countDownLatch);
SparkAppHandle appHandle = sparkLauncher.startApplication(sparkAppListener);
Thread sparkAppListenerThread = new Thread(sparkAppListener);
sparkAppListenerThread.start();
long timeout = 120;
countDownLatch.await(timeout, TimeUnit.SECONDS);
...

private static class SparkAppListener implements SparkAppHandle.Listener, Runnable {
    private static final Log log = LogFactory.getLog(SparkAppListener.class);
    private final CountDownLatch countDownLatch;
    public SparkAppListener(CountDownLatch countDownLatch) {
        this.countDownLatch = countDownLatch;
    }
    @Override
    public void stateChanged(SparkAppHandle handle) {
        String sparkAppId = handle.getAppId();
        State appState = handle.getState();
        if (sparkAppId != null) {
            log.info("Spark job with app id: " + sparkAppId + ",\t State changed to: " + appState + " - "
                    + SPARK_STATE_MSG.get(appState));
        } else {
            log.info("Spark job's state changed to: " + appState + " - " + SPARK_STATE_MSG.get(appState));
        }
        if (appState != null && appState.isFinal()) {
            countDownLatch.countDown();
        }
    }
    @Override
    public void infoChanged(SparkAppHandle handle) {}
    @Override
    public void run() {}
}
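One subtlety worth noting in the snippet above: countDownLatch.await(timeout, unit) returns a boolean that the code discards, so a timed-out wait and a completed job look identical to the caller. A minimal, Spark-free sketch of checking that return value:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AwaitDemo {

    // Wraps latch.await so the boolean result is not silently discarded:
    // true means the latch reached zero, false means the wait timed out.
    static boolean awaitWithTimeout(CountDownLatch latch, long ms) throws InterruptedException {
        return latch.await(ms, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // Nobody counts this latch down, so await times out and returns false.
        CountDownLatch neverReleased = new CountDownLatch(1);
        System.out.println("finished=" + awaitWithTimeout(neverReleased, 200));

        // Here a worker counts down, so await returns true before the timeout.
        CountDownLatch released = new CountDownLatch(1);
        new Thread(released::countDown).start();
        System.out.println("finished=" + awaitWithTimeout(released, 5000));
    }
}
```

In the answer's code, checking this boolean (and the listener's last observed state) is how you would distinguish "job reached a final state" from "gave up after 120 seconds".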